One of the most exciting new developments in the Polars world has been the release of the Polars GPU engine. With the GPU engine Polars users can accelerate their data pipelines with a tiny modification to their existing code. The GPU engine was announced by the developers in this blog post. In this post I look at what the use cases are for the GPU engine and how you can check if it will work for your pipelines.
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course
How does the GPU engine work in a nutshell?
While the Polars blog post linked to above does a great job of explaining how things work in detail, I thought a simplified summary would be useful. Essentially:
- you write a Polars lazy query as normal
- you execute the query with `collect(engine="gpu")`
- your query is passed to the Polars query optimizer which optimizes it and creates a so-called Intermediate Representation (IR) of the optimized query plan
- the IR is checked by the cuDF library to confirm if it supports the operations contained in the optimized plan
- if cuDF confirms it can support the operations then cuDF executes its version of the plan on the GPU
- when cuDF finishes execution it transfers the data back from the GPU to create a normal Polars `DataFrame`
- if cuDF cannot support the operations in the plan then Polars executes the plan with its standard Rust query engine on the CPU
I’ve sketched out the basic flow in this diagram.
If the GPU engine can execute the query then the full query is executed on the GPU.
If you want to confirm that your query is executed by the GPU engine you can use the following approach:
- turn on verbose mode using `pl.Config`
- run your query with `collect(engine="gpu")`
If cuDF can't execute your query then a `PerformanceWarning` is printed along with the operations that cannot currently be implemented by the GPU engine.
What are the use cases for the Polars GPU engine?
The core use case for the GPU engine is to reduce computation time compared to working on a CPU. The GPU engine gives major performance benefits over processing on a CPU through enhanced parallelisation. Typically the greatest improvements over the CPU engine occur when the bottleneck for your query is computation rather than reading or writing a file.
Typically, operations that can be highly parallelised benefit most from GPU processing compared to CPU processing.
The most important operations for a typical `DataFrame` user that benefit from GPU processing are `group_by` and `join`, as the computationally-intensive step of searching rows to find matching group keys or join keys can be highly parallelised.
The GPU engine is based on the cuDF library which is a dataframe library written in CUDA to run on NVIDIA GPUs.
The GPU engine is not (at this stage) geared towards working with larger-than-memory datasets. In fact, a GPU typically has less memory than the RAM available to the CPUs on a machine, and so you may hit out-of-memory errors sooner with the GPU engine than with the standard engine. At this point we cannot use the Polars streaming engine (for processing larger-than-memory datasets in batches) in conjunction with the GPU engine.
How does the GPU engine work?
Let’s compare what happens with a lazy query with the standard CPU engine and the GPU engine. First we run with the standard engine
```python
import polars as pl

df = (
    pl.scan_parquet('transactions.parquet')
    .group_by('customer_id')
    .agg(pl.col('amount').mean())
    .collect()
)
```
When we evaluate this query with `collect`, Polars runs its Rust-based query optimizer to build the optimized query plan. This plan is then executed by the Polars Rust-based query engine on the CPU and the output is a Polars `DataFrame`.
Now we evaluate this query with the GPU engine by calling `.collect(engine="gpu")`.
```python
import polars as pl

df = (
    pl.scan_parquet('transactions.parquet')
    .group_by('customer_id')
    .agg(pl.col('amount').mean())
    .collect(engine="gpu")
)
```
As before Polars runs its Rust-based query optimizer to build the optimized query plan. But this query plan is then handed off to the cuDF query engine. The cuDF query engine first confirms that it can execute the plan developed by Polars before executing the plan on the GPU. When cuDF is finished it passes the output data back to the CPU in Apache Arrow format and Polars creates a `DataFrame` from this data. From the perspective of the user there is no difference in the output they get.
To be clear, with the GPU engine the aim is for the entire query to be handled by RAPIDS, including the I/O from the Parquet file. While I haven't tested it yet, this may mean that reading from cloud storage works differently with the GPU engine, as things like authentication need to be handled as well.
In the example above we work with a Parquet file via `pl.scan_parquet`. At present we can also do `pl.scan_csv` and `pl.scan_ndjson` and hope for GPU execution.
What if the GPU engine can’t execute the plan?
At this early stage not every operation in the Polars API will be supported by cuDF. In this case cuDF will tell Polars that it can't execute the plan and so Polars will then try to execute the plan using its own query engine. At present this is all-or-nothing: either cuDF executes the whole plan or none of it. In future the Polars team would like to support partial execution.
At this early stage I don't think it's possible to determine in advance which queries will run on the GPU engine, beyond running them with verbose mode enabled or comparing query times in the standard and GPU engines.
Get in touch on social media if you have any questions or comments.
Next steps
Want to know more about Polars for high performance data science? Then you can: