
Using the Polars GPU engine

Polars now has a GPU engine, developed with the RAPIDS team at NVIDIA and announced by the Polars team in this blog post. In this post I try to give a less technical introduction to the GPU engine and answer some questions I had after reading that announcement: what is the use case for the GPU engine, which operations does it help with, and how does it work? Read on to learn more…

Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course

What are the use cases and non-use cases for the GPU engine?

The use case for the GPU engine is to accelerate data processing compared to working on a CPU. Typically, operations that can be highly parallelised benefit most from GPU processing. The most important such operations for a typical DataFrame user are group_by and join, as the step of searching the rows for matching keys can be highly parallelised.

Benchmarking results show that there may be no benefit to using the GPU engine for small queries (where we take small to mean anything that runs in a few seconds or less), as there is some additional overhead in using the GPU. The relative performance benefit of the GPU engine also seems to increase with the scale of the data, as larger datasets take fuller advantage of the GPU's parallelism.

For Parquet files in particular, the GPU engine works better with bigger row group and page sizes. See my post here on the importance of row groups in Parquet files. When creating a Parquet file you can change the row group size from its default of 512^2 ≈ 256k rows with the row_group_size argument to write_parquet.

The GPU engine works only on NVIDIA GPUs. It does not work on other kinds of GPUs, e.g. those in an M-series Mac, and likely never will, as the processing is based on CUDA, which runs only on NVIDIA cards.

The GPU engine is not (at this stage) geared towards working with larger-than-memory datasets. In fact, a GPU may have less memory than the machine's main RAM, so you may hit out-of-memory errors sooner with the GPU engine than with the standard engine. At this point we cannot use the Polars streaming engine (for processing larger-than-memory datasets in batches) in conjunction with the GPU engine.

So has Polars written their own CUDA implementation?

No - this would take years! Instead, the Polars devs have partnered with the developers of the existing cuDF library. When your Polars code is executed on the GPU, it happens via the cuDF package.

How does the GPU engine work?

Let’s compare what happens with a lazy query using the standard CPU engine and the GPU engine. First we run with the standard engine:

```python
df = (
    pl.scan_parquet('transactions.parquet')
    .group_by('customer_id')
    .agg(pl.col('amount').mean())
    .collect()
)
```

When we evaluate this query with collect, Polars runs its Rust-based query optimizer to build the optimized query plan. This plan is then executed by the Polars Rust-based query engine on the CPU, and the output is a Polars DataFrame.

Now we evaluate the same query with the GPU engine by calling .collect(engine="gpu"):

```python
df = (
    pl.scan_parquet('transactions.parquet')
    .group_by('customer_id')
    .agg(pl.col('amount').mean())
    .collect(engine="gpu")
)
```

As before, Polars runs its Rust-based query optimizer to build the optimized query plan. But this time the plan is handed off to the cuDF query engine, which first confirms that it can execute the plan developed by Polars before executing it on the GPU. When cuDF is finished, it passes the output data back to the CPU in Apache Arrow format, and Polars creates a DataFrame from this data. From the user's perspective there is no difference in the output they get.

To be clear, with the GPU engine the aim is for the entire query to be handled by RAPIDS, including the I/O from the Parquet file. While I haven’t tested it yet, this may mean that things like reading from the cloud work differently with the GPU engine, as concerns such as authentication need to be handled by RAPIDS as well.

In the example above we work with a Parquet file via pl.scan_parquet. At present we can also use pl.scan_csv and pl.scan_ndjson and still get GPU execution.

What if the GPU engine can’t execute the plan?

At this early stage not every operation in the Polars API is supported by cuDF. In that case cuDF tells Polars that it can’t execute the plan, and Polars falls back to executing the plan with its own query engine. At present this is all-or-nothing: either cuDF executes the whole plan or none of it. In future the Polars team would like to support partial execution.
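If I understand the API correctly, you can make this fallback explicit rather than silent by passing a pl.GPUEngine configuration with raise_on_fail=True, which raises an error instead of falling back to the CPU when cuDF can’t execute the plan. The snippet below is an untested configuration sketch, since actually collecting with it requires an NVIDIA GPU and the Polars GPU support installed:

```python
import polars as pl

# Untested sketch: configure the GPU engine to raise an error instead of
# silently falling back to the CPU when cuDF can't execute the plan
gpu = pl.GPUEngine(raise_on_fail=True)

# On a machine with an NVIDIA GPU you would then call:
# df = lazy_query.collect(engine=gpu)
```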

At this early stage I don’t think it’s possible to determine which queries will run on the GPU engine beyond comparing query time in the standard and GPU engines.

Get in touch on social media if you have any questions or comments.


Next steps

Want to know more about Polars for high performance data science? Then you can:

This post is licensed under CC BY 4.0 by the author.