Polars now allows you to write Parquet files even when the data is too large to fit in memory. It does this by using streaming to process the data in batches and then writing each batch to a Parquet file with a method called sink_parquet.
Streaming
In recent months Polars has added a streaming approach where larger-than-memory datasets are processed in batches small enough to fit into memory. If you aren't familiar with streaming, see this video where I process a 30 GB dataset on a not-very-impressive laptop.
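As a minimal sketch of what a streaming query looks like in Python (the file name and column here are placeholders, not from the original example): passing streaming=True to collect asks Polars to evaluate the query in batches.

import polars as pl

# Scan the Parquet file lazily and evaluate the query in batches with
# the streaming engine ("large.parquet" is a placeholder file name)
df = (
    pl.scan_parquet("large.parquet")
    .filter(pl.col("tip_amount") > 0)
    .collect(streaming=True)
)

Note that the result of collect must still fit in memory as a DataFrame; this is the limitation that sinking, covered next, removes.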
Sinking
However, a limitation of streaming has been that the output of the query must fit in memory as a DataFrame.
Polars now has a sink_parquet method that writes the output of a streaming query to a Parquet file. This means that you can process large datasets on a laptop even if the output of your query doesn't fit in memory.
In this example we process a large Parquet file in lazy mode and write the output to another Parquet file with sink_parquet:
import polars as pl

(
    # Scan the input Parquet file lazily instead of reading it into memory
    pl.scan_parquet(large_parquet_file_path)
    # Group the rows by passenger_count
    .groupby("passenger_count")
    # Collect the tip_amount values for each group
    .agg(
        pl.col("tip_amount")
    )
    # Evaluate the query in batches and stream the output to a Parquet file
    .sink_parquet(output_parquet_file_path)
)
Unlike a normal lazy query, we evaluate the query and write the output by calling sink_parquet instead of collect.
Converting CSV files
One great use case for sink_parquet is converting one or more large CSV files to Parquet so the data is much faster to work with. This process can be tedious and time-consuming to do manually, but with Polars we can do it as follows:
(
    # Scan all CSV files matching the glob pattern lazily
    pl.scan_csv("*.csv")
    # Stream the combined data out to a single Parquet file
    .sink_parquet(output_parquet_file_path)
)
There are some limitations to be aware of. Streaming does not work for all operations in Polars, so you may get an out-of-memory exception if your query does not support streaming.
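One way to check whether your query can run on the streaming engine is to inspect the query plan. In recent versions of Polars you can pass streaming=True to explain and look for the sections of the plan marked as streaming; this is version-dependent, so treat the following as a sketch (the file name is a placeholder):

import polars as pl

# Print the optimized query plan; with streaming=True the parts of the
# plan that can run on the streaming engine are labelled in the output
print(
    pl.scan_parquet("large.parquet")
    .groupby("passenger_count")
    .agg(pl.col("tip_amount"))
    .explain(streaming=True)
)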
You can learn more about sink_parquet in the official docs, including options to set the compression level and row group statistics.
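As a sketch of what those options look like (the parameter values here are illustrative, and the defaults may differ between Polars versions):

(
    pl.scan_csv("*.csv")
    .sink_parquet(
        output_parquet_file_path,
        compression="zstd",       # compression codec
        compression_level=10,     # codec-specific compression level
        statistics=True,          # write row group statistics
        row_group_size=100_000,   # maximum rows per row group
    )
)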
Found this useful? Check out my Data Analysis with Polars course on Udemy with a half-price discount.
Next steps
Want to know more about Polars for high performance data science and ML? Then you can: