Polars now allows you to write Parquet files even when the data is too large to fit in memory. It does this by using streaming to process the data in batches and then writing each batch to a Parquet file with a method called sink_parquet.
Streaming
In recent months Polars has added a streaming approach where larger-than-memory datasets are processed in batches small enough to fit into memory. If you aren't familiar with streaming, see this video where I process a 30 GB dataset on a not-very-impressive laptop.
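As a minimal sketch of what a streaming query looks like in Python (the file name and column here are placeholders, not from the original example): passing streaming=True to collect asks Polars to evaluate the query in batches.

import polars as pl

# Scan the Parquet file lazily and evaluate the query in batches with
# the streaming engine ("large.parquet" is a placeholder file name)
df = (
    pl.scan_parquet("large.parquet")
    .filter(pl.col("tip_amount") > 0)
    .collect(streaming=True)
)

Note that the result of collect must still fit in memory as a DataFrame; this is the limitation that sinking, covered next, removes.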
Sinking
However, a limitation of streaming has been that the output of the query must fit in memory as a DataFrame.
Polars now has a sink_parquet method that writes the output of a streaming query to a Parquet file. This means that you can process large datasets on a laptop even if the output of your query doesn't fit in memory.
In this example we process a large Parquet file in lazy mode and write the output to another Parquet file with sink_parquet:
import polars as pl

(
    # Scan the input Parquet file lazily instead of reading it into memory
    pl.scan_parquet(large_parquet_file_path)
    # Group the rows by passenger_count
    .groupby("passenger_count")
    # Collect the tip_amount values for each group
    .agg(
        pl.col("tip_amount")
    )
    # Evaluate the query in batches and stream the output to a Parquet file
    .sink_parquet(output_parquet_file_path)
)
Unlike a normal lazy query, we evaluate the query and write the output by calling sink_parquet instead of collect.
Converting CSV files
One great use case for sink_parquet is converting one or more large CSV files to Parquet so the data is much faster to work with. This process can be tedious and time-consuming to do manually, but with Polars we can do it as follows:
(
    # Scan all CSV files matching the glob pattern lazily
    pl.scan_csv("*.csv")
    # Stream the combined data out to a single Parquet file
    .sink_parquet(output_parquet_file_path)
)
There are some limitations to be aware of. Streaming does not work for all operations in Polars, so you may get an out-of-memory exception if your query does not support streaming.
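One way to check whether your query can run on the streaming engine is to inspect the query plan. In recent versions of Polars you can pass streaming=True to explain and look for the sections of the plan marked as streaming; this is version-dependent, so treat the following as a sketch (the file name is a placeholder):

import polars as pl

# Print the optimized query plan; with streaming=True the parts of the
# plan that can run on the streaming engine are labelled in the output
print(
    pl.scan_parquet("large.parquet")
    .groupby("passenger_count")
    .agg(pl.col("tip_amount"))
    .explain(streaming=True)
)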
You can learn more about sink_parquet in the official docs, including options to set the compression level and row group statistics.
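As a sketch of what those options look like (the parameter values here are illustrative, and the defaults may differ between Polars versions):

(
    pl.scan_csv("*.csv")
    .sink_parquet(
        output_parquet_file_path,
        compression="zstd",       # compression codec
        compression_level=10,     # codec-specific compression level
        statistics=True,          # write row group statistics
        row_group_size=100_000,   # maximum rows per row group
    )
)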
Found this useful? Check out my Data Analysis with Polars course on Udemy with a half-price discount.
Next steps
Want to know more about Polars for high performance data science and ML? Then you can: