Polars can handle larger-than-memory datasets with its streaming mode. In this mode Polars processes your data in batches rather than all at once. However, streaming mode is not an emergency switch that you should only hit when you run out of memory. For many queries streaming mode is as quick as or quicker than non-streaming mode.
This means it is worth keeping streaming switched on if you are working with larger datasets, particularly if you are building pipelines that you want to be ready for larger datasets in the future.
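To make the idea of processing in batches concrete, here is a minimal pure-Python sketch of a grouped mean computed one batch at a time. It is an illustration of the general idea only, not how Polars implements streaming internally, and the names batches and streaming_group_mean are hypothetical helpers invented for this example.

```python
from collections import defaultdict


def batches(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def streaming_group_mean(rows, batch_size):
    """Average the values for each key in (key, value) rows, one batch at a time."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for batch in batches(rows, batch_size):
        # Only the running sums and counts are kept between batches,
        # so the state stays small however many rows pass through.
        for key, value in batch:
            totals[key] += value
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


rows = [(1, 2.0), (2, 10.0), (1, 4.0), (2, 20.0), (1, 6.0)]
means = streaming_group_mean(rows, batch_size=2)
print(means)
```

The result does not depend on the batch size, which is why a grouped aggregation like this one is a natural fit for batch-wise processing.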
Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course
Simple example of streaming and non-streaming
To work in streaming mode we simply pass the streaming=True argument to collect when we evaluate a query.
First we create a DataFrame with 1 million rows, 100 floating point columns and an integer ID column.
import polars as pl
import numpy as np

N = 1_000_000
K = 100
df = (
    pl.DataFrame(
        np.random.standard_normal((N, K))
    )
    # Add an ID column
    .hstack(
        pl.DataFrame(
            np.random.randint(0, 9, (N, 1))
        )
        .rename(
            {'column_0': 'id'}
        )
    )
)
We then do a groupby on the id column and take the mean of the remaining columns. We execute the query in streaming mode with the streaming=True argument.
(
    df
    .lazy()
    .groupby('id')
    .agg(
        pl.all().mean()
    )
    .collect(
        streaming=True
    )
)
Comparing this query with streaming=True and streaming=False (the default), I get an average of 75 ms for streaming and 120 ms for non-streaming. For comparison, the same query takes about 330 ms in Pandas.
Takeaway
For many queries running in streaming mode may be a great default - rather than an emergency button that should only be hit when you are struggling with memory.
I’m not going to guarantee that streaming will always be at least as fast as non-streaming, though. This is still a developing technology within Polars and there are surely use cases where streaming will be significantly slower. If you find such a case you are very welcome to discuss it on the Polars Discord.
Also note that streaming is not supported for all operations in lazy mode at this point, but it does work for core operations such as groupby and join.
For more on streaming check out these other posts:
- writing large queries to Parquet with streaming
- working with multiple large files
or this video where I process a 30 GB dataset on a not-very-impressive laptop.
Next steps
Want to know more about Polars for high performance data science? Then you can: