In a previous post we saw that Polars has fast-track algorithms for calculating some statistics on sorted data. In this post we see that Polars also has a fast-track algorithm for getting groupby keys on a sorted column.
Performance comparison
In this example we create a DataFrame with an ID column that we know is sorted. When generating the DataFrame below we set N = 10_000_000 to create a DataFrame with 10 million rows. We also set the cardinality to be 10 - this means that there are 10 distinct IDs that we group on. The cardinality is a key variable, as we see below.
import polars as pl
import numpy as np

N = 10_000_000
cardinality = 10
# Create a sorted array of id integers
sorted_array = np.sort(np.random.randint(0, cardinality, N))
# Create a DataFrame with a sorted ID column and a random values column
df = pl.DataFrame(
    {
        "id": [i for i in sorted_array],
        "values": np.random.standard_normal(N),
    }
)
For this comparison we groupby on the id column and take the mean of the values column.
(
    df
    .groupby("id")
    .agg(
        pl.col("values").mean()
    )
)
At this point Polars doesn’t know that the id column is sorted and so won’t use the fast-track algorithm. If we do the groupby-aggregation above using the standard non-sorted algorithm it takes 70 ms.
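If you want to reproduce these timings yourself, here is a minimal sketch using Python's time.perf_counter, continuing from the DataFrame built above (the absolute numbers will of course vary with your machine; %timeit in a notebook gives more robust estimates):

import time

# Time one run of the groupby-aggregation on the frame built above
start = time.perf_counter()
df.groupby("id").agg(pl.col("values").mean())
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"groupby-aggregation took {elapsed_ms:.0f} ms")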
As we saw in the previous post, we tell Polars that the id column is sorted using the set_sorted expression.
df = (
    df
    .with_column(
        pl.col("id").set_sorted()
    )
)
If we run the groupby-aggregation again on the column that Polars knows is sorted it takes 14 ms - that is 5 times faster than the standard algorithm.
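If you want to confirm that the sorted flag has been set, one way (assuming a reasonably recent Polars version) is to inspect the flags attribute on the Series:

# The SORTED_ASC flag should now be True for the id column
print(df.get_column("id").flags)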
Cardinality is king
So we see that groupby-aggregations on sorted columns are much faster when Polars knows the column is sorted, right? Well, it’s not quite that simple…
In the example above we set cardinality = 10 so in the 10 million rows there are 10 distinct group IDs.
If we create a new DataFrame with cardinality = 100_000 we find that the normal algorithm takes 90 ms and the fast-track sorted algorithm takes 60 ms. In this case the fast-track algorithm is just 1.5 times faster.
If we set cardinality = 1_000_000 we find the two approaches take a similar amount of time (or the normal approach is slightly faster).
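For reference, the higher-cardinality experiments are the same setup with just the cardinality changed - a sketch of the cardinality = 100_000 case (df_high is a name made up for this example):

cardinality = 100_000
sorted_array = np.sort(np.random.randint(0, cardinality, N))
# Rebuild the DataFrame and mark the id column as sorted before re-timing
df_high = pl.DataFrame(
    {
        "id": [i for i in sorted_array],
        "values": np.random.standard_normal(N),
    }
).with_column(pl.col("id").set_sorted())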
The reason for this convergence is that the fast-track algorithm doesn’t look through the id column element-by-element but instead tries to find chunks of the id column with the same value at the start and end. When the algorithm finds such a chunk it can assume the whole chunk has the same value because it knows the column is sorted.
When cardinality is low there are many such uniform chunks and the fast-track algorithm can speed through the id column. As cardinality increases the length of the uniform chunks decreases until eventually the fast-track algorithm is essentially doing an element-by-element search. At that point the fast-track algorithm is still attempting jumps through the column before falling back to the element-by-element search, so the sorted approach may even be slower when the cardinality is high.
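To make the idea concrete, here is a hypothetical pure-Python/NumPy sketch of this chunk-jumping strategy. This is an illustration of the principle only, not Polars' actual (Rust) implementation - the function name and chunk_size are made up for the example:

import numpy as np

def sorted_group_starts(ids: np.ndarray, chunk_size: int = 1024) -> list:
    # Find the start index of every group in a sorted array by jumping
    # through it chunk by chunk instead of scanning element-by-element.
    starts = [0]
    n = len(ids)
    i = 0
    while i < n:
        end = min(i + chunk_size, n)
        if ids[i] == ids[end - 1]:
            # First and last values match, so (because the array is
            # sorted) the whole chunk is one group: skip it in one jump
            i = end
        else:
            # The chunk contains at least one group boundary:
            # fall back to an element-by-element scan
            for j in range(i + 1, end):
                if ids[j] != ids[j - 1]:
                    starts.append(j)
            i = end
        # Also check the boundary between this chunk and the next one
        if i < n and ids[i] != ids[i - 1]:
            starts.append(i)
    return starts

ids = np.sort(np.random.randint(0, 10, 1_000_000))
print(len(sorted_group_starts(ids)))  # low cardinality: 10 group starts

With low cardinality most chunks pass the first/last check and the loop takes large jumps. With high cardinality almost every chunk fails the check, so the function does the full element-by-element scan plus the overhead of the failed checks - which is why the sorted path can end up slower than the standard algorithm.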
With this caveat in mind, this is still a powerful optimisation that you can apply if you know your data is sorted and the cardinality isn’t too high. I also think that there are further optimisations that could be applied to speed things up when cardinality is higher.
Found this useful? Get in touch if you would like to discuss how your pipelines can take full advantage of Polars’ speed and scalability.
Or check out my Data Analysis with Polars course with a half price discount.
Next steps
Want to know more about Polars for high performance data science and ML? Then you can: