Polars (and Apache Arrow) has been designed to be careful with your data so you don’t get surprises like the following Pandas code, where the ints column has been cast to float because of the missing value:
import pandas as pd

df = pd.DataFrame({'ints':[None,1,2],'strings':['a','b','c']})
print(df)
   ints strings
0   NaN       a
1   1.0       b
2   2.0       c
However, every big library will do something that some users won’t expect. These are commonly referred to as gotchas. In this post we explore a few of the gotchas relating to the ordering of outputs from group_by and unique that I found while writing my course.
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course
Ordering of groupby
Let’s define a simple DataFrame and do a group_by aggregation:
import polars as pl

df = pl.DataFrame(
{
"color": ["red", "green", "green", "red", "red"],
"value": [0, 1, 2, 3, 4],
}
)
(
df
.group_by("color")
.agg(
pl.col("value").count()
)
)
If we run this we might get the following output:
shape: (2, 2)
┌───────┬───────┐
│ color ┆ value │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ green ┆ 2     │
│ red   ┆ 3     │
└───────┴───────┘
Fine - so the groups are ordered alphabetically, right?
Well no - run this a few more times and we will eventually get the following output with a different order of rows:
shape: (2, 2)
┌───────┬───────┐
│ color ┆ value │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ red   ┆ 3     │
│ green ┆ 2     │
└───────┴───────┘
We see that the order of the group_by output isn’t fixed, either alphabetically or by the order of the inputs. This can be an issue if we want to ensure we get consistent ordering - for example when writing tests.
If we want to get a consistent output we have two choices. The first is to pass the maintain_order=True argument to group_by:
(
df
.group_by("color",maintain_order = True)
.agg(
pl.col("value").count()
)
)
Setting maintain_order=True ensures that the order of the groups is consistent with the order of the input data. However, using maintain_order=True prevents Polars from using the streaming engine for larger-than-memory data.
The second solution is to call sort on the output to impose an ordering on the groups:
(
df
.group_by("color")
.agg(
pl.col("value").count(),
)
.sort("color")
)
As sort is now available in the streaming engine, this solution can also run in streaming mode.
Ordering of unique
We use unique to get the distinct rows of a DataFrame in relation to some columns. In this example we define a simple DataFrame where we define unique values by the color and value columns and track row order with the row column:
df = pl.DataFrame(
{
"color": ["red", "green", "red", "green", "red"],
"value": [0, 1, 0, 1, 2],
"row":[0,1,2,3,4]
}
)
df.unique(subset=["color","value"])
Run it once and we might get output like this:
shape: (3, 3)
┌───────┬───────┬─────┐
│ color ┆ value ┆ row │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ red   ┆ 0     ┆ 0   │
│ green ┆ 1     ┆ 1   │
│ red   ┆ 2     ┆ 4   │
└───────┴───────┴─────┘
In earlier versions of Polars (i.e. before v0.17.0) we would have got this order every time. This was because the unique method behaved differently from group_by in that maintain_order was set to True. This has now changed: maintain_order is set to False by default and so the output of unique is no longer ordered by the input DataFrame. This means the output above could also, for example, be:
shape: (3, 3)
┌───────┬───────┬─────┐
│ color ┆ value ┆ row │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ green ┆ 1     ┆ 1   │
│ red   ┆ 0     ┆ 0   │
│ red   ┆ 2     ┆ 4   │
└───────┴───────┴─────┘
In some ways the previous ordered behaviour was intuitive as we often think of unique as returning the input DataFrame without the duplicate rows. However, as with group_by, having a default of maintain_order=True would mean that unique would not work in streaming mode by default.
Maintaining order is not streaming-friendly as it requires bringing together all the chunks in memory to compare the order of the rows.
With this change of default the developers want to ensure that Polars is ready to work with datasets of all sizes while allowing users to choose different behaviour if desired.
A related point is the choice of which row within each duplicated group is kept by unique. In Pandas this defaults to the first row of each duplicated group. In Polars the default is any as this again allows more optimizations.
Takeaway
When the order of outputs is important to you, check whether there is a maintain_order argument. Some other functions that have this include:
- partition_by
- pivot
- upsample
- cut (applied to a Series)
For more on related topics check out these posts:
- faster groupby on sorted data
- using streaming mode defensively
- writing large queries to Parquet with streaming
- working with multiple large files
or this video where I process a 30 Gb dataset on a not-very-impressive laptop.
Next steps
Want to know more about Polars for high performance data science? Then you can: