Ordering of groupby and unique in Polars

Polars (and Apache Arrow) has been designed to be careful with your data, so you don’t get surprises like the following Pandas code, where the ints column has been cast to float because of the missing value:

import pandas as pd

df = pd.DataFrame({'ints': [None, 1, 2], 'strings': ['a', 'b', 'c']})
print(df)
   ints strings
0   NaN       a
1   1.0       b
2   2.0       c

However, every big library will do something that some users won’t expect. These are commonly referred to as gotchas. In this post we explore a few gotchas relating to the ordering of outputs from group_by and unique that I found while writing my course.

Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course

Ordering of groupby

Let’s define a simple DataFrame and do a group_by aggregation:

import polars as pl

df = pl.DataFrame(
    {
        "color": ["red", "green", "green", "red", "red"],
        "value": [0, 1, 2, 3, 4],
    }
)
(
    df
    .group_by("color")
    .agg(
        pl.col("value").count()
    )
)

If we run this we might get the following output:

        
      
shape: (2, 2)
┌───────┬───────┐
│ color ┆ value │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ green ┆ 2     │
│ red   ┆ 3     │
└───────┴───────┘

Fine - so the groups are ordered alphabetically, right?

Well no - run this a few more times and we will eventually get the following output with a different order of rows:

        
      
shape: (2, 2)
┌───────┬───────┐
│ color ┆ value │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ red   ┆ 3     │
│ green ┆ 2     │
└───────┴───────┘

We see the order of group_by output isn’t fixed either alphabetically or by the order of the inputs. This can be an issue if we want to ensure we get consistent ordering - for example when writing tests.

If we want to get a consistent output we have two choices. The first is to pass the maintain_order = True argument to group_by:

        
      
(
    df
    .group_by("color", maintain_order=True)
    .agg(
        pl.col("value").count()
    )
)

Setting maintain_order = True ensures that the order of the groups is consistent with the order of the input data. However, using maintain_order = True prevents Polars from using the streaming engine for larger-than-memory data.

The second solution is to call sort on the output to impose an ordering on the groups:

        
      
(
    df
    .group_by("color")
    .agg(
        pl.col("value").count(),
    )
    .sort("color")
)

As sort is now available in the streaming engine, this solution can also run in streaming mode.

Ordering of unique

We use unique to get the distinct rows of a DataFrame with respect to a subset of columns. In this example we define a simple DataFrame where uniqueness is determined by the color and value columns, and we track the original row order with the row column:

        
      
df = pl.DataFrame(
    {
        "color": ["red", "green", "red", "green", "red"],
        "value": [0, 1, 0, 1, 2],
        "row": [0, 1, 2, 3, 4],
    }
)
df.unique(subset=["color", "value"])

Run it once and we might get output like this:

shape: (3, 3)
┌───────┬───────┬─────┐
│ color ┆ value ┆ row │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ red   ┆ 0     ┆ 0   │
│ green ┆ 1     ┆ 1   │
│ red   ┆ 2     ┆ 4   │
└───────┴───────┴─────┘

In earlier versions of Polars (i.e. before v0.17.0) we would have got this order every time.

This was because the unique method behaved differently from group_by in that maintain_order defaulted to True. This has now changed: maintain_order defaults to False, so the output of unique is no longer guaranteed to follow the order of the input DataFrame. This means the output above could also, for example, be:

shape: (3, 3)
┌───────┬───────┬─────┐
│ color ┆ value ┆ row │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ green ┆ 1     ┆ 1   │
│ red   ┆ 0     ┆ 0   │
│ red   ┆ 2     ┆ 4   │
└───────┴───────┴─────┘

In some ways the previous ordered behaviour was intuitive, as we often think of unique as returning the input DataFrame without the duplicate rows. However, as with group_by, a default of maintain_order = True would mean that unique would not work in streaming mode by default. Maintaining order is not streaming-friendly, as it requires bringing together all the chunks in memory to compare the order of the rows.

With this change of default the developers want to ensure that Polars is ready to work with datasets of all sizes while allowing users to choose different behaviour if desired.

A related point is the choice of which row within each group of duplicates is kept by unique, which is controlled by the keep argument. In Pandas this defaults to the first row of each group of duplicates. In Polars the default is any, which makes no guarantee about which row is kept, as this again allows more optimizations.

Takeaway

When the order of outputs is important to you, check whether the function has a maintain_order argument. Some other functions that have this include:

partition_by
pivot
upsample and
cut (applied to a Series)

For more on related topics check out these posts:

faster groupby on sorted data
using streaming mode defensively
writing large queries to Parquet with streaming
working with multiple large files

or this video where I process a 30 GB dataset on a not-very-impressive laptop.


