Home Polars doesn't have an index but what if you want one? Or many?
Post
Cancel

Polars doesn't have an index but what if you want one? Or many?

Published on: 21st October 2022

Polars doesn’t have an index but what if you want one. Or many?

This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters

Polars doesn’t have an index. That’s great because it saves a lot of time setting and resetting indices.

But what if you want fast access to subsets of the data that you access often?

Partitioning a DataFrame

One way to solve this is to partition your dataframe.

In Polars we can do this with partition_by.

We tell it which column(s) we want to partition by and Polars creates a dictionary that maps from the unique values to a DataFrame with the corresponding rows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import polars as pl

df = pl.DataFrame(
  {"keys":["a","b","a"],"values":[0,1,2]}
)

# Create a partition mapping on the keys column
partitions_dict = (
  df
  .partition_by("keys",as_dict=True)
)

# Access the sub-DataFrame for key "a"
partitions_dict["a"]
shape: (2, 2)
┌──────┬────────┐
 keys  values 
 ---   ---    
 str   i64    
╞══════╪════════╡
 a     0      
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
 a     2      
└──────┴────────┘

The dataframes in the partition_by mapping are copies, so this is memory intensive if you want to partition in multiple ways.

Using groupby

But there is an alternative - you can use a humble groupby.

What is a groupby? It’s a mapping from keys to row indices as captured in the agg_groups expression.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import polars as pl

df = pl.DataFrame(
  {
   "keys1":["a","b","a"],
   "keys2":["c","c","d"],
   "values":[0,1,2]}
)

# Create a groupby mapping on the keys1 column
keys1_groups = (
	df
    .groupby("keys1")
    .agg(pl.col("values").agg_groups().alias("groups"))
)
shape: (2, 2)
┌───────┬───────────┐
 keys1  groups    
 ---    ---       
 str    list[u32] 
╞═══════╪═══════════╡
 a      [0, 2]    
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
 b      [1]       
└───────┴───────────┘

One advantage of using groupby is that it’s cheap memory wise - basically the equivalent of a single extra 32-bit integer column to store the row indices for each mapping.

And unlike a Pandas Index or MultIindex you can define lots of different group indices at the same time.

The last step is to make a fast mapping from keys to the sub-DataFrame for that key.

To do this we unwrap the groupby._groups DataFrame to a dictionary and then we can access our sub-DataFrame quickly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
df = pl.DataFrame(
  {
   "keys1":["a","b","a"],
   "keys2":["c","c","d"],
   "values":[0,1,2]}
)

# Create a groupby mapping on the keys1 column
keys1_groups = (
	df
    .groupby("keys1")
    .agg(pl.col("values").agg_groups().alias("groups"))
)
# Convert to a dictionary from keys to row indices
keys1_dict = {
    el["keys1"]:el["groups"] for el in keys1_groups.to_dicts()
}
# Set the key of interest
key = "a"
# Get the sub-DataFrame for that key
(
    df[
        keys1_dict[key]
    ]
)
shape: (2, 3)
┌───────┬───────┬────────┐
 keys1  keys2  values 
 ---    ---    ---    
 str    str    i64    
╞═══════╪═══════╪════════╡
 a      c      0      
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
 a      d      2      
└───────┴───────┴────────┘

So there you have it - a Postgres-style index* on a DataFrame.

*Well obviously there’s more to a Postgres index but this will work for lots of use cases!

Learn more

Want to know more about Polars for high performance data science and ML? Then you can:

or let me know if you would like a Polars workshop for your organisation.

This post is licensed under CC BY 4.0 by the author.