Published on: 3rd October 2022
Getting the largest values with Polars
This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters
When I’ve wanted to get the largest values in a dataframe I’ve always sorted the columns and then called .head.
That’s not the best way of doing this, however. A sort has to compare all the values with each other, even though you only really need comparisons with the small number of top values.
The solution in Polars turns out to be the top_k method.
It has the same output as .sort.head, but is faster because it only cares about comparisons with the largest (or smallest) values.
In this simple example top_k is 2x faster than .sort.head:
import polars as pl
import numpy as np

# Create a random DataFrame with N rows and 100 columns
N = 100000
dfNumeric = pl.DataFrame(np.random.standard_normal((N, 100)))

# Top 3 values with .sort and .head
# (recent Polars versions use descending=True; older versions used reverse=True)
dfNumeric.select(pl.all().sort(descending=True).head(3))
# Time: 180 ms

# Top 3 values with top_k
dfNumeric.select(pl.all().top_k(3))
# Time: 90 ms
In general top_k scales as O(n * log(k)) whereas sorting the whole list scales as O(n * log(n)). So using top_k makes a bigger difference as the gap grows between the number of elements you want and the total number of elements you have.
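To see where the O(n * log(k)) scaling can come from, here is a minimal pure-Python sketch of one common approach: keep a min-heap holding only the k largest values seen so far. This is an illustration of the idea only, not a description of how Polars implements top_k internally; the function name top_k_heap is made up for this example.

```python
import heapq
import random


def top_k_heap(values, k):
    """Return the k largest values in descending order using a size-k min-heap.

    Hypothetical helper for illustration; not the Polars implementation.
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)  # O(log k)
        elif v > heap[0]:
            # v beats the smallest of the current top k, so swap it in
            heapq.heappushpop(heap, v)  # O(log k)
    # Sorting only k elements costs O(k * log(k))
    return sorted(heap, reverse=True)


random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]
k = 3

# Same result as sorting everything and taking the head
assert top_k_heap(data, k) == sorted(data, reverse=True)[:k]
# The standard library offers the same operation directly
assert top_k_heap(data, k) == heapq.nlargest(k, data)
```

Each of the n values triggers at most one heap operation on a heap of size k, so the work is bounded by n * log(k), while the full sort pays log(n) per element.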
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- check out my Polars course on Udemy
- follow me on bluesky
- follow me on twitter
- connect with me at linkedin
- check out my youtube videos
or let me know if you would like a Polars workshop for your organisation.