One of the most common questions we get on the Polars discord is how to filter rows in one dataframe by values in another.
I think people don’t realise this is a basically a join because they don’t want any of the columns from the other dataframe.
The good news is that functionality exists - it’s called either:
- a semi join if you want to keep rows also found in the other dataframe or
- an anti join if you want to remove rows also found in the other dataframe.
Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course
Here’s the example I made for the Polars user guide…
For a rental car company we have a cars dataframe with cars we own and their unique IDs…
1
2
3
4
5
6
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c"],
"make": ["ford", "toyota", "bmw"],
}
)
1
2
3
4
5
6
7
8
9
10
shape: (3, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
│ c ┆ bmw │
└─────┴────────┘
along with a repairs dataframe that tracks repairs made to each car.
1
2
3
4
5
6
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
1
2
3
4
5
6
7
8
9
shape: (2, 2)
┌─────┬──────┐
│ id ┆ cost │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ c ┆ 100 │
│ c ┆ 200 │
└─────┴──────┘
We want to find either:
- all cars that have had repairs (the semi join) or
- all cars that had not had repairs (the anti join)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
shape: (1, 2)
┌─────┬──────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════╡
│ c ┆ bmw │
└─────┴──────┘
df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
shape: (2, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
└─────┴────────┘
Notice that with these joins we don’t end up with any columns from the repairs dataframe.
The semi and anti joins have standard join advantages. For example you can do these joins based on conditions over multiple columns. In Polars you can do them in Polars lazy mode and with streaming for large data.
Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course )
Next steps
Want to know more about Polars for high performance data science? Then you can: