Getting started with Polars? This post shows you how to convert some familar Pandas commands to Polars. But it also tries to go beyond that to introduce you to some of the more fundamental differences between Pandas and Polars.

Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course

Did you know - this is the only Pandas to Polars cheatsheet kept under automated testing! I test these code snippets along with the thousands of other code snippets in my course to make sure they are up to date and correct with the latest versions of both libraries.

We first create a sample dataset in Polars

        
      
import polars as pl
import polars.selectors as cs

import pandas as pd
import numpy as np

df = pl.DataFrame(
    {
        'grp': [1, 2, 1, 2, 1, 2],
        'x': list(range(6, 0, -1)),
        'y': list(range(4, 10)),
        'z': [3, 4, 5, 6, 7, None],
        "ref" : list('abcdef')
    }
)
shape: (6, 5)
┌─────┬─────┬─────┬──────┬─────┐
│ grp ┆ x   ┆ y   ┆ z    ┆ ref │
│ --- ┆ --- ┆ --- ┆ ---  ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64  ┆ str │
╞═════╪═════╪═════╪══════╪═════╡
│ 1   ┆ 6   ┆ 4   ┆ 3    ┆ a   │
│ 2   ┆ 5   ┆ 5   ┆ 4    ┆ b   │
│ 1   ┆ 4   ┆ 6   ┆ 5    ┆ c   │
│ 2   ┆ 3   ┆ 7   ┆ 6    ┆ d   │
│ 1   ┆ 2   ┆ 8   ┆ 7    ┆ e   │
│ 2   ┆ 1   ┆ 9   ┆ null ┆ f   │
└─────┴─────┴─────┴──────┴─────┘

For more posts like this check out: -Pandas to Polars time series differences

Accessing data in a `DataFrame`

There are two ways to access data in a Polars DataFrame:

using square brackets with [] (other called “indexing”) and
using the expression API with methods like filter, select and with_columns

These square bracket and expression API approaches have different use cases. The basic rule is that you should use the expression API unless you are doing a one-off operation such as:

inspecting the values of some rows or columns
converting a DataFrame column to a Series

In these cases use the [] approach.

The expression API is more powerful than the [] approach because:

operations with the expression API are run in parallel and
operations within the expression API can be optimised in lazy mode

Accessing data using the expression API

Selecting and transforming a `DataFrame`

Operation	Pandas	Polars
Select a subset of columns	`df[["x","y"]]`	`df.select("x","y")`
Select and transform columns	`df[["x","y"]].astype(float)`	`df.select(pl.col("x","y").cast(pl.Float64))`
Add a column from a constant	`df["w"] = 1`	`df.with_columns(w = pl.lit(1))`
		`df.with_columns(pl.lit(1).alias("w"))`
Add a column from a list	`df["w"] = list(range(1,7))`	`df.with_columns(pl.Series("w",list(range(1,7))))`
Add a column from other columns	`df["w"] = df["x"] + df["y"]`	`df.with_columns(w = pl.col("x") + pl.col("y"))`
Change dtype of a column	`df.assign(x = lambda df: df["x"].astype("float"))`	`df.with_columns(pl.col("x").cast(pl.Float64))`
Rename a column	`df.rename(columns={"x":"x2"})`	`df.rename({"x":"x2"})`
Drop columns	`df.drop(columns=["x","y"])`	`df.drop(["x","y"])`
		`df.drop("x","y")`
Sorting	`df.sort_values("x")`	`df.sort("x")`
Copying	`df.copy()`	`df.clone()`

In these examples we’ve typed out the column names explicitly. Polars also has lots of ways to select multiple columns based on their dtype or patterns in their names.

Select	Polars
All columns	`df.select(pl.all())`
All integer columns	`df.select(pl.col(pl.INTEGER_DTYPES))`
… except ‘x’	`df.select(pl.col(pl.INTEGER_DTYPES).exclude('x'))`
All integer columns	`df.select(cs.integer())`
By regex	`df.select(cs.matches('x\\|y'))`
By name pattern	`df.select(cs.starts_with('r'))`

Polars does not support in-place operations so when we do any transformations we re-assign the DataFrame variable. For example if we add a new constant column we must do it like this:

        
      
df = df.with_columns(w = pl.lit(1))

Creating a new DataFrame like this is cheap in Polars as it does not copy the underlying data, Polars just creates a new reference to the underlying data.

Filtering rows

Operation	Pandas	Polars
Filter rows	`df.loc[df.ref == 'c']`	`df.filter(pl.col("ref") == "c")`
	`df.query("ref == 'c'")`	`df.filter(ref = "c")` [1]
Filter rows (text operator)	`df.loc[df.ref.eq('c')]`	`df.filter(pl.col("ref").eq("c"))`
Multiple filters	`df.loc[(df.ref == 'c') & (df.x > 1)]`	`df.filter((pl.col("ref") == "c") & pl.col("x") > 1))`
		`df.filter(pl.col("ref") == "c", pl.col("x") > 1)` [2]
Multiple filters (optimised)		`df.lazy().filter(pl.col("ref") == "c").filter(pl.col("x") > 1)` [3]
Multiple filters (OR condition)	`df.loc[(df.ref == 'c')` ` (df.x > 1)]`	`df.filter((pl.col("ref") == "c")`	` pl.col(“x”) > 1))`
Is in condition	`df.loc[df.ref.isin(['a','b']) ]`	` df.filter(pl.col(“ref”).is_in([‘a’,’b’]))`
Is between condition	`df.loc[df.ref.between(2,4) ]`	` df.filter(pl.col(“x”).is_between(2,4))`

[1] In the second row of the Polars column we did a filter without using pl.col. This is because Polars has a special syntax for filtering by column name. This approach takes advantage of Python keyword arguments to a function to treat setting a keyword argument - like ref = "c" - as a filter condition. This approach only works for equality conditions.

[2] Here we apply multiple filter conditions without the annoying brackets and &. Instead we pass a comma separated list of conditions to the filter method.

[3] In the optimised example for Polars we used separate filter calls. However, as we are in lazy mode the Polars query optimiser combines these into a single filter condition applied on a single pass through the data.

There’s a lot more to filtering in Polars. For example, if you do a filter late in a query that could be done earlier (and so reduce the amount of data processed) then the Polars query optimiser moves the filter earlier in the query. Even better, if you apply a filter on Parquet data stored in the clouds Polars tries to apply those filters in the cloud storage layer before the data is transferred across the network. Check out my course for more on this.

Accessing data using `[]`

Operation	Pandas	Polars
Get column as Series	`df["grp"]`	`df["grp"]`
Cell indexing by location	`df.iloc[1, 1]`	`df[1, 1]`
Row slicing by location	`df.iloc[1:3]`	`df[1:3]`
Column slicing by location	`df.iloc[:, 1:]`	`df[:, 1:]`
Row indexing by label	`df.loc['c']`	`df.filter(pl.col("index") == "c")`
Column indexing by label	`df.loc[:, 'x']`	`df[:, "x"]`
		`df.select("x")`
Column indexing by labels	`df.loc[:, ['x', 'z']]`	`df[:, ['x', 'z']]`
		`df.select(['x', 'z'])`
Column slicing by label	`df.loc[:, 'x':'z']`	`df[:, "x":"z"]`
Mixed indexing	`df.loc['c'][1]`	`df.filter(pl.col("index") == "c")[0, 1]`

Note: when a query in Pandas returns a single row then that row is returned as a Series. If the row contains both floats and integers then Pandas casts the integers to floats in the Series. Polars returns a DataFrame with one row keeping the original dtypes.

Duplicates and missing values

Operation	Pandas	Polars
Keep unique rows	`df.drop_duplicates()`	`df.unique()` [1]
…based a subset of columns	`df.drop_duplicates(subset=["ref"])`	`df.unique(subset=["ref"])`
Drop rows with missing values	`df.dropna()`	`df.drop_nulls()`

[1] Be aware that in Polars the order of the output from df.unique() is not in general the same as the order of the input. In addition, the default choice of which of each duplicated row to keep is any rather first as in Pandas. I looked at the reasons for this behaviour and how you can control it in this post.

The missing value in Pandas depends on dtype of the column whereas in Polars a missing value is null for all dtypes.

Grouping data and aggregation

Polars has a group_by function to group rows. The following table illustrates some common grouping and aggregation usages. The code snippets are long so scroll horizontally to see Polars.

Operation	Pandas	Polars
Agg by groups	`df.groupby('grp')['x'].mean()`	`df.group_by('grp').agg(pl.col("x").mean()`
Agg multiple columns	`df.agg({'x': 'max', 'y': 'min'})`	`df.select([pl.col("x").max(),pl.col("y").min()])`
	`df[['x', 'y']].mean()`	`df.select(["x","y"]).mean()`
	`df.filter(regex=("^x")).mean()`	`df.select(pl.col("^x$").mean()`
		`df.select(cs.starts_with("x").mean()`
Rename column after aggregation	`df.groupby('grp')['x'].mean().rename("x_mean")`	`df.group_by("grp").agg(pl.col("x").mean().name.suffix("_mean"))`
Add aggregated data as column	`df.assign(x_mean=df.groupby("grp")["x"].transform("mean"))`	`df.with_column(pl.col("x").mean().over("grp").name.suffix("_mean"))`

The output of aggregations in Pandas can be a Series whereas in Polars it is always a DataFrame. Where the output is a Series in Pandas there is a risk of the dtype being changed such as ints to floats.

As noted for unique above be aware that the order of the rows in the output of groupby in Polars is random by default.

Pivoting and unpivoting/melting

Polars has moved away from Pandas syntax (e.g. melt becomes unpivot) to have a syntax that is more consistent between pivot and unpivot. In both cases we use the on argument to specify the column to pivot/unpivot on.

The order of the columns in a Polars pivot is random. We can fix this order by passing the sort_columns argument to pivot.

In Pandas the presence of missing values causes the integer values to be cast to float.

Check out the many other posts I’ve written on Polars!

Learn more

Want to know more about Polars for high performance data science and ML? Then you can:

or let me know if you would like a Polars workshop for your organisation.

Cheatsheet for Pandas to Polars

Accessing data in a DataFrame

Accessing data using the expression API

Selecting and transforming a DataFrame

Filtering rows

Accessing data using []

Duplicates and missing values

Grouping data and aggregation

Pivoting and unpivoting/melting

Learn more

Further Reading

What is a Polars expression?

Reading from S3 with Polars (or DeltaLake) using AWS SSO

Fitting linear models within Polars

Accessing data in a `DataFrame`

Selecting and transforming a `DataFrame`

Accessing data using `[]`