I got a good question recently from a new Polars user: What is the difference between a Series and an expression in Polars?
Well, a Series is a 1D data structure. An expression is a function that operates on a Series to produce a new Series.
So we use an expression to transform a Series.
There are many examples of expressions in Polars:
- the simplest example is the identity expression, which just returns the Series it is given: pl.col("id")
- we can transform the contents of a Series using expressions like pl.col("id").str.to_uppercase()
- we can do arithmetic operations on Series using expressions like pl.col("value") + 1
- we can aggregate Series using expressions like pl.col("value").sum()
- we can change the name of the output Series using expressions like pl.col("value").alias("double_value")
- we can apply expressions over groups using expressions like pl.col("value").sum().over("id")
What is the Expression API?
Next question: what is the expression API that the Polars docs are always talking about?
The expression API is the collective name for the methods in Polars that take expressions as arguments and the expressions themselves.
Methods that take expressions as arguments
Examples of DataFrame methods that take expressions as arguments include:
df.filter(pl.col("id") == 1)
df.select(pl.col("id").str.to_uppercase())
df.with_columns(double_value=pl.col("value") * 2)
df.group_by(pl.col("id")).agg(pl.col("value").sum())  # group_by was named groupby before Polars 0.19
When we use these methods with expressions we have an important concept to understand: context. The context tells us what actual data will be used as the input to the expression.
For example, when we do df.filter or df.select we are in the select context. This context means that the whole column of the DataFrame is the input to the expression.
When we do df.group_by (named df.groupby in older Polars releases) we are in the group-by context. This context means that the input to the expression is the group of rows that have the same value in the grouping column.
The expressions themselves
The other component of the expression API is the expressions themselves. These are the functions that we use to transform and aggregate Series.
These functions are listed in the Polars API documentation. The functions are grouped into categories like:
- Aggregation for aggregating (obviously)
- Computation for computing (obviously)
- Columns / names for working with column names
- Window for applying expressions over groups
There are also categories for expressions that apply only to certain dtypes such as:
- Strings for string manipulation and matching
- Temporal for working with dates and times
- Array or List for working with arrays and lists
What are the benefits of the expression API?
The expression API is what lets you get the full power of Polars.
Firstly, when we apply multiple expressions in the same context Polars can run them in parallel.
Secondly, the expression API allows you to work in lazy mode. An expression is really an instruction to the Polars query engine of what you want to do. In lazy mode you can build up a complex data processing pipeline and then Polars can apply query optimisations before it is executed.
Finally, the expression API allows you to work with larger-than-memory datasets. Polars can handle datasets that are too large to fit into memory by processing the data in batches rather than loading it all at once. Polars calls this streaming, and it is available in lazy mode, where queries are built from expressions.