Lazy mode in Polars not only provides query optimisation and lets you work with larger-than-memory datasets. It also provides some type safety that can catch errors in your pipeline before you start crunching through lots of data.
Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course
Basic setup
We illustrate the idea with a simple pipeline below where we create a DataFrame from some data and do a transformation on it in eager mode.
import polars as pl

df = (
    pl.DataFrame(
        {
            "groups": ["a", "a", "b", "b", "c"],
            "values": [0, 1, 2, 3, 4],
        }
    )
    .with_columns(
        pl.col("values").round(0)
    )
)
The problem is that our data transformation isn't valid: the values column has an integer dtype but the round expression can only be called on a column with a floating point dtype.
If we run the code above we get the following exception:
SchemaError: Int64 is not a floating point datatype
In this eager mode example Polars found the error after it had created the DataFrame and tried to do the with_columns transformation.
What happens in lazy mode?
We can try this again in lazy mode, where we convert the DataFrame to a LazyFrame once it has been created but before we do the with_columns transformation.
df = (
    pl.DataFrame(
        {
            "groups": ["a", "a", "b", "b", "c"],
            "values": [0, 1, 2, 3, 4],
        }
    )
    .lazy()
    .with_columns(
        pl.col("values").round(0)
    )
)
If we run this we see that…nothing happens. Polars has just created a LazyFrame
containing the erroneous expression. This is because Polars doesn’t test for schema errors until we execute the pipeline.
If we try to execute the pipeline with collect (to process all the data) or fetch (to process a subset) then we see our SchemaError.
df.collect()
SchemaError: Int64 is not a floating point datatype
The key point here is that Polars finds this error when we call collect but before the actual time-consuming work of processing the data. In this way lazy mode in Polars can help you find errors at the start of a pipeline run rather than a long way into it.
Next steps
Want to know more about Polars for high performance data science? Then you can: