Two major advantages of Polars over Pandas is that Polars has a lazy mode with query optimization and that Polars can scale to larger-than-memory datasets with its streaming mode. Taking advantage of these sometimes requires re-thinking how you might write the same operations in Pandas…
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course
A gappy time series
In this simple example I have a time series that is missing some times. I want to add rows in with these missing times and interpolate over the gaps.
We define a time series that is missing an entry at 2020-01-01 02:00:00.
1
2
3
4
5
6
7
8
9
10
11
12
import polars as pl
df = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1),
datetime(2020, 1, 1, 1),
datetime(2020, 1, 1, 3),
],
"values": [0, 1, 3],
}
)
This DataFrame
looks like this:
1
2
3
4
5
6
7
8
9
10
shape: (3, 2)
┌─────────────────────┬────────┐
│ time ┆ values │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0 │
│ 2020-01-01 01:00:00 ┆ 1 │
│ 2020-01-01 03:00:00 ┆ 3 │
└─────────────────────┴────────┘
Filling gaps in eager mode with upsample
We can fills gaps in eager mode with upsample - just as in Pandas.
1
df.set_sorted('time').upsample('time',every='1h')
The output with upsample
looks like this:
1
2
3
4
5
6
7
8
9
10
11
shape: (4, 2)
┌─────────────────────┬────────┐
│ time ┆ values │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0 │
│ 2020-01-01 01:00:00 ┆ 1 │
│ 2020-01-01 02:00:00 ┆ null │
│ 2020-01-01 03:00:00 ┆ 3 │
└─────────────────────┴────────┘
The main difference between Polars and Pandas here is that Polars requires the column to be sorted before calling upsample
. This is because the implementation requires sorted data and Polars wants to avoid doing an expensive sort if we can tell it that the data is already sorted.
The problem with this approach is that upsample
is an eager operation. This means we have to load the whole DataFrame into memory before performing the operation. This is fine for small DataFrames but will not scale to larger-than-memory datasets.
Filling gaps in lazy mode
To fill gaps in lazy mode we first define a DataFrame
that has the time series with no gaps using the pl.date_range
function
1
2
3
4
5
6
7
8
9
10
pl.DataFrame(
{
"time": pl.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 1, 3),
interval="1h",
eager=True,
)
}
)
We then left join the original DataFrame
to this DataFrame
using the time
column as the join key. The key point is that we call lazy
on each of the DataFrame
s before joining them. This tells Polars to perform the join in lazy mode.
1
2
3
4
5
6
7
8
9
10
pl.DataFrame(
{
"time": pl.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 1, 3),
interval="1h",
eager=True,
)
}
).lazy().join(df.lazy(), on="time", how="left")
If we evalaute this code with collect
we get the following output:
1
2
3
4
5
6
7
8
9
10
11
shape: (4, 2)
┌─────────────────────┬────────┐
│ time ┆ values │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0 │
│ 2020-01-01 01:00:00 ┆ 1 │
│ 2020-01-01 02:00:00 ┆ null │
│ 2020-01-01 03:00:00 ┆ 3 │
└─────────────────────┴────────┘
Interpolating over the gaps
We can now interpolate over the gaps using the interpolate
expression.
1
2
3
4
5
6
7
8
9
10
11
12
pl.DataFrame(
{
"time": pl.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 1, 3),
interval="1h",
eager=True,
)
}
).lazy().join(df.lazy(), on="time", how="left").with_columns(
pl.col("values").interpolate()
)
If we evalaute this code with collect
we get the following output:
1
2
3
4
5
6
7
8
9
10
11
shape: (4, 2)
┌─────────────────────┬────────┐
│ time ┆ values │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪════════╡
│ 2020-01-01 00:00:00 ┆ 0 │
│ 2020-01-01 01:00:00 ┆ 1 │
│ 2020-01-01 02:00:00 ┆ 2 │
│ 2020-01-01 03:00:00 ┆ 3 │
└─────────────────────┴────────┘
Streaming mode
As I covered in a previous post we can check if a lazy query will use the streaming engine if we call explain(streaming=True)
and there is a code block bounded by PIPELINE
.
1
2
3
4
5
6
7
8
9
10
11
12
pl.DataFrame(
{
"time": pl.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 1, 3),
interval="1h",
eager=True,
)
}
).lazy().join(df.lazy(), on="time", how="left").with_columns(
pl.col("values").interpolate()
).explain(streaming=True)
In this case we get the following output:
1
2
3
4
5
6
7
8
9
10
11
WITH_COLUMNS:
[col("values").interpolate()]
--- PIPELINE
LEFT JOIN:
LEFT PLAN ON: [col("time")]
DF ["time"]; PROJECT */1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("time")]
DF ["time", "values"]; PROJECT */2 COLUMNS; SELECTION: "None"
END LEFT JOIN --- END PIPELINE
DF []; PROJECT */0 COLUMNS; SELECTION: "None"
Here we see that we can do the join in streaming mode but we can’t do the interpolation in streaming mode. Interpolation is a challenging operation for streaming mode as it can require data from different batches to perform the interpolation.
Generalising this approach for multuple time series
The example here is relatively simple but can be generalised. In the ML time series forecasting pipelines that I build, for example, I typically have multiple time series in a DataFrame
with an id
column to distinguise them. In this case I do an extra cross join of the time steps and the IDs before doing the left join with the gappy data.
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course )
Next steps
Want to know more about Polars for high performance data science? Then you can: