There are some important differences between time series concepts in Pandas and Polars that you should know about. In this post I talk through some of these key differences to help you make the Pandas-to-Polars switch.
I’m working with Polars version 0.20.6 here, but most of these points should be independent of the version of Polars you are using.
Want to get going with Polars? Check out my Polars course here
No more string datetimes
In Pandas we can use date strings when working with dates and times. In Polars, on the other hand, we use Python datetime objects and never use strings for datetime operations.
To illustrate this we create a time series in Polars and then convert it to Pandas. To create a datetime column in Polars we use the confusingly-named datetime.datetime class in Python.
from datetime import datetime
import pandas as pd
import polars as pl

df_polars = pl.DataFrame(
    {
        "datetime": [
            datetime(2021, 1, 1), datetime(2021, 1, 2), datetime(2021, 1, 3)
        ],
        "value": [1, 2, 3],
    }
)
df_pandas = df_polars.to_pandas()
df_polars
shape: (3, 2)
┌─────────────────────┬───────┐
│ datetime ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2021-01-01 00:00:00 ┆ 1 │
│ 2021-01-02 00:00:00 ┆ 2 │
│ 2021-01-03 00:00:00 ┆ 3 │
└─────────────────────┴───────┘
In Pandas we can use datetime strings to filter datetimes like this:
df_pandas.loc[df_pandas["datetime"] > "2021-01-02"]
But in Polars we use the datetime.datetime class to filter dates:
df_polars.filter(pl.col("datetime") > datetime(2021,1,2))
shape: (1, 2)
┌─────────────────────┬───────┐
│ datetime ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2021-01-03 00:00:00 ┆ 3 │
└─────────────────────┴───────┘
The Polars developers chose not to support string datetime representations because they are ambiguous. For example, 2021-01-02 could be the 2nd of January or the 1st of February depending on the locale.
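If your data does arrive with string dates, you can parse them explicitly into a datetime column. Here is a minimal sketch using the str.to_datetime expression, where the explicit format string removes the ambiguity (the df_str frame and its column names are just for illustration):
# Parse date strings into a Datetime column with an explicit format.
df_str = pl.DataFrame({"date_str": ["2021-01-01", "2021-01-02", "2021-02-01"]})
df_str.with_columns(
    pl.col("date_str").str.to_datetime("%Y-%m-%d").alias("datetime")
)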
Of course, we can still create a string column from a datetime column using the dt.strftime method:
df_polars.with_columns(pl.col("datetime").dt.strftime("%Y-%m-%d").alias("date_str"))
shape: (3, 3)
┌─────────────────────┬───────┬────────────┐
│ datetime ┆ value ┆ date_str │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ str │
╞═════════════════════╪═══════╪════════════╡
│ 2021-01-01 00:00:00 ┆ 1 ┆ 2021-01-01 │
│ 2021-01-02 00:00:00 ┆ 2 ┆ 2021-01-02 │
│ 2021-01-03 00:00:00 ┆ 3 ┆ 2021-01-03 │
└─────────────────────┴───────┴────────────┘
Want more time series tips? There is a whole time series section in my Polars course
Polars has different interval strings
In both Pandas and Polars we can represent intervals using strings. In Pandas, for example, we use 30T for 30 minutes; in Polars we use 30m for 30 minutes. Here are some examples of interval strings in Polars:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)
We can compose these interval strings to create more complex intervals. For example, we can use 1h30m for 1 hour and 30 minutes.
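As a quick sketch of an interval string in action, we can pass one to pl.datetime_range to generate a regular datetime sequence (eager=True makes it return a Series immediately rather than a lazy expression):
# A datetime every 1 hour and 30 minutes across a 6-hour span.
pl.datetime_range(
    datetime(2021, 1, 1),
    datetime(2021, 1, 1, 6),
    interval="1h30m",
    eager=True,
)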
Polars works with microseconds by default
In both libraries the datetime, date and duration dtypes are all based on an underlying integer representation of time. For example, with the pl.Datetime dtype the integer represents a count since the start of the Unix epoch.
In Pandas this integer counts nanoseconds by default, but in Polars it counts microseconds by default. The microsecond time unit is denoted by us in the DataFrame schema below:
df_polars.schema
OrderedDict(
    [
        ('datetime', Datetime(time_unit='us', time_zone=None)),
        ('value', Int64)
    ]
)
However, these are only the defaults: Polars also supports millisecond and nanosecond precision, and Pandas (from version 2.0) also supports second, millisecond and microsecond precision.
If we convert a Pandas DataFrame to a Polars DataFrame then the integer representations remain in nanoseconds. We can’t join two Polars DataFrames on a datetime column if one has nanosecond precision and the other has microsecond precision. So when I convert from Pandas to Polars I normally cast datetime columns to microseconds straight away using the dt.cast_time_unit expression:
df_polars = pl.from_pandas(df_pandas).with_columns(
    pl.col("datetime").dt.cast_time_unit("us")
)
A missing datetime in Polars is a null rather than a NaT
In Pandas a missing datetime in a datetime column is represented by NaT (not a time). In Polars a missing datetime is represented by the same value used for missing data in every column: null.
df_polars = pl.DataFrame(
    {
        "datetime": [
            datetime(2021, 1, 1), None, datetime(2021, 1, 3)
        ],
        "value": [1, 2, 3],
    }
)
df_polars
shape: (3, 2)
┌─────────────────────┬───────┐
│ datetime ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2021-01-01 00:00:00 ┆ 1 │
│ null ┆ 2 │
│ 2021-01-03 00:00:00 ┆ 3 │
└─────────────────────┴───────┘
I find that having the same representation for missing values in every column makes it easier to work with missing values in Polars. This is because I don’t have to remember different approaches for missing values in different dtypes, e.g. .isna versus .isnull in Pandas.
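As a minimal sketch of this uniform approach, the same null expressions work on a datetime column as on any other column:
# Keep only the rows where the datetime is missing.
df_polars.filter(pl.col("datetime").is_null())

# Or replace the missing datetime with a literal value.
df_polars.with_columns(pl.col("datetime").fill_null(datetime(2021, 1, 2)))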
Temporal groupby in Polars has its own method
In Pandas you do a temporal groupby by passing a pd.Grouper object to groupby:
df_pandas.set_index("datetime").groupby(pd.Grouper(freq='D')).mean()
In Polars we have a dedicated method for temporal groupby: group_by_dynamic. In this example we get the mean value for each day:
df_polars.sort("datetime").group_by_dynamic("datetime", every="1d").agg(pl.col("value").mean())
Note that we sort the DataFrame by the datetime column before we do the groupby. This is because the group_by_dynamic method requires the data to be sorted by the column we are grouping by.
As in Pandas, we have lots of flexibility in how the grouping windows are set. For example, if we want to offset the start of the windows by 2 hours we can do this:
df_polars.sort("datetime").group_by_dynamic("datetime", every="1d", offset="2h").agg(pl.col("value").mean())
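There are further window-control arguments too. As a sketch, here we use period to stretch each window to 2 days (so consecutive windows overlap) and closed to choose which window edge is inclusive:
(
    df_polars
    .sort("datetime")
    .group_by_dynamic("datetime", every="1d", period="2d", closed="left")
    .agg(pl.col("value").mean())
)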
Resample in Pandas is upsample or group_by_dynamic in Polars
In Pandas we use the resample method to change the frequency of a time series.
In Polars we use the upsample method to change to a higher frequency than the data, and group_by_dynamic (as above) to aggregate down to a lower frequency.
For example, to upsample from a 1 hour frequency to 30 minutes we can do this:
df_polars.sort("datetime").upsample(time_column="datetime", every="30m")
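Note that upsample fills the new timestamps with null rows. A typical follow-up, sketched here, is to forward-fill the value column:
(
    df_polars
    .sort("datetime")
    .upsample(time_column="datetime", every="30m")
    .with_columns(pl.col("value").fill_null(strategy="forward"))
)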
Polars has fast-path operations on sorted data
Polars can take advantage of sorted data to speed up operations using fast-path operations. These fast-path operations occur where Polars knows a column is sorted and can therefore use a faster algorithm to perform the operation. As time series data has a natural sort order it is particularly important to be aware of fast-paths for time series analysis.
We can adapt our filter code above for a simple example of a fast-path operation on time series data. This time we are looking for datetimes before the 2nd of January.
df_polars.filter(pl.col("datetime") < datetime(2021,1,2))
If Polars knows that the datetime column is sorted then the fast-path operation is to stop scanning the column once it finds the first row that is greater than or equal to the filter value. This can be much faster than scanning the whole column.
Other important time series methods that support fast-path operations include group_by and join.
For more on fast-path operations, and many other Polars tips, see my other Polars posts here: https://www.rhosignal.com/tags/polars/
If you would like more detailed support on working with Polars then I provide consulting on optimising your data processing pipelines with Polars. You can also check out my online course to get up and running with Polars.