Polars has 4 native nested column types. These can be very helpful for solving problems such as:
- working with ML embeddings
- splitting strings
- working with nested JSON data
- working with aggregations
To take advantage of them it’s important you understand the difference between the types. In this post I set out the key differences between the nested column types and give some examples of when you might use each one.
Nested column types overview
The 4 native nested column types in Polars are:
- pl.List
- pl.Array
- pl.Object
- pl.Struct
We can immediately split these into two groups:
- pl.List, pl.Array and pl.Object store some kind of sequence on each row
- pl.Struct is a nested collection of columns
The sequence types
The sequence types pl.List, pl.Array and pl.Object store some kind of sequence on each row. The main differences between them are how they store the sequence and whether the length of the sequence can be different on each row.
We can break the sequence types into two groups:
- pl.List and pl.Array store the data on each row in a Polars Series
- pl.Object stores the data on each row as a Python object, in this post a Python list
pl.List and pl.Array
On each row pl.List and pl.Array store the data in a Polars Series. As with any Polars Series, the data in the Series must have a homogeneous dtype, e.g. floats as pl.Float32 or strings as pl.String (called pl.Utf8 in older Polars versions). The dtype must also be the same for all rows in the column.
The difference between pl.List and pl.Array is that the length of the sequence can be different on each row for pl.List but must be the same for pl.Array. In this sense a pl.Array is more comparable to a 2D numpy array where the first dimension is the length of the DataFrame and the second dimension is the length of the array.
One further practical difference between pl.List and pl.Array is that pl.Array is relatively new and has less functionality. You may need to use pl.List until pl.Array is further developed.
To illustrate this we create a DataFrame with each of the sequence types. We start with a float pl.List column and a mixed pl.Object column. Polars infers the inner dtype of the pl.List column as pl.Float64 and uses pl.Object for the mixed_object column because the data types in that column are mixed.
We then create a new pl.Array column floats_array by casting the floats column to a pl.Array type. To do this we specify the width of the array as 2 and the inner type as pl.Float64.
import polars as pl

df = pl.DataFrame(
    {
        "floats": [[0.0, 1], [2, 3]],
        "mixed_object": [["a", 0], ["b", 1]]
    }
).with_columns(
    # Cast the list column to a fixed-width array of two floats
    floats_array=pl.col("floats").cast(pl.Array(width=2, inner=pl.Float64))
)
print(df)
shape: (2, 3)
┌────────────┬──────────────┬───────────────┐
│ floats     ┆ mixed_object ┆ floats_array  │
│ ---        ┆ ---          ┆ ---           │
│ list[f64]  ┆ object       ┆ array[f64, 2] │
╞════════════╪══════════════╪═══════════════╡
│ [0.0, 1.0] ┆ ['a', 0]     ┆ [0.0, 1.0]    │
│ [2.0, 3.0] ┆ ['b', 1]     ┆ [2.0, 3.0]    │
└────────────┴──────────────┴───────────────┘
Note that the pl.Object column holds lists, and each list contains a mix of different data types.
The use cases of the sequence types include working with vector data and splitting strings.
The pl.List
dtype is also used extensively internally - for example a group_by
creates a pl.List
column with the data for each group
and aggregations happen on this pl.List
column.
pl.DataFrame(
    {
        "grp": ["a", "a", "b"],
        "value": [0, 1, 2]
    }
).group_by("grp").agg(
    pl.col("value")
)
shape: (2, 2)
┌─────┬───────────┐
│ grp ┆ value     │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ b   ┆ [2]       │
│ a   ┆ [0, 1]    │
└─────┴───────────┘
The pl.Struct type of nested columns
Whereas the sequence types above have a sequence on each row, the pl.Struct type is a nested collection of columns. The pl.Struct is really just a way of having a nested namespace for columns. The underlying columns are just normal Polars Series.
Of course, like any Polars Series, the data in the columns underlying the pl.Struct must have a homogeneous dtype.
In this example we have a trades column that is made of a list of Python dicts. Each dict has the same keys and the values have the same types.
df_struct = (
    pl.DataFrame(
        {
            "year": [2020, 2021],
            "trades": [
                {"exporter": "India", "importer": "USA", "quantity": 0.0},
                {"exporter": "India", "importer": "USA", "quantity": 1.5},
            ]
        }
    )
)
shape: (2, 2)
┌──────┬─────────────────────┐
│ year ┆ trades              │
│ ---  ┆ ---                 │
│ i64  ┆ struct[3]           │
╞══════╪═════════════════════╡
│ 2020 ┆ {"India","USA",0.0} │
│ 2021 ┆ {"India","USA",1.5} │
└──────┴─────────────────────┘
If you have a pl.Struct column and want to un-nest the columns back into a flat DataFrame you can do so with unnest.
df_struct.unnest('trades')
shape: (2, 4)
┌──────┬──────────┬──────────┬──────────┐
│ year ┆ exporter ┆ importer ┆ quantity │
│ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64  ┆ str      ┆ str      ┆ f64      │
╞══════╪══════════╪══════════╪══════════╡
│ 2020 ┆ India    ┆ USA      ┆ 0.0      │
│ 2021 ┆ India    ┆ USA      ┆ 1.5      │
└──────┴──────────┴──────────┴──────────┘
Use cases of the pl.Struct type include working with nested JSON data and collapsing columns into groups when working with wide DataFrames.
That’s it for the intro to the nested column types. If you want to learn more about working with the pl.List dtype check out my post focused on that. For a more comprehensive intro check out my Data Analysis with Polars course.
Next steps
Want to know more about Polars for high performance data science? Then you can: