Polars uses Apache Arrow to store its data in-memory. One of the big advantages of Arrow is that it supports a variety of nested data types (or “dtypes”). In this post we look at the pl.List dtype in more detail:
- we start with an overview of the pl.List dtype
- we call expressions on each row of a pl.List column
- we do aggregations with neural network embeddings
- we do simple text analytics
Overview of the pl.List dtype
The pl.List dtype allows us to store an array of values on each row. The crucial point is that the values within each array must all have the same type, and that type must be the same on all rows.
In this example, we create a DataFrame with an integer, float and string pl.List column. Note that:
- in the floats column we have a mix of floats and integers in one row and so Polars casts all values to a float type
- the length of the arrays can vary within a column
import polars as pl
dfLists = pl.DataFrame({
'ints':[ [0,1], [4,3,2]],
'floats':[ [0.0,1], [2,3]],
'strings':[ ["0","1"],["2","3"]]
})
dfLists
shape: (2, 3)
┌───────────┬────────────┬────────────┐
│ ints ┆ floats ┆ strings │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[f64] ┆ list[str] │
╞═══════════╪════════════╪════════════╡
│ [0, 1] ┆ [0.0, 1.0] ┆ ["0", "1"] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 3, 2] ┆ [2.0, 3.0] ┆ ["2", "3"] │
└───────────┴────────────┴────────────┘
The key point to understand with the pl.List dtype is that each row is a pl.Series under the hood. This means that operations on a pl.List column will be fast vectorised operations.
Expressions within arrays
In the use cases later in this post we see how to apply expressions on the entire array. However, we can also apply expressions row-by-row on a pl.List column.
In this example we rank the elements within each array:
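(
    dfLists
    # select (rather than with_columns) returns only the ranked ints column,
    # which matches the single-column output shown below
    .select(
        pl.col("ints").arr.eval(
            pl.element().rank(method="ordinal")
        )
    )
)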
shape: (2, 1)
┌───────────┐
│ ints │
│ --- │
│ list[u32] │
╞═══════════╡
│ [1, 2] │
│ [3, 2, 1] │
└───────────┘
To call the rank expression inside each array we:
- call arr.eval on the ints column
- inside arr.eval we call pl.element to start the expression for each row and
- then we call rank on pl.element to do the rank expression on each row
Use cases
Analysis of embeddings
The pl.List dtype is a great option when you are working with embeddings from a neural network model alongside other metadata.
In the example below we have a doc_id column to identify the document each row came from, a text column showing a chunk of text from each document and an embeddings column with the embeddings for that text.
import numpy as np

# Wrap the whole chain in parentheses so that .with_columns can start a new line
df = (
    pl.DataFrame(
        {
            "doc_id": [0, 0, 1, 1, 2, 2],
            "text": [
                "Polars is a dataframe library",
                "Polars is written in Rust",
                "Expressions allow you to transform data",
                "Expressions run in parallel",
                "Apache Arrow supports nested data",
                "There are three nested dtypes",
            ],
        }
    )
    .with_columns(
        # Random integers stand in for real embeddings, so the values vary between runs
        pl.Series(
            "embeddings",
            [pl.Series("", np.random.randint(0, 5, 3)) for _ in range(6)],
        )
    )
)
shape: (6, 3)
┌────────┬─────────────────────────────────────┬────────────┐
│ doc_id ┆ text ┆ embeddings │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ list[i64] │
╞════════╪═════════════════════════════════════╪════════════╡
│ 0 ┆ Polars is a dataframe library ┆ [1, 1, 4] │
│ 0 ┆ Polars is written in Rust ┆ [2, 3, 3] │
│ 1 ┆ Expressions allow you to transfo... ┆ [4, 3, 4] │
│ 1 ┆ Expressions run in parallel ┆ [4, 0, 0] │
│ 2 ┆ Apache Arrow supports nested dat... ┆ [1, 2, 1] │
│ 2 ┆ There are three nested dtypes ┆ [1, 0, 0] │
└────────┴─────────────────────────────────────┴────────────┘
We then get the document-averaged embeddings by doing a groupby on the doc_id column and averaging the embeddings:
(
df
.groupby(
"doc_id"
)
.agg(
pl.col("embeddings").arr.mean()
)
)
We do the aggregation using arr.mean rather than just mean. By using arr.mean we take advantage of the array expressions for the pl.List dtype in the arr namespace. You can see the full set of expressions in the Polars API reference.
Word counts
Another use case for arrays is when we split strings. In this example we split the text column by whitespace to get individual words. This transforms the text column into a column with arrays of strings.
df2 = (
df
.with_columns(
pl.col("text").str.split(" ")
)
.select(
["doc_id","text"]
)
)
shape: (6, 2)
┌────────┬─────────────────────────────────────┐
│ doc_id ┆ text │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞════════╪═════════════════════════════════════╡
│ 0 ┆ ["Polars", "is", ... "library"] │
│ 0 ┆ ["Polars", "is", ... "Rust"] │
│ 1 ┆ ["Expressions", "allow", ... "da... │
│ 1 ┆ ["Expressions", "run", ... "para... │
│ 2 ┆ ["Apache", "Arrow", ... "data"] │
│ 2 ┆ ["There", "are", ... "dtypes"] │
└────────┴─────────────────────────────────────┘
With this array of strings we can then count the number of words on each row using arr.lengths:
(
df2
.with_columns(
pl.col("text").arr.lengths()
)
)
shape: (6, 2)
┌────────┬──────┐
│ doc_id ┆ text │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞════════╪══════╡
│ 0 ┆ 5 │
│ 0 ┆ 5 │
│ 1 ┆ 6 │
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 2 ┆ 5 │
└────────┴──────┘
Alternatively, we can count the occurrence of each word. We do this by calling explode to transform the string arrays into separate rows:
(
df2
.select(["doc_id","text"])
.explode("text")
)
shape: (30, 2)
┌────────┬───────────┐
│ doc_id ┆ text │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════╪═══════════╡
│ 0 ┆ Polars │
│ 0 ┆ is │
│ 0 ┆ a │
│ 0 ┆ dataframe │
│ ... ┆ ... │
│ 2 ┆ are │
│ 2 ┆ three │
│ 2 ┆ nested │
│ 2 ┆ dtypes │
└────────┴───────────┘
From this exploded format we can then count the word occurrences:
(
df2
.explode("text")
["text"]
.value_counts(sort=True)
)
shape: (24, 2)
┌─────────────┬────────┐
│ text ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════════════╪════════╡
│ Polars ┆ 2 │
│ is ┆ 2 │
│ in ┆ 2 │
│ Expressions ┆ 2 │
│ ... ┆ ... │
│ There ┆ 1 │
│ are ┆ 1 │
│ three ┆ 1 │
│ dtypes ┆ 1 │
└─────────────┴────────┘
Of course, this is just the start of what we can do with the pl.List dtype. Get in touch on twitter/linkedin/youtube if you find other interesting use cases or check out my course to learn more.