On the face of it, the concat, extend and vstack functions in Polars do the same job: they take two initial DataFrames and turn them into a single DataFrame. In this post I show that they do quite different things to your data under the hood, and that this can have a significant effect on your query performance.
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course
Basic setup
This is the basic setup: we want to combine two DataFrames, df1 and df2.
import polars as pl

# First DataFrame: two rows
df1 = (
    pl.DataFrame(
        {
            "id":[0,1],
            "values":["a","b"]
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
└─────┴────────┘
# Second DataFrame: same schema, two more rows
df2 = (
    pl.DataFrame(
        {
            "id":[2,3],
            "values":["c","d"]
        }
    )
)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘
If we call any of concat, vstack or extend we get the following output:
shape: (4, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘
So what’s the difference?
With two initial DataFrames the data sits in two different locations in memory. When we combine them into a new DataFrame there are three options:
- copy all the data to a single new location
- leave the data where it is and link the new DataFrame to the existing two locations in memory
- copy the data from one location and append it to the data in the other location
Note that in the last case of appending there has to be enough space to append the data. If there isn’t then both are copied to a new location.
The three methods concat, vstack and extend map onto these three options:
- pl.concat([df1,df2]) copies all the data to a single new location when rechunk=True (the default when this post was written; newer Polars releases may default to rechunk=False, so pass it explicitly if you rely on it)
- df1.vstack(df2) doesn’t copy any data and just links the new DataFrame to the existing two locations in memory
- df1.extend(df2) copies the data from df2 and appends it to the data for df1
I’m simplifying things a little bit for this post but these are the basic paradigms. Under the hood, pl.concat carries out a series of .vstack operations (given a list of DataFrames) and then does the rechunk operation to copy the data to a single location.
Pros and cons?
There are obviously pros and cons of these different approaches:
- Copying all data to a new location is expensive. However, having the data in a single location makes subsequent queries faster and gives more consistent results in terms of timing.
- Not copying any data is very fast (perhaps sub-millisecond) but slows down subsequent queries.
- Appending the data from one location to the other is faster than copying both, but it is hard to predict when the data won’t fit and both will need to be copied to a new location.
In my course I explore some relative timings of the different approaches in simple queries. In general, if you are going to do subsequent operations on a DataFrame then it’s normally worth copying the data to a single location with pl.concat. However, if you just want to combine the DataFrames to do something trivial, like checking the shape, then vstack is the way to go. If you are adding a small DataFrame to a large DataFrame then extend works really well as you are only copying the data from the small DataFrame.
The best approach is very dependent on your problem, but I recommend comparing each of these methods if combining data is taking a lot of time in your pipeline.
Next steps
Want to know more about Polars for high performance data science? Then you can: