Concat, extend or vstack?

On the face of it, the concat, extend and vstack functions in Polars do the same job: they take two DataFrames and turn them into a single DataFrame. In this post I show that they do quite different things to your data under the hood, and that this can have a significant effect on your query performance.

Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course

Basic setup

This is the basic setup: we want to combine two DataFrames, df1 and df2.

import polars as pl

# First DataFrame: ids 0 and 1
df1 = pl.DataFrame(
    {
        "id": [0, 1],
        "values": ["a", "b"],
    }
)

shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
└─────┴────────┘

# Second DataFrame: ids 2 and 3, same schema as df1
df2 = pl.DataFrame(
    {
        "id": [2, 3],
        "values": ["c", "d"],
    }
)

shape: (2, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘

If we call any of concat, vstack or extend, we get the following output:

shape: (4, 2)
┌─────┬────────┐
│ id  ┆ values │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 0   ┆ a      │
│ 1   ┆ b      │
│ 2   ┆ c      │
│ 3   ┆ d      │
└─────┴────────┘

So what’s the difference?

With two initial DataFrames the data sits in two different locations in memory. When we combine them into a new DataFrame there are three options:

  • copy all the data to a single new location
  • leave the data where it is and link the new DataFrame to the existing two locations in memory
  • copy the data from one location and append it to the data in the other location

Note that in the last case of appending there has to be enough space to append the data. If there isn’t then both are copied to a new location.

The three methods concat, vstack and extend correspond to these three options:

  • pl.concat([df1, df2]) copies all the data to a single new location when rechunk=True (the default when this post was written; pass it explicitly if you rely on it, since defaults can differ between Polars versions)
  • df1.vstack(df2) doesn’t copy any data and just links the new DataFrame to the existing two locations in memory
  • df1.extend(df2) copies the data from df2 and appends it to the data for df1, modifying df1 in place

I’m simplifying things a little for this post, but these are the basic paradigms. Under the hood, pl.concat carries out a series of .vstack operations (given a list of DataFrames) and then does the rechunk operation to copy the data to a single location.

Pros and cons?

There are obviously pros and cons of these different approaches:

  • Copying all data to a new location is expensive. However, having the data in a single location makes subsequent queries faster and makes their timings more consistent.
  • Not copying any data is very fast (perhaps sub-millisecond) but slows down subsequent queries.
  • Appending the data from one location to the other is faster than copying both, but it is hard to predict when the data won’t fit and both will need to be copied to a new location.

In my course I explore some relative timings of the different approaches in simple queries. In general if you are going to do subsequent operations on a DataFrame then it’s normally worth copying the data to a single location with pl.concat. However, if you just want to combine the DataFrames to do something trivial - like checking the shape - then vstack is the way to go. If you are adding a small DataFrame to a large DataFrame then extend works really well as you are only copying the data from the small DataFrame.

The best approach is very dependent on your problem, but I recommend comparing each of these methods if combining data is taking a lot of time in your pipeline.

This post is licensed under CC BY 4.0 by the author.