Updated June 2024 for Polars version 1.0
In this post we see how to read and write CSV and Parquet files in S3 with Polars. We also see how to filter a file on S3 before downloading it, to reduce the amount of data transferred across the network.
Want to accelerate your analysis with Polars? Join over 2,000 learners on my highly-rated Up & Running with Polars course.
Writing a file to S3
We create a simple DataFrame with 3 columns which we will write to both a CSV and a Parquet file in S3 using s3fs. The s3fs library allows you to read and write files to S3 with similar syntax to working on a local file system.
import polars as pl
import s3fs

bucket_name = "my_bucket"
csv_key = "test_write.csv"
parquet_key = "test_write.parquet"

fs = s3fs.S3FileSystem()

df = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)

# Write the DataFrame to a CSV file in S3
with fs.open(f"{bucket_name}/{csv_key}", mode="wb") as f:
    df.write_csv(f)

# Write the DataFrame to a Parquet file in S3
with fs.open(f"{bucket_name}/{parquet_key}", mode="wb") as f:
    df.write_parquet(f)
I recommend the Parquet format as it has a smaller file size, preserves dtypes and makes subsequent reads faster.
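If you want to see the size difference for yourself, s3fs can report the size of each object we just wrote. This is a minimal sketch reusing the fs, bucket_name, csv_key and parquet_key variables from the example above. Note that on a tiny DataFrame like this one the Parquet file may actually be larger because of format overhead; the savings show up on realistic data.

# Compare the size (in bytes) of the two objects we just wrote to S3
csv_size = fs.size(f"{bucket_name}/{csv_key}")
parquet_size = fs.size(f"{bucket_name}/{parquet_key}")
print(f"CSV: {csv_size} bytes, Parquet: {parquet_size} bytes")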
Reading a file from S3
We can use Polars to read the entire file back from S3 using pl.read_csv or pl.read_parquet:
df_csv = pl.read_csv(f"s3://{bucket_name}/{csv_key}")
df_parquet = pl.read_parquet(f"s3://{bucket_name}/{parquet_key}")
Internally Polars reads the remote file into a memory buffer using fsspec and then reads the buffer into a DataFrame. This is a fast approach but it does mean that the whole file is read into memory. This is fine for small files but can be slow and memory-intensive for large files.
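To make that behaviour concrete, here is a rough sketch of the same two-step approach done by hand with the s3fs file system from above: download the object into an in-memory buffer, then parse the buffer with Polars. This is an illustration of the idea rather than Polars' actual internal code.

import io

# Download the whole object from S3 into an in-memory buffer
with fs.open(f"{bucket_name}/{parquet_key}", mode="rb") as f:
    buffer = io.BytesIO(f.read())

# Parse the buffer into a DataFrame - everything is already in memory
df_parquet = pl.read_parquet(buffer)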
However, reading the whole file is wasteful when we only want to read a subset of rows. With a Parquet file we can instead scan the file on S3 and only read the rows we need.
Scanning a file on S3 with query optimisation
With a Parquet file we can scan the file on S3 and build a lazy query. The Polars query optimiser applies:
- predicate pushdown, meaning that it tries to limit the number of rows read from S3, and
- projection pushdown, meaning that it tries to limit the number of columns read from S3
We can do this with pl.scan_parquet. This may also require some cloud storage provider-specific options to be passed (see this post for more on authentication).
import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "eu-west-1",
}

df = (
    # Scan the file on S3
    pl.scan_parquet(source, storage_options=storage_options)
    # Apply a filter condition
    .filter(pl.col("id") > 100)
    # Select only the columns we need
    .select("id", "value")
    # Collect the data
    .collect()
)
In this case Polars will only read the id and value columns from the Parquet file and only the rows where the id column is greater than 100. This can be much faster and more memory efficient than reading the whole file.
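If you want to confirm that the pushdowns have been applied, you can print the optimised query plan before calling collect. This sketch reuses the source and storage_options from above; the exact wording of the plan output varies between Polars versions, but the filter and the two selected columns should appear at the scan level.

lazy_query = (
    pl.scan_parquet(source, storage_options=storage_options)
    .filter(pl.col("id") > 100)
    .select("id", "value")
)

# Show the optimised query plan without executing the query
print(lazy_query.explain())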
We can also (since version 1.0 of Polars) scan a CSV file in cloud storage:
csv_source = "s3://bucket/*.csv"

df = (
    # Scan the file on S3
    pl.scan_csv(csv_source, storage_options=storage_options)
    # Apply a filter condition
    .filter(pl.col("id") > 100)
    # Select only the columns we need
    .select("id", "value")
    # Collect the data
    .collect()
)
The limitations of CSV compared to Parquet still apply here, of course. For example, we cannot read just a subset of columns or rows from a CSV file on S3 as we can from a Parquet file, so the whole file still has to be transferred and parsed before the filter and selection take effect.
Wrap-up
In this post we have seen how to read and write files from S3 with Polars. This is a fast-developing area so I’m sure I’ll be back to update this post in the future (again!) as Polars does more of this natively.
There are more sophisticated ways to manage data on S3. For example, you could use a data lake tool like Delta Lake to manage your data on S3. These tools allow you to manage your data in a more structured way and to perform operations like upserts and deletes. See this post by Matthew Powers for an intro to using Delta Lake with Polars.
Again, I invite you to buy my Polars course if you want to learn the Polars API and how to use Polars in the real world.
Next steps
Want to know more about Polars for high performance data science? Then you can: