Polars can read and write files from S3. However, to do this Polars needs to authenticate with your AWS account. There is a crude solution where we copy our AWS access key and secret key from the .aws/credentials file directly into Polars, but this does not work with AWS Single Sign-On (SSO). In this post I show you a more general approach to handling authentication that works with SSO.
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course
How Polars reads a file from S3
Polars doesn’t handle interaction with AWS (or any other cloud provider) directly. Instead it uses the Rust object_store crate to read and write files from cloud storage. This crate provides a uniform interface to the supported cloud storage providers.
The excellent Rust implementation of the DeltaLake library also uses object_store under the hood, so the same approach works for DeltaLake.
At present object_store infers the authentication details from environment variables if they are not passed explicitly. To work with SSO we need to get hold of the temporary credentials - including the session token - that are created when we authenticate with AWS SSO, and expose them as environment variables.
The first step is to log in to your SSO account in the normal way (using aws sso login). We then use the boto3 library to access the credentials, which we set as environment variables.
import os

import boto3

# Let boto3 resolve the credentials for the current (SSO) login
session = boto3.Session()
credentials = session.get_credentials()

# Expose them as the environment variables that object_store looks for
os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
os.environ["AWS_SESSION_TOKEN"] = credentials.token
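The environment-variable step above can be wrapped in a small helper. This is a minimal sketch rather than part of Polars or boto3: the function names credentials_to_env and export_credentials are made up for illustration. It assumes the credentials object exposes access_key, secret_key and token attributes (as boto3 credentials do), and it skips the session token when there is none, because os.environ does not accept None as a value.

```python
import os
from types import SimpleNamespace


def credentials_to_env(credentials):
    """Map a credentials object to the AWS environment variable names.

    `credentials` is any object with access_key, secret_key and token
    attributes (hypothetical helper, not a Polars or boto3 function).
    """
    env = {
        "AWS_ACCESS_KEY_ID": credentials.access_key,
        "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    }
    # Long-lived keys carry no session token, and os.environ rejects None,
    # so the token is only included when it is present.
    if credentials.token:
        env["AWS_SESSION_TOKEN"] = credentials.token
    return env


def export_credentials(profile_name=None):
    """Resolve credentials with boto3 and copy them into os.environ."""
    import boto3  # imported here so credentials_to_env works without boto3

    credentials = boto3.Session(profile_name=profile_name).get_credentials()
    if credentials is None:
        raise RuntimeError("No AWS credentials found; try `aws sso login` first")
    # A frozen copy gives a consistent key/secret/token triple
    os.environ.update(credentials_to_env(credentials.get_frozen_credentials()))


# Stand-in for real credentials, with made-up values
example = SimpleNamespace(access_key="EXAMPLEKEY", secret_key="examplesecret", token=None)
print(sorted(credentials_to_env(example)))
```

After logging in, calling export_credentials() once at the start of a script should have the same effect as the snippet above. SSO credentials are temporary, so a long-running process may need to call it again after they expire.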
With these environment variables set we can then use Polars to read or - even better - scan files from S3 as normal. Note that we don’t pass the authentication details to Polars:
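import polars as pl
# bucket and parquet_key are placeholders for your own bucket name and object key
df_parquet = pl.read_parquet(f"s3://{bucket}/{parquet_key}")  # eager: loads a DataFrame
lf_parquet = pl.scan_parquet(f"s3://{bucket}/{parquet_key}")  # lazy: returns a LazyFrame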
An alternative approach that keeps the authentication details out of environment variables is to pass the credentials directly to object_store via the storage_options argument in Polars:
df_parquet = (
    pl.read_parquet(
        url,
        storage_options={
            "aws_access_key_id": credentials.access_key,
            "aws_secret_access_key": credentials.secret_key,
            "session_token": credentials.token,
            "region": "us-east-1",
        },
    )
)
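If you read from several places in the same script, the storage_options dictionary can be built once and reused. The sketch below is an illustration only: build_storage_options is a hypothetical helper name, and it assumes a credentials object with access_key, secret_key and token attributes. It uses the same dictionary keys as the call above and leaves the session token out when there is none.

```python
from types import SimpleNamespace


def build_storage_options(credentials, region):
    """Build a storage_options dict with the same keys as the call above.

    Hypothetical helper for illustration; `credentials` is any object with
    access_key, secret_key and token attributes.
    """
    options = {
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "region": region,
    }
    # Only temporary credentials (such as those from SSO) have a token
    if credentials.token:
        options["session_token"] = credentials.token
    return options


# Stand-in for real credentials, with made-up values
example = SimpleNamespace(access_key="EXAMPLEKEY", secret_key="examplesecret", token="exampletoken")
storage_options = build_storage_options(example, region="us-east-1")
print(sorted(storage_options))
```

The resulting dictionary can then be passed as storage_options=storage_options to pl.read_parquet or pl.scan_parquet, in place of the inline dictionary.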
Get in touch on social media if you have any questions or comments.
Next steps
Want to know more about Polars for high performance data science? Then you can: