
Reading from S3 with Polars (or DeltaLake) using AWS SSO

Polars can read and write files from S3. However, to do this Polars needs to authenticate with your AWS account. While there is a crude solution where we copy the AWS access key and secret key from the .aws/credentials file directly into Polars, this does not work with AWS Single Sign-On (SSO). In this post I show you a more general approach to authentication that also works with SSO.

Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course

How Polars reads a file from S3

Polars doesn’t handle interaction with AWS (or any other cloud provider) directly. Instead it uses the Rust object_store crate to read and write files from cloud storage. This crate provides a single interface over the supported cloud storage providers.

The excellent Rust implementation of the DeltaLake library also uses object_store under the hood, so the same approach works for DeltaLake.

At present object_store expects to infer the authentication details from environment variables if they are not passed explicitly. To work with SSO we need to set these environment variables - including the session token - from the credentials that are created when we authenticate with AWS SSO.
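As a quick sanity check on what can be inferred from the environment, the sketch below lists which of the three standard AWS credential variables are currently unset. The helper name `missing_aws_env_vars` is hypothetical and is not part of Polars or object_store.

```python
import os
from typing import Mapping

# Standard AWS variable names; temporary (SSO) credentials also need the session token
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_aws_env_vars(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the AWS credential variables that are unset or empty (hypothetical helper)."""
    return [name for name in AWS_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    print("Missing:", missing if missing else "none")
```

If all three names are reported as missing, nothing has exported the credentials into the current process yet.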

The first step is to log in to your SSO account in the normal way (using aws sso login). We then use the boto3 library to retrieve the credentials and set them as environment variables.

import os

import boto3

# Resolve credentials from the default chain (this picks up the SSO login)
session = boto3.Session()
# Take a consistent snapshot of the (possibly refreshable) credentials
credentials = session.get_credentials().get_frozen_credentials()

# Export the credentials so that object_store can find them
os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
os.environ["AWS_SESSION_TOKEN"] = credentials.token
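The snippet above assumes that credentials are found and that they include a session token. A slightly more defensive version might look like the sketch below. The function name `export_aws_credentials` is hypothetical; it accepts any object with `access_key`, `secret_key` and `token` attributes, such as the one returned by boto3's `get_credentials()`.

```python
import os
from typing import MutableMapping, Optional


def export_aws_credentials(
    credentials: Optional[object],
    environ: MutableMapping[str, str] = os.environ,
) -> None:
    """Copy AWS credentials into environment variables (hypothetical helper)."""
    if credentials is None:
        # boto3's Session.get_credentials() returns None when nothing is found
        raise RuntimeError("No AWS credentials found; try `aws sso login` first")

    environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
    environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key

    if credentials.token:
        environ["AWS_SESSION_TOKEN"] = credentials.token
    else:
        # Long-lived keys have no token; drop any stale one left by an earlier session
        environ.pop("AWS_SESSION_TOKEN", None)
```

With boto3 installed this could be called as `export_aws_credentials(boto3.Session().get_credentials())`. Passing the target mapping explicitly also makes the helper easy to exercise without touching the real environment.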

With these environment variables set we can then use Polars to read or - even better - scan files from S3 as normal. Note that we don’t pass the authentication details to Polars.

import polars as pl

df_parquet = pl.read_parquet(f"s3://{bucket}/{parquet_key}")  # eager: loads a DataFrame
lf_parquet = pl.scan_parquet(f"s3://{bucket}/{parquet_key}")  # lazy: returns a LazyFrame

An alternative approach that keeps the authentication details out of environment variables is to pass the credentials directly to object_store via Polars:

df_parquet = pl.read_parquet(
    url,
    # These options are passed through to object_store
    storage_options={
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "session_token": credentials.token,
        "region": "us-east-1",  # replace with your bucket's region
    },
)
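If the same credentials are used for several reads, the options dictionary can be built once and reused. The sketch below uses a hypothetical helper, `build_storage_options`, with the same keys as the example above. It leaves out the session token when the credentials have none, as would be the case for long-lived access keys.

```python
from typing import Dict, Optional


def build_storage_options(credentials: object, region: Optional[str] = None) -> Dict[str, str]:
    """Build a storage_options dict from a credentials object (hypothetical helper)."""
    options = {
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
    }
    # Only temporary credentials, such as those from SSO, carry a session token
    if credentials.token:
        options["session_token"] = credentials.token
    if region:
        options["region"] = region
    return options
```

The result can then be passed to either function, for example `pl.scan_parquet(url, storage_options=build_storage_options(credentials, "us-east-1"))`.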

Get in touch on social media if you have any questions or comments.



This post is licensed under CC BY 4.0 by the author.