Rho Signal

Exploding a Polars pivot for feature engineering

In my ML pipelines these days I find I replace some of the simpler scikit-learn metrics such as root-mean-squared-error with my own hand-rolled Polars expressions. This approach saves me from copyi...

Ordering of groupby and unique in Polars

Polars (and Apache Arrow) has been designed to be careful with your data so you don’t get surprises like the following Pandas code where the ints column has been cast to float because of the missin...

Concat, extend or vstack?

On the face of it, the concat, extend and vstack functions in Polars can do the same job: they can take two initial DataFrames and turn them into a single DataFrame. In this post I show that they do ...

Filtering one df by another

One of the most common questions we get on the Polars Discord is how to filter rows in one dataframe by values in another. I think people don’t realise this is basically a join because they don’...

Embrace streaming mode in Polars

Polars can handle larger-than-memory datasets with its streaming mode. In this mode Polars processes your data in batches rather than all at once. However, the streaming mode is not some emergency ...

Lazy mode's hidden timesaver in Polars

Lazy mode in Polars does more than provide query optimisation and let you work with larger-than-memory datasets. It also provides some type security that can find errors in your pipeline before...

Polars 🤝 Seaborn

Update October 2023: As of version v0.13.0, Seaborn accepts Polars DataFrames natively 🎆. Note that this is not full native support though. Polars copies the data internally to a Pandas Data...

Nested dtypes in Polars 1: the `pl.List` dtype

Polars uses Apache Arrow to store its data in-memory. One of the big advantages of Arrow is that it supports a variety of nested data types (or “dtypes”). In this post we look at the pl.List dtype ...

Talking Polars on the Real Python podcast

I appeared on the Real Python podcast to talk Polars! We chatted about: why lazy mode in Polars is so important; working with larger-than-memory datasets; transitioning from Pandas to Polars ...

Sinking larger-than-memory Parquet files

Polars now allows you to write Parquet files even when the file is too large to fit in memory. It does this by using streaming to process data in batches and then writing these batches to a Parquet...