This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters.
One consequence of the Apache Arrow era is that different libraries will integrate more easily.
Here, for example, we load data from a Hugging Face dataset into a Polars DataFrame as a zero-copy operation.
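from datasets import load_dataset
import polars as pl

# Load the training split as a Hugging Face Dataset (backed by an Arrow table)
dataset = load_dataset("rotten_tomatoes", split="train")
# Hand the underlying pyarrow.Table to Polars
df = pl.from_arrow(dataset.data.table)
print(df.head(3))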
shape: (3, 2)
┌───────────────────────────────────────────────────────┬───────┐
│ text                                                  ┆ label │
│ ---                                                   ┆ ---   │
│ str                                                   ┆ i64   │
╞═══════════════════════════════════════════════════════╪═══════╡
│ the rock is destined to be the 21st century's new ... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ the gorgeously elaborate continuation of " the lor... ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ effective but too-tepid biopic                        ┆ 1     │
└───────────────────────────────────────────────────────┴───────┘
Hopefully there will be an explicit to_polars() method in datasets.
I’ll be digging into this in more detail: can we exploit the memory-mapped datasets that datasets can produce with Polars' new out-of-core capabilities?
Also: please don’t call libraries datasets😂
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- join my Polars course on Udemy
- follow me on Bluesky
- follow me on Twitter
- connect with me on LinkedIn
- check out my YouTube videos
or let me know if you would like a Polars workshop for your organisation.