This post was created while writing my Up & Running with Polars course, which has a free preview of the first chapters.
Polars has a built-in tool to go on a dtype diet. Call the shrink_dtype expression and it will convert the column to the dtype that requires the least amount of memory based on the data in the column.
import polars as pl

(
    pl.DataFrame(
        {
            "a": [1, 2, 3],
            "b": [1, 2, 2 << 32],
            "c": [-1, 2, 1 << 30],
            "d": [-112, 2, 112],
            "e": [-112, 2, 129],
            "f": ["a", "b", "c"],
            "g": [0.1, 1.32, 0.12],
            "h": [True, None, False],
        }
    )
    .select(
        pl.all().shrink_dtype()
    )
)
┌─────┬────────────┬────────────┬──────┬──────┬─────┬──────┬───────┐
│ a   ┆ b          ┆ c          ┆ d    ┆ e    ┆ f   ┆ g    ┆ h     │
│ --- ┆ ---        ┆ ---        ┆ ---  ┆ ---  ┆ --- ┆ ---  ┆ ---   │
│ i8  ┆ i64        ┆ i32        ┆ i8   ┆ i16  ┆ str ┆ f32  ┆ bool  │
╞═════╪════════════╪════════════╪══════╪══════╪═════╪══════╪═══════╡
│ 1   ┆ 1          ┆ -1         ┆ -112 ┆ -112 ┆ a   ┆ 0.1  ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 2          ┆ 2          ┆ 2    ┆ 2    ┆ b   ┆ 1.32 ┆ null  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8589934592 ┆ 1073741824 ┆ 112  ┆ 129  ┆ c   ┆ 0.12 ┆ false │
└─────┴────────────┴────────────┴──────┴──────┴─────┴──────┴───────┘
Both floats and integers default to 64-bit precision. In this example from the API docs, Polars sees that column “a” fits in 8 bits, column “b” needs the full 64 bits, and column “c” fits in 32 bits.
Casting numeric columns from 64-bit to 32-bit is often the easiest win in data science. Memory usage halves, and computation time may drop by as much as half compared with 64-bit.
You do need to check that the loss of precision is OK. I had sensors accurate to 0.01, so a change of 10^-6 was 👍
In my Udemy course I show that if you cast to 8 or 16 bits, memory usage continues to fall proportionally…
…but computation time probably won’t be better than with 32 bits!
A likely reason is that many CPUs have limited native support for 8- or 16-bit arithmetic, so the values get widened or the operations emulated.
String columns with lots of repeated entries can also usefully be cast to categoricals. But that’s a story for another day.