Published on: 11th October 2022. Updated: 1st November 2023
Can you use Polars to fit ML models without Numpy?
The data in a Polars DataFrame is stored in an Apache Arrow table rather than a Numpy array. In the early days this meant that we had to convert the data to a Numpy array manually before we could use it in machine learning libraries.
However, this is no longer the case. In this post we see that we can fit XGBoost and some scikit-learn models directly from a Polars DataFrame. The journey isn't fully over though: the library is still likely to copy the data internally into its own preferred format.
This post was created while writing my Up & Running with Polars course. Check it out here with a free preview of the first chapters
Let’s create a Polars DataFrame with some random data and see if we can fit an XGBoost model directly from it.
import polars as pl
import xgboost as xgb

# Set the number of rows in the DF
N = 100
# Create the DF with 2 features and a label
df = pl.DataFrame(
    {
        # Use pl.arange to create a sequence of integers
        "feat1": pl.arange(0, N, eager=True),
        # Shuffle the sequence for the second feature
        "feat2": pl.arange(0, N, eager=True).shuffle(),
        # Create a label with 0s and 1s
        "label": [0] * (N // 2) + [1] * (N // 2),
    }
)
model = xgb.XGBClassifier(objective="binary:logistic")
# Select the features and the label as Polars DataFrames
X = df.select("feat1", "feat2")
y = df.select("label")
# Fit the model
model.fit(X, y)
# Add the prediction probabilities to the DF
df = pl.concat(
    [
        df,
        pl.DataFrame({"pos": model.predict_proba(X)[:, 1]}),
    ],
    how="horizontal",
)
This all works and we can let XGBoost handle any data conversions.
Now let’s try with a logistic regression model from scikit-learn.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
# Fit directly on the Polars DataFrame
model.fit(df.select("feat1", "feat2"), df.select("label"))
# Predict the labels for the same features
model.predict(df.select("feat1", "feat2"))
Again, this just works. Note that scikit-learn currently does an internal copy from Polars to Numpy, but with this support we’re a step closer to full native support for Arrow data. Not all scikit-learn models and processes currently support Polars, but this is still a great step forward.
Learn more
Want to know more about Polars for high performance data science and ML? Then you can:
- check out my Polars course on Udemy
- follow me on Bluesky
- follow me on Twitter
- connect with me on LinkedIn
- check out my YouTube videos
or let me know if you would like a Polars workshop for your organisation.