In my ML pipelines these days I often replace some of the simpler scikit-learn metrics, such as root-mean-squared error, with my own hand-rolled Polars expressions. This saves me from copying data into a different format and lets me keep the usual advantages of Polars such as parallelisation, query optimisation and scaling to large datasets.
Recently I was adding a new section on pivoting data to my Up & Running with Polars course (check it out here) when it struck me that scikit-learn's CountVectorizer is based on a pivot. I decided to see how much effort it would take to re-implement it in Polars.
For anyone not familiar with CountVectorizer: it is a feature engineering technique that produces a 2D array in which each column corresponds to a word and each row corresponds to a document. By default each cell holds the number of times that word occurs in that document; the binary variant built in this post instead stores 1 if the word is present and 0 otherwise. See below for an example of the output.
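As a rough illustration of that present-or-absent layout, here is a minimal plain-Python sketch on two made-up documents. It is not how scikit-learn or Polars does the work internally; it only shows the shape of the result.

```python
# Two made-up documents; splitting on spaces mirrors the simple tokenizer used later in the post.
docs = ["the cat sat", "the dog sat on the cat"]

tokens = [doc.lower().split(" ") for doc in docs]

# The vocabulary is every distinct word across all documents, sorted.
vocabulary = sorted({word for doc in tokens for word in doc})

# One row per document, one column per vocabulary word: 1 if present, else 0.
matrix = [[int(word in set(doc)) for word in vocabulary] for doc in tokens]

print(vocabulary)
for row in matrix:
    print(row)
```

The first document has no "dog" or "on", so those two cells are 0 in its row, while the second document contains every word in the vocabulary.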
Getting some fake data
I needed some fake text data for this exercise so I asked ChatGPT to generate a small dataset of fake news articles along with publication name and title. It delivered a truly fake dataset with articles from The Daily Deception and the Faux News Network:
import polars as pl

# Five fake articles, each with a publication name, a title and the article text
fake_news_df = pl.DataFrame(
    {
        'publication': [
            'The Daily Deception', 'Faux News Network', 'The Fabricator', 'The Misleader', 'The Hoax Herald',
        ],
        'title': [
            'Scientists Discover New Species of Flying Elephant',
            'Aliens Land on Earth and Offer to Solve All Our Problems',
            'Study Shows That Eating Pizza Every Day Leads to Longer Life',
            'New Study Finds That Smoking is Good for You',
            "World's Largest Iceberg Discovered in Florida",
        ],
        'text': [
            'In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.',
            'In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.',
            'A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn\'t. The study has been hailed as a breakthrough in the field of nutrition.',
            'In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn\'t. The findings have sparked controversy among health experts.',
            'In a bizarre turn of events, the world\'s largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have',
        ],
    }
)
Split, explode and pivot
The first thing we need to do is convert the text to lowercase and split each article into separate words. We do this with expressions from the str namespace. We also add a column called placeholder with a value of 1. These are the 1’s that will later populate our feature matrix.
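(
    fake_news_df
    .with_columns(
        # Lowercase the text, then split it on spaces into a list of words
        pl.col("text").str.to_lowercase().str.split(" "),
        # Constant 1 that will later fill the feature matrix
        pl.lit(1).alias("placeholder")
    )
)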
shape: (5, 4)
┌─────────────────────┬───────────────────────────────┬──────────────────────────────┬─────────────┐
│ publication ┆ title ┆ text ┆ placeholder │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[str] ┆ i32 │
╞═════════════════════╪═══════════════════════════════╪══════════════════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New ┆ ["in", "a", … "zoology."] ┆ 1 │
│ ┆ Species … ┆ ┆ │
│ Faux News Network ┆ Aliens Land on Earth and ┆ ["in", "a", … "out."] ┆ 1 │
│ ┆ Offer t… ┆ ┆ │
│ The Fabricator ┆ Study Shows That Eating Pizza ┆ ["a", "new", … "nutrition."] ┆ 1 │
│ ┆ Ev… ┆ ┆ │
│ The Misleader ┆ New Study Finds That Smoking ┆ ["in", "a", … "experts."] ┆ 1 │
│ ┆ is … ┆ ┆ │
│ The Hoax Herald ┆ World's Largest Iceberg ┆ ["in", "a", … "have"] ┆ 1 │
│ ┆ Discover… ┆ ┆ │
└─────────────────────┴───────────────────────────────┴──────────────────────────────┴─────────────┘
By splitting the string values we turn the string column into a column with a Polars pl.List(str) dtype. In a previous post I showed how a pl.List type allows fast operations because each row is a Polars Series under the hood rather than a slow Python list.
However, it would still be better to stretch out that pl.List column to have a row for each element of each list. At the same time, we want to keep the metadata of the original article such as the publication name and title.
We do this stretching out by calling explode on the text column to give us a row for each element of each list:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    # One row per word; publication, title and placeholder are repeated
    .explode("text")
)
shape: (306, 4)
┌─────────────────────┬───────────────────────────────────┬────────────────┬─────────────┐
│ publication ┆ title ┆ text ┆ placeholder │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i32 │
╞═════════════════════╪═══════════════════════════════════╪════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New Species … ┆ in ┆ 1 │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ a ┆ 1 │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ groundbreaking ┆ 1 │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ discovery, ┆ 1 │
│ … ┆ … ┆ … ┆ … │
│ The Hoax Herald ┆ World's Largest Iceberg Discover… ┆ this ┆ 1 │
│ The Hoax Herald ┆ World's Largest Iceberg Discover… ┆ size ┆ 1 │
│ The Hoax Herald ┆ World's Largest Iceberg Discover… ┆ could ┆ 1 │
│ The Hoax Herald ┆ World's Largest Iceberg Discover… ┆ have ┆ 1 │
└─────────────────────┴───────────────────────────────────┴────────────────┴─────────────┘
Note that the explode method can be used with the streaming engine in Polars so you can use it on larger-than-memory datasets.
Now it’s time to transform the text column so we have a column for each distinct word and a row for each article. We do this by calling pivot with the metadata columns (publication and title) as the index for each row, the text column to set the column names and the placeholder values as the cell values.
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    # One row per article, one column per distinct word
    .pivot(
        index=["publication","title"],
        columns="text",
        values="placeholder",
        sort_columns=True
    )
)
shape: (5, 166)
┌─────────────────────┬────────────────────┬────────┬──────┬───┬─────────┬───────┬──────┬──────────┐
│ publication ┆ title ┆ 10,000 ┆ 100 ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 ┆ ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════════════════════╪════════════════════╪════════╪══════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists ┆ null ┆ 1 ┆ … ┆ null ┆ null ┆ null ┆ 1 │
│ ┆ Discover New ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ Species … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ Faux News Network ┆ Aliens Land on ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ ┆ Earth and Offer t… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ The Fabricator ┆ Study Shows That ┆ 1 ┆ null ┆ … ┆ null ┆ 1 ┆ null ┆ null │
│ ┆ Eating Pizza Ev… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ The Misleader ┆ New Study Finds ┆ null ┆ null ┆ … ┆ null ┆ null ┆ 1 ┆ null │
│ ┆ That Smoking is … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ The Hoax Herald ┆ World's Largest ┆ null ┆ 1 ┆ … ┆ 1 ┆ null ┆ null ┆ null │
│ ┆ Iceberg Discover… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└─────────────────────┴────────────────────┴────────┴──────┴───┴─────────┴───────┴──────┴──────────┘
Note that we use the sort_columns argument to get the text columns in lexicographical order.
The last stage is to replace the null values with 0’s, so that a word missing from an article is recorded explicitly as 0:
(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
    .pivot(
        index=["publication","title"],
        columns="text",
        values="placeholder",
        sort_columns=True
    )
    # Words that do not occur in an article become 0 instead of null
    .fill_null(value=0)
)
shape: (5, 166)
┌─────────────────────┬─────────────────────┬────────┬─────┬───┬─────────┬───────┬──────┬──────────┐
│ publication ┆ title ┆ 10,000 ┆ 100 ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 ┆ ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════════════════════╪═════════════════════╪════════╪═════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists Discover ┆ 0 ┆ 1 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 1 │
│ ┆ New Species … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ Faux News Network ┆ Aliens Land on ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
│ ┆ Earth and Offer t… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ The Fabricator ┆ Study Shows That ┆ 1 ┆ 0 ┆ … ┆ 0 ┆ 1 ┆ 0 ┆ 0 │
│ ┆ Eating Pizza Ev… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ The Misleader ┆ New Study Finds ┆ 0 ┆ 0 ┆ … ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
│ ┆ That Smoking is … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ The Hoax Herald ┆ World's Largest ┆ 0 ┆ 1 ┆ … ┆ 1 ┆ 0 ┆ 0 ┆ 0 │
│ ┆ Iceberg Discover… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└─────────────────────┴─────────────────────┴────────┴─────┴───┴─────────┴───────┴──────┴──────────┘
Of course there are still differences from the output of CountVectorizer. For example, CountVectorizer returns a sparse matrix by default. In addition, CountVectorizer uses a more sophisticated regex to separate the words, but we can re-implement this by using str.extract_all instead of str.split:
(
    fake_news_df
    .with_columns(
        # Extract every run of two or more word characters as a token
        pl.col("text").str.to_lowercase().str.extract_all('(?u)\\b\\w\\w+\\b'),
        pl.lit(1).alias("placeholder")
    )
)
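To see what the switch changes, here is a small sketch using Python's built-in re module rather than Polars. It applies the same pattern, which is CountVectorizer's default token_pattern, to a made-up sentence and compares the result with a plain whitespace split. Treat it as an illustration of the pattern only, since Polars evaluates regexes with its own engine.

```python
import re

# Same pattern as above: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Made-up sentence in the style of the fake articles.
sentence = "in a surprising twist, smoking is good for you."

split_tokens = sentence.split(" ")
regex_tokens = re.findall(TOKEN_PATTERN, sentence)

print(split_tokens)
print(regex_tokens)
```

The whitespace split keeps punctuation attached ("twist," and "you.") and keeps the one-letter word "a", whereas the pattern strips the punctuation and drops single-character tokens. That is why columns such as "you." and "zoology." appear in the pivoted output above.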
So here we’ve seen how we can quickly implement a classic NLP feature engineering method using Polars. I’m sure we’ll see many more examples of Polars as an all-purpose data workhorse in the years to come.
Want to accelerate your analysis with Polars? Join over 3,000 learners on my highly-rated Up & Running with Polars course.
Next steps
Want to know more about Polars for high performance data science? Then you can: