
# Introduction
Pandas is undoubtedly a formidable and versatile library for managing and analyzing data workflows, and it is foundational to much of data science. Yet when datasets become very large, it is often not the most efficient option: it operates mainly in a single thread and relies heavily on Python's interpreter, which can lead to significant processing times.
This article shifts the focus to a newer library that speeds up Pandas-like operations: Polars. In particular, I will share with you 10 insightful Polars one-liners to streamline and speed up daily data manipulation and processing tasks.
Before starting, do not forget to import Polars first:
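```python
import polars as pl
```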
# 1. Loading CSV Files
Polars’ method to read a dataset from a CSV file looks very similar to its Pandas counterpart, except that it’s multithreaded (and internally written in Rust), allowing it to load data in a much more efficient manner. This example shows how to load a CSV file into a Polars DataFrame.
```python
df = pl.read_csv("dataset.csv")
```
Even for medium-sized datasets (not just extremely large ones), reading a file with Polars can be around five times faster than with Pandas.
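If you want to check the difference on your own machine, here is a minimal timing sketch (assuming pandas is installed and a local dataset.csv file exists):

```python
import time

import pandas as pd
import polars as pl

# Time Pandas' single-threaded CSV reader
start = time.perf_counter()
pd.read_csv("dataset.csv")
print(f"Pandas: {time.perf_counter() - start:.2f}s")

# Time Polars' multithreaded CSV reader
start = time.perf_counter()
pl.read_csv("dataset.csv")
print(f"Polars: {time.perf_counter() - start:.2f}s")
```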
# 2. Lazy Loading for More Scalable Workflows
Creating a so-called "lazy" DataFrame, rather than eagerly reading the file in one go, lets you chain subsequent operations throughout a data workflow and only execute them when the collect() method is eventually called; this is a very handy strategy for large-scale data pipelines! Here's how to load a dataset lazily using the scan_csv() method:
```python
df_lazy = pl.scan_csv("dataset.csv")
```
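At this point nothing has been read from disk yet. As a minimal illustration (City is a hypothetical column here), calling collect() is what actually executes the chained operations and returns a regular DataFrame:

```python
# Execution only happens here, not at scan_csv() time
df = df_lazy.filter(pl.col("City") == "Hatfieldshire").collect()
```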
# 3. Selecting and Renaming Relevant Columns
To make things easier and clearer in subsequent processing, it's a good idea to ensure you are only dealing with the columns of the dataset that are relevant to your data science or analysis project. Suppose you are working with a customer dataset containing columns such as Customer Id, First Name, and City. You can then use the following one-liner to select the relevant columns and rename them on the fly with alias():
```python
df = df.select([pl.col("Customer Id").alias("customer_id"), pl.col("First Name").alias("first_name"), pl.col("City")])
```
# 4. Filtering for a Subset of Rows
Of course, we can also filter for specific rows (customers, in our case) the Polars way. This one-liner filters for customers living in a specific city:
```python
df_filtered = df.filter(pl.col("City") == "Hatfieldshire")
```
You can then call head() on the result (or simply print it) to inspect the rows fulfilling the specified criteria:
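```python
print(df_filtered.head())
```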
# 5. Grouping by Category and Computing Aggregations
With operations like grouping and aggregation, the value of Polars' efficiency truly starts to show on larger datasets. Take this one-liner as an example: the key is combining group_by() on a categorical column with agg() to compute an aggregation over all rows in each group, e.g. an average of a numeric column or simply a count of rows per group, as shown below:
```python
df_city = df.group_by("City").agg([pl.len().alias("num_customers")])
```
Be careful! In Pandas the method is groupby(), with no underscore, whereas in Polars it is group_by().
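For a side-by-side feel of the two APIs, here is a tiny self-contained sketch (the sample city values are made up):

```python
import pandas as pd
import polars as pl

data = {"City": ["Hatfieldshire", "Hatfieldshire", "Lake Mark"]}

# Pandas: groupby(), no underscore
print(pd.DataFrame(data).groupby("City").size())

# Polars: group_by(), with an underscore
print(pl.DataFrame(data).group_by("City").agg(pl.len()))
```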
# 6. Creating Derived Columns (Simple Feature Engineering)
Thanks to Polars' vectorized computation capabilities, creating new columns from arithmetic operations on existing ones is significantly quicker than looping over rows. This one-liner demonstrates it (the examples from here on use the popular California housing dataset):
```python
df = df.with_columns((pl.col("total_rooms") / pl.col("households")).alias("rooms_per_household"))
```
# 7. Applying Conditional Logic
Continuous attributes like income levels can be categorized into labeled segments, all in a vectorized manner without Python-level loops. This example creates an income_category column based on the median income per district in California:
```python
df = df.with_columns(pl.when(pl.col("median_income") > 5).then(pl.lit("High")).otherwise(pl.lit("Low")).alias("income_category"))
```
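If you need more than two segments, when/then calls can be chained before the final otherwise(). A sketch with a hypothetical second cutoff at 3:

```python
df = df.with_columns(
    pl.when(pl.col("median_income") > 5).then(pl.lit("High"))
    .when(pl.col("median_income") > 3).then(pl.lit("Medium"))
    .otherwise(pl.lit("Low"))
    .alias("income_category")
)
```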
# 8. Executing a Lazy Pipeline
This one-liner, while a bit longer, puts together several of the ideas seen in previous examples into a lazy pipeline that is executed with the collect() method. Remember: for this lazy approach to work, the dataset must be read "the lazy way" with scan_csv(), as in one-liner number 2.
```python
result = (pl.scan_csv("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")
          .filter(pl.col("median_house_value") > 200000)
          .with_columns((pl.col("total_rooms") / pl.col("households")).alias("rooms_per_household"))
          .group_by("ocean_proximity").agg(pl.mean("rooms_per_household").alias("avg_rooms_per_household"))
          .sort("avg_rooms_per_household", descending=True)
          .collect())
```
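Before collecting, you can also call explain() on a lazy pipeline to print the optimized query plan, where Polars pushes filters and column projections down to the scan, without executing anything:

```python
lazy_query = (pl.scan_csv("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv")
              .filter(pl.col("median_house_value") > 200000))

# Prints the optimized plan; the query itself is not executed
print(lazy_query.explain())
```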
# 9. Joining Datasets
Let's suppose we had an additional dataset called region_stats.csv with statistical information collected for the California districts. We could then use a one-liner like this to join the two datasets on a shared categorical column:
```python
df_joined = df.join(pl.read_csv("region_stats.csv"), on="ocean_proximity", how="left")
```
The result would be an efficient combination of the housing data with district-level metadata, via Polars’ multi-threaded joins that preserve performance even across larger datasets.
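Since region_stats.csv is a hypothetical file, here is a self-contained sketch of the same join using a small in-memory DataFrame with assumed columns (df is the housing DataFrame from the previous examples):

```python
import polars as pl

# Hypothetical district-level metadata keyed by ocean_proximity
region_stats = pl.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"],
    "region_rank": [1, 2, 3],
})

df_joined = df.join(region_stats, on="ocean_proximity", how="left")
```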
# 10. Performing Rolling Computations
For highly fluctuating variables, rolling aggregates are useful for smoothing, for instance, averaging house values along the longitude axis. This one-liner illustrates how to apply such a fast, vectorized operation, perfect for temporal or geographic sequences:
```python
df = df.sort("longitude").with_columns(pl.col("median_house_value").rolling_mean(window_size=7).alias("rolling_value_avg"))
```
# Wrapping Up
In this article, we have listed 10 handy one-liners for using Polars efficiently as a fast alternative to Pandas when handling large datasets. These one-liners encapsulate fast, optimized strategies for processing large volumes of data in less time. Employ them next time you work with Polars in your projects and you will likely see clear performance improvements.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.