Python Polars: A Lightning-Fast DataFrame Library

by Harrison Hoffman · Aug 16, 2023 · intermediate, data-science

In the world of data analysis and manipulation, Python has long been the go-to language. With extensive and user-friendly libraries like NumPy, pandas, PySpark, and Dask, there’s a solution available for almost any data-driven task. Among these libraries, one name that’s been generating a significant amount of buzz lately is Polars.

Polars is a high-performance DataFrame library, designed to provide fast and efficient data processing capabilities. Inspired by the reigning pandas library, Polars takes things to another level, offering a seamless experience for working with large datasets that might not fit into memory.

In this tutorial, you’ll learn:

  • Why Polars is so performant and attention-grabbing
  • How to work with DataFrames, expressions, and contexts
  • What the lazy API is and how to use it
  • How to integrate Polars with external data sources and the broader Python ecosystem

After reading, you’ll be equipped with the knowledge and resources necessary to get started using Polars for your own data tasks. Before reading, you’ll benefit from having a basic knowledge of Python and experience working with tabular datasets. You should also be comfortable with DataFrames from any of the popular DataFrame libraries.

The Python Polars Library

Polars has caught a lot of attention in a short amount of time, and for good reason. In this first section, you’ll get an overview of Polars and a preview of the library’s powerful features. You’ll also learn how to install Polars along with any dependencies that you might need for your data processing task.

Getting to Know Polars

Polars combines the flexibility and user-friendliness of Python with the speed and scalability of Rust, making it a compelling choice for a wide range of data processing tasks. So, what makes Polars stand out among the crowd? There are many reasons, one of the most prominent being that Polars is lightning fast.

The core of Polars is written in Rust, a language that operates at a low level with no external dependencies. Rust is memory-efficient and gives you performance on par with C or C++, making it a great language to underpin a data analysis library. Polars also ensures that you can utilize all available CPU cores in parallel, and it supports large datasets without requiring all data to be in memory.

Another standout feature of Polars is its intuitive API. If you’re already familiar with libraries like pandas, then you’ll feel right at home with Polars. The library provides a familiar yet unique interface, making it easy to transition to Polars. This means you can leverage your existing knowledge and codebase while taking advantage of Polars’ performance gains.

Polars’ query engine leverages Apache Arrow to execute vectorized queries. Exploiting the power of columnar data storage, Apache Arrow is a development platform designed for fast in-memory processing. This is yet another rich feature that gives Polars an outstanding performance boost.

These are just a few key details that make Polars an attractive data processing library, and you’ll get to see these in action throughout this tutorial. Up next, you’ll get an overview of how to install Polars.

Installing Python Polars

Before installing Polars, make sure you have Python and pip installed on your system. Polars supports Python versions 3.7 and above. To check your Python version, open a terminal or command prompt and run the following command:

Shell
$ python --version

If you have Python installed, then you’ll see the version number displayed below the command. If you don’t have Python 3.7 or above installed, then you’ll need to install a newer version before continuing.

Polars is available on PyPI, and you can install it with pip. Open a terminal or command prompt, create a new virtual environment, and then run the following command to install Polars:

Shell
(venv) $ python -m pip install polars

This command will install the latest version of Polars from PyPI onto your machine. To verify that the installation was successful, start a Python REPL and import Polars:

Python
>>> import polars as pl

If the import runs without error, then you’ve successfully installed Polars. You now have the core of Polars installed on your system. This is a lightweight installation of Polars that allows you to get started without extra dependencies.

Polars has other rich features that allow you to interact with the broader Python ecosystem and external data sources. To use these features, you need to install Polars with the feature flags that you’re interested in. For example, if you want to convert Polars DataFrames to pandas DataFrames and NumPy arrays, then run the following command when installing Polars:

Shell
(venv) $ python -m pip install "polars[numpy, pandas]"

This command installs the Polars core and the functionality that you need to convert Polars DataFrames to pandas and NumPy objects. You can find the list of optional dependencies that you can install with Polars in the documentation. Alternatively, you can run the following command to install Polars with all the optional dependencies:

Shell
(venv) $ python -m pip install "polars[all]"

This is the best way to go if you feel like you’ll utilize a wide range of Polars features. Otherwise, if you’d like to keep your environment as lightweight as possible, you should only install the optional dependencies that you need.

With Polars installed, you’re now ready to dive in. In the next section, you’ll get an overview of Polars’ core functionalities with DataFrames, expressions, and contexts. You’ll get a feel for Polars syntax and start to see why the library is so powerful.

DataFrames, Expressions, and Contexts

Now that you’ve installed Polars and have a high-level understanding of why it’s so performant, it’s time to dive into some core concepts. In this section, you’ll explore DataFrames, expressions, and contexts with examples. You’ll get a first impression of Polars syntax. If you know other DataFrame libraries, then you’ll notice some similarities but also some differences.

Getting Started With Polars DataFrames

Like most other data processing libraries, the core data structure used in Polars is the DataFrame. A DataFrame is a two-dimensional data structure composed of rows and columns. The columns of a DataFrame are made up of series, which are one-dimensional labeled arrays.
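
As a quick illustration of the series building block, you can construct one on its own. This is a minimal sketch with made-up square-footage values:

Python
>>> import polars as pl

>>> sqft_series = pl.Series("sqft", [707.5, 1025.2, 568.5])
>>> sqft_series.dtype
Float64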

You can create a Polars DataFrame in a few lines of code. In the following example, you’ll create a Polars DataFrame from a dictionary of randomly generated data representing information about houses. Be sure you have NumPy installed before running this example:

Python
>>> import numpy as np
>>> import polars as pl

>>> num_rows = 5000
>>> rng = np.random.default_rng(seed=7)

>>> buildings_data = {
...      "sqft": rng.exponential(scale=1000, size=num_rows),
...      "year": rng.integers(low=1995, high=2023, size=num_rows),
...      "building_type": rng.choice(["A", "B", "C"], size=num_rows),
...  }
>>> buildings = pl.DataFrame(buildings_data)
>>> buildings
shape: (5_000, 3)
┌─────────────┬──────┬───────────────┐
│ sqft        ┆ year ┆ building_type │
│ ---         ┆ ---  ┆ ---           │
│ f64         ┆ i64  ┆ str           │
╞═════════════╪══════╪═══════════════╡
│ 707.529256  ┆ 1996 ┆ C             │
│ 1025.203348 ┆ 2020 ┆ C             │
│ 568.548657  ┆ 2012 ┆ A             │
│ 895.109864  ┆ 2000 ┆ A             │
│ …           ┆ …    ┆ …             │
│ 408.872783  ┆ 2009 ┆ C             │
│ 57.562059   ┆ 2019 ┆ C             │
│ 3728.088949 ┆ 2020 ┆ C             │
│ 686.678345  ┆ 2011 ┆ C             │
└─────────────┴──────┴───────────────┘

In this example, you first import numpy and polars with aliases of np and pl, respectively. Next, you define num_rows, which determines how many rows will be in the randomly generated data. To generate random numbers, you call default_rng() from NumPy’s random module. This returns a generator that can produce a variety of random numbers according to different probability distributions.

You then define a dictionary with the entries sqft, year, and building_type, which are randomly generated arrays of length num_rows. The sqft array contains floats, year contains integers, and the building_type array contains strings. These will become the three columns of a Polars DataFrame.

To create the Polars DataFrame, you call pl.DataFrame(). The class constructor for a Polars DataFrame accepts two-dimensional data in various forms, a dictionary in this example. You now have a Polars DataFrame that’s ready to use!

When you display buildings in the console, a nice string representation of the DataFrame is displayed. The string representation first prints the shape of the data as a tuple with the first entry telling you the number of rows and the second the number of columns in the DataFrame.

You then see a tabular preview of the data that shows the column names and their data types. For instance, sqft has type f64, year has type i64, and building_type has type str. Polars supports a variety of data types that are primarily based on the implementation from Arrow.

Polars DataFrames are equipped with many useful methods and attributes for exploring the underlying data. If you’re already familiar with pandas, then you’ll notice that Polars DataFrames use mostly the same naming conventions. You can see some of these methods and attributes in action on the DataFrame that you created in the previous example:

Python
>>> buildings.schema
{'sqft': Float64, 'year': Int64, 'building_type': Utf8}

>>> buildings.head()
shape: (5, 3)
┌─────────────┬──────┬───────────────┐
│ sqft        ┆ year ┆ building_type │
│ ---         ┆ ---  ┆ ---           │
│ f64         ┆ i64  ┆ str           │
╞═════════════╪══════╪═══════════════╡
│ 707.529256  ┆ 1996 ┆ C             │
│ 1025.203348 ┆ 2020 ┆ C             │
│ 568.548657  ┆ 2012 ┆ A             │
│ 895.109864  ┆ 2000 ┆ A             │
│ 206.532754  ┆ 2011 ┆ A             │
└─────────────┴──────┴───────────────┘

>>> buildings.describe()
shape: (9, 4)
┌────────────┬─────────────┬───────────┬───────────────┐
│ describe   ┆ sqft        ┆ year      ┆ building_type │
│ ---        ┆ ---         ┆ ---       ┆ ---           │
│ str        ┆ f64         ┆ f64       ┆ str           │
╞════════════╪═════════════╪═══════════╪═══════════════╡
│ count      ┆ 5000.0      ┆ 5000.0    ┆ 5000          │
│ null_count ┆ 0.0         ┆ 0.0       ┆ 0             │
│ mean       ┆ 994.094456  ┆ 2008.5258 ┆ null          │
│ std        ┆ 1016.641569 ┆ 8.062353  ┆ null          │
│ min        ┆ 1.133256    ┆ 1995.0    ┆ A             │
│ max        ┆ 9307.793917 ┆ 2022.0    ┆ C             │
│ median     ┆ 669.370932  ┆ 2009.0    ┆ null          │
│ 25%        ┆ 286.807549  ┆ 2001.0    ┆ null          │
│ 75%        ┆ 1343.539279 ┆ 2015.0    ┆ null          │
└────────────┴─────────────┴───────────┴───────────────┘

You first look at the schema of the DataFrame with buildings.schema. Polars schemas are dictionaries that tell you the data type of each column in the DataFrame, and they’re necessary for the lazy API that you’ll explore later.

Next, you get a preview of the first five rows of the DataFrame with buildings.head(). You can pass any integer into .head(), depending on how many of the top rows you want to see, and the default number of rows is five. Polars DataFrames also have a .tail() method that allows you to view the bottom rows.
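
For example, you can pass 3 into .tail() to see only the last three rows of buildings. They match the bottom of the preview that you saw when you first displayed the DataFrame:

Python
>>> buildings.tail(3)
shape: (3, 3)
┌─────────────┬──────┬───────────────┐
│ sqft        ┆ year ┆ building_type │
│ ---         ┆ ---  ┆ ---           │
│ f64         ┆ i64  ┆ str           │
╞═════════════╪══════╪═══════════════╡
│ 57.562059   ┆ 2019 ┆ C             │
│ 3728.088949 ┆ 2020 ┆ C             │
│ 686.678345  ┆ 2011 ┆ C             │
└─────────────┴──────┴───────────────┘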

Lastly, you call buildings.describe() to get summary statistics for each column in the DataFrame. This is one of the best ways to get a quick feel for the nature of the dataset that you’re working with. Here’s what each row returned from .describe() means:

  • count is the number of observations or rows in the dataset.
  • null_count is the number of missing values in the column.
  • mean is the arithmetic mean, or average, of the column.
  • std is the standard deviation of the column.
  • min is the minimum value of the column.
  • max is the maximum value of the column.
  • median is the median value, or fiftieth percentile, of the column.
  • 25% is the twenty-fifth percentile, or first quartile, of the column.
  • 75% is the seventy-fifth percentile, or third quartile, of the column.

As an example interpretation, the mean year in the data is between 2008 and 2009, with a standard deviation of just above eight years. The building_type column is missing most of the summary statistics because it consists of categorical values represented by strings.

Now that you’ve seen the basics of creating and interacting with Polars DataFrames, you can start trying more sophisticated queries and get a feel for the library’s power. To do this, you’ll need to understand contexts and expressions, which are the topics of the next section.

Polars Contexts and Expressions

Contexts and expressions are the core components of Polars’ unique data transformation syntax. Expressions refer to computations or transformations that are performed on data columns, and they allow you to apply various operations on the data to derive new results. Expressions include mathematical operations, aggregations, comparisons, string manipulations, and more.

A context refers to the specific environment or situation in which an expression is evaluated. In other words, a context is the fundamental action that you want to perform on your data. Polars has three main contexts:

  • Selection: Selecting columns from a DataFrame
  • Filtering: Reducing the DataFrame size by extracting rows that meet specified conditions
  • Groupby/aggregation: Computing summary statistics within subgroups of the data

You can think of contexts as verbs and expressions as nouns. Contexts determine how the expressions are evaluated and executed, just as verbs determine the actions performed by nouns in language. To get started working with expressions and contexts, you’ll work with the same randomly generated data as before. Here’s the code to create the buildings DataFrame again:

Python
>>> import numpy as np
>>> import polars as pl

>>> num_rows = 5000
>>> rng = np.random.default_rng(seed=7)

>>> buildings_data = {
...      "sqft": rng.exponential(scale=1000, size=num_rows),
...      "year": rng.integers(low=1995, high=2023, size=num_rows),
...      "building_type": rng.choice(["A", "B", "C"], size=num_rows),
...  }
>>> buildings = pl.DataFrame(buildings_data)

With the buildings DataFrame created, you’re ready to get started using expressions and contexts. Within Polars’ three main contexts, there are many different types of expressions, and you can pipe multiple expressions together to run arbitrarily complex queries. To better understand these ideas, take a look at an example of the select context:

Python
>>> buildings.select("sqft")
shape: (5_000, 1)
┌─────────────┐
│ sqft        │
│ ---         │
│ f64         │
╞═════════════╡
│ 707.529256  │
│ 1025.203348 │
│ 568.548657  │
│ 895.109864  │
│ …           │
│ 408.872783  │
│ 57.562059   │
│ 3728.088949 │
│ 686.678345  │
└─────────────┘

>>> buildings.select(pl.col("sqft"))
shape: (5_000, 1)
┌─────────────┐
│ sqft        │
│ ---         │
│ f64         │
╞═════════════╡
│ 707.529256  │
│ 1025.203348 │
│ 568.548657  │
│ 895.109864  │
│ …           │
│ 408.872783  │
│ 57.562059   │
│ 3728.088949 │
│ 686.678345  │
└─────────────┘

With the same randomly generated data as before, you see two ways of using the select context to extract the sqft column from the DataFrame. The first call, buildings.select("sqft"), extracts the column directly by its name.

The second call, buildings.select(pl.col("sqft")), accomplishes the same task in a more powerful way. In this case, pl.col("sqft") is an expression that’s passed into the .select() context, and you can keep manipulating that expression.

Because pl.col() returns an expression, you can pipe as many expressions onto the column as you want, which allows you to carry out several operations. For instance, if you want to sort the sqft column and then divide all of the values by 1000, then you could do the following:

Python
>>> buildings.select(pl.col("sqft").sort() / 1000)
shape: (5_000, 1)
┌──────────┐
│ sqft     │
│ ---      │
│ f64      │
╞══════════╡
│ 0.001133 │
│ 0.001152 │
│ 0.001429 │
│ 0.001439 │
│ …        │
│ 7.247539 │
│ 7.629569 │
│ 8.313942 │
│ 9.307794 │
└──────────┘

As you can see, this select context returns the sqft column sorted and scaled down by 1000. One context that you’ll often use prior to .select() is .filter(). As the name suggests, .filter() reduces the size of the data based on a given expression. For example, if you want to filter the data down to houses that were built after 2015, you could run the following:

Python
>>> after_2015 = buildings.filter(pl.col("year") > 2015)
>>> after_2015.shape
(1230, 3)

>>> after_2015.select(pl.col("year").min())
shape: (1, 1)
┌──────┐
│ year │
│ ---  │
│ i64  │
╞══════╡
│ 2016 │
└──────┘

By passing the expression pl.col("year") > 2015 into .filter(), you get back a DataFrame that only contains houses that were built after 2015. You can see this because after_2015 only has 1230 of the 5000 original rows, and the minimum year in after_2015 is 2016.

Another commonly used context in Polars, and data analysis more broadly, is the groupby context, also known as aggregation. This is useful for computing summary statistics within subgroups of your data. In the building data example, suppose you want to know the average square footage, median building year, and number of buildings for each building type. The following query accomplishes this task:

Python
>>> buildings.groupby("building_type").agg(
...      [
...          pl.mean("sqft").alias("mean_sqft"),
...          pl.median("year").alias("median_year"),
...          pl.count(),
...      ]
...  )
shape: (3, 4)
┌───────────────┬────────────┬─────────────┬───────┐
│ building_type ┆ mean_sqft  ┆ median_year ┆ count │
│ ---           ┆ ---        ┆ ---         ┆ ---   │
│ str           ┆ f64        ┆ f64         ┆ u32   │
╞═══════════════╪════════════╪═════════════╪═══════╡
│ C             ┆ 999.854722 ┆ 2009.0      ┆ 1692  │
│ A             ┆ 989.539918 ┆ 2009.0      ┆ 1653  │
│ B             ┆ 992.754444 ┆ 2009.0      ┆ 1655  │
└───────────────┴────────────┴─────────────┴───────┘

In this example, you first call buildings.groupby("building_type"), which creates a Polars GroupBy object. The GroupBy object has an aggregation method, .agg(), which accepts a list of expressions that are computed for each group. For instance, pl.mean("sqft") calculates the average square footage for each building type, and pl.count() returns the number of buildings of each building type. You use .alias() to name the aggregated columns.

While it’s not apparent with the high-level Python API, all Polars expressions are optimized and run in parallel under the hood. This means that Polars expressions don’t always run in the order you specify, and they don’t necessarily run on a single core. Instead, Polars optimizes the order in which expressions are evaluated in a query, and the work is spread across available cores. You’ll see examples of optimized queries later in this tutorial.
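
As a small illustration, when you pass several independent expressions into a single context, Polars is free to evaluate them in parallel. The results of this sketch follow from the summary statistics that you computed earlier:

Python
>>> buildings.select(
...     [
...         pl.col("sqft").mean().alias("mean_sqft"),
...         pl.col("year").max().alias("max_year"),
...         pl.col("building_type").n_unique().alias("num_types"),
...     ]
... )
shape: (1, 3)
┌────────────┬──────────┬───────────┐
│ mean_sqft  ┆ max_year ┆ num_types │
│ ---        ┆ ---      ┆ ---       │
│ f64        ┆ i64      ┆ u32       │
╞════════════╪══════════╪═══════════╡
│ 994.094456 ┆ 2022     ┆ 3         │
└────────────┴──────────┴───────────┘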

Now that you have an understanding of Polars contexts and expressions, as well as insight into why expressions are evaluated so quickly, you’re ready to take a deeper dive into another powerful Polars feature, the lazy API. With the lazy API, you’ll see how Polars is able to evaluate sophisticated expressions on large datasets while keeping memory efficiency in mind.

The Lazy API

Polars’ lazy API is one of the most powerful features of the library. With the lazy API, you can specify a sequence of operations without immediately running them. Instead, these operations are saved as a computational graph and only run when necessary. This allows Polars to optimize queries before execution, catch schema errors before the data is processed, and perform memory-efficient queries on datasets that don’t fit into memory.

Working With LazyFrames

The core object within the lazy API is the LazyFrame, and you can create LazyFrames in a few different ways. To get started with LazyFrames and the lazy API, take a look at this example:

Python
>>> import numpy as np
>>> import polars as pl

>>> num_rows = 5000
>>> rng = np.random.default_rng(seed=7)

>>> buildings = {
...      "sqft": rng.exponential(scale=1000, size=num_rows),
...      "price": rng.exponential(scale=100_000, size=num_rows),
...      "year": rng.integers(low=1995, high=2023, size=num_rows),
...      "building_type": rng.choice(["A", "B", "C"], size=num_rows),
...   }
>>> buildings_lazy = pl.LazyFrame(buildings)
>>> buildings_lazy
<polars.LazyFrame object at 0x106D19950>

You first create another toy dataset similar to the one that you worked with earlier, but this example includes a column named price. You then call pl.LazyFrame() to create a LazyFrame from buildings. Alternatively, you can convert an existing DataFrame to a LazyFrame with .lazy(). To see how the lazy API works, you can create the following query:

Python
>>> lazy_query = (
...     buildings_lazy
...     .with_columns(
...         (pl.col("price") / pl.col("sqft")).alias("price_per_sqft")
...     )
...     .filter(pl.col("price_per_sqft") > 100)
...     .filter(pl.col("year") < 2010)
...  )
>>> lazy_query
<polars.LazyFrame object at 0x10B6AF290>

In this query, you compute the price per square foot of each building and assign it the name price_per_sqft. You then filter the data on all buildings with a price_per_sqft greater than 100 and year less than 2010. You may have noticed that the lazy query returns another LazyFrame, rather than actually executing the query. This is the idea behind the lazy API. It only executes queries when you explicitly call them.

Before you execute the query, you can inspect what’s known as the query plan. The query plan tells you the sequence of steps that the query will trigger. To get a nice visual of the LazyFrame query plan, you can run the following code:

Python
>>> lazy_query.show_graph()

The LazyFrame method .show_graph() renders an image representation of the query plan. To see this image, be sure you have Matplotlib installed. If you’re working in a Jupyter Notebook, then the image should render in the output cell. Otherwise, a separate window should pop up with an image similar to this:

[Image: A Polars lazy query plan, as rendered by .show_graph()]

You read query plan graphs from bottom to top in Polars, and each box corresponds to a stage in the query plan. Sigma (σ) and pi (π) are symbols from relational algebra, and they tell you the operation that you’re performing on the data.

In this example, π */4 says that you’re working with all four columns of the DataFrame, and σ(col("year")) < 2010 tells you that you’re only processing rows with a year less than 2010. You can interpret the full query plan with these steps:

  1. Use the four columns of buildings_lazy, and filter buildings_lazy to rows where year is less than 2010.
  2. Create the price_per_sqft column.
  3. Filter buildings_lazy to all rows where price_per_sqft is greater than 100.

One important note is that Polars filters buildings_lazy on year before executing any other part of the query, even though this is the last filter that you specified in the code. This is known as predicate pushdown, a Polars optimization that makes queries more memory efficient by applying filters as early as possible, reducing the data size before further processing.

You might have noticed that the query plan graph cuts off important details that don’t fit into a box. This is common, especially as your queries become more complex. If you need to see the full representation of your query plan, then you can use the .explain() method:

Python
>>> print(lazy_query.explain())
FILTER [(col("price_per_sqft")) > (100.0)] FROM WITH_COLUMNS:
[[(col("price")) / (col("sqft"))].alias("price_per_sqft")]
DF ["sqft", "price", "year", "building_type"];
⮑ PROJECT */4 COLUMNS; SELECTION: "[(col(\"year\")) < (2010)]"

When you print the output of .explain() on a LazyFrame, you get a string representation of the query plan in return. As with the graphical query plan, you read the string query plan from bottom to top, and each stage is on its own line.

If you call lazy_query.explain() without print(), then you’ll see the string representation of the query plan. As usual, newlines show up as \n in strings, so the plan will be harder to read. Again, you should use .explain() when you need a more verbose explanation of the query plan that’s not contained in the graphical representation.

With an understanding of what your lazy query is set to do, you’re ready to actually execute it. To do this, you call .collect() on your lazy query to evaluate it according to the query plan. Here’s what this looks like in action:

Python
>>> lazy_query = (
...     buildings_lazy
...     .with_columns(
...         (pl.col("price") / pl.col("sqft")).alias("price_per_sqft")
...     )
...     .filter(pl.col("price_per_sqft") > 100)
...     .filter(pl.col("year") < 2010)
...  )

>>> (
...     lazy_query
...     .collect()
...     .select(pl.col(["price_per_sqft", "year"]))
... )
shape: (1_317, 2)
┌────────────────┬──────┐
│ price_per_sqft ┆ year │
│ ---            ┆ ---  │
│ f64            ┆ i64  │
╞════════════════╪══════╡
│ 3268.19045     ┆ 1996 │
│ 274.339166     ┆ 2000 │
│ 296.979717     ┆ 2004 │
│ 378.86472      ┆ 2002 │
│ …              ┆ …    │
│ 2481.810063    ┆ 2009 │
│ 698.203822     ┆ 2008 │
│ 541.067767     ┆ 2005 │
│ 107.170005     ┆ 1995 │
└────────────────┴──────┘

When you run a lazy query with .collect(), you get a regular Polars DataFrame with the results. Because of the filtering criteria, you get only 1317 of the original 5000 rows. You might also notice that all of the price_per_sqft and year values displayed are greater than 100 and less than 2010, respectively. To further verify that the query properly filtered your data, you can take a look at the summary statistics:

Python
>>> (
...     lazy_query
...     .collect()
...     .select(pl.col(["price_per_sqft", "year"]))
...     .describe()
... )
shape: (9, 3)
┌────────────┬────────────────┬─────────────┐
│ describe   ┆ price_per_sqft ┆ year        │
│ ---        ┆ ---            ┆ ---         │
│ str        ┆ f64            ┆ f64         │
╞════════════╪════════════════╪═════════════╡
│ count      ┆ 1317.0         ┆ 1317.0      │
│ null_count ┆ 0.0            ┆ 0.0         │
│ mean       ┆ 1400.622815    ┆ 2002.003037 │
│ std        ┆ 5755.888716    ┆ 4.324595    │
│ min        ┆ 100.02061      ┆ 1995.0      │
│ max        ┆ 90314.966163   ┆ 2009.0      │
│ median     ┆ 296.71958      ┆ 2002.0      │
│ 25%        ┆ 166.351274     ┆ 1998.0      │
│ 75%        ┆ 744.552161     ┆ 2006.0      │
└────────────┴────────────────┴─────────────┘

When you look at the summary statistics with .describe(), you see that the minimum price_per_sqft is around 100, and the maximum year is 2009. This is exactly what you asked for in the lazy query!

Now you have some familiarity with the lazy API, but you might be wondering what the advantage of the lazy API is. If the entire dataset is already stored in memory, why do you need lazy queries to do your analysis? Continue reading to see where the lazy API really shines.

Scanning Data With LazyFrames

In real-world applications, you’ll most likely store your data externally in a static file or database before you do any processing in Python. One of the main superpowers of the lazy API is that it allows you to process large datasets stored in files without reading all the data into memory.

When working with files like CSVs, you’d traditionally read all of the data into memory prior to analyzing it. With Polars’ lazy API, you can minimize the amount of data read into memory by only processing what’s necessary. This allows Polars to optimize both memory usage and computation time.

In the next example, you’ll work with electric vehicle population data from Data.gov. This dataset contains information about electric and hybrid vehicles registered through the Washington State Department of Licensing. Each row in the data represents one car, and each column contains information about the car.

You can manually download this data from the website, or you can use the following function to download the file programmatically. Make sure you have requests installed in your environment before trying this example:

Python
# downloads.py

import requests
import pathlib

def download_file(file_url: str, local_file_path: pathlib.Path) -> None:
    """Download a file and save it with the specified file name."""
    response = requests.get(file_url)
    if response:
        local_file_path.write_bytes(response.content)
        print(f"File successfully downloaded and stored at: {local_file_path}")
    else:
        raise requests.exceptions.RequestException(
            f"Failed to download the file. Status code: {response.status_code}"
        )

This function uses the requests library to download a file from a specified URL. You make a GET request to file_url, and if the request is successful, you store the file locally at local_file_path. You can run the following code to download the electric vehicle population data:

Python
>>> import pathlib
>>> from downloads import download_file

>>> url = "https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD"
>>> local_file_path = pathlib.Path("electric_cars.csv")

>>> download_file(url, local_file_path)
File successfully downloaded and stored at: electric_cars.csv

In this code snippet, you first import download_file() from downloads.py. You then call download_file() on the electric vehicle population data URL and store the roughly 140,000 rows as electric_cars.csv in the working directory of your Python instance. You’re now ready to interact with the data through the lazy API.

The key to efficiently working with files through the lazy API is to use Polars’ scan functionality. When you scan a file, rather than reading the entire file into memory, Polars creates a LazyFrame that references the file’s data. As before, no processing of the data occurs until you explicitly execute a query. With the following code, you scan electric_cars.csv:

Python
>>> lazy_car_data = pl.scan_csv(local_file_path)
>>> lazy_car_data
<polars.LazyFrame object at 0x10292EC50>

>>> lazy_car_data.schema
{'VIN (1-10)': Utf8, 'County': Utf8, 'City': Utf8, 'State': Utf8,
'Postal Code': Int64, 'Model Year': Int64, 'Make': Utf8, 'Model': Utf8,
'Electric Vehicle Type': Utf8, 'Clean Alternative Fuel Vehicle (CAFV) Eligibility': Utf8,
'Electric Range': Int64, 'Base MSRP': Int64, 'Legislative District': Int64,
'DOL Vehicle ID': Int64, 'Vehicle Location': Utf8, 'Electric Utility': Utf8,
'2020 Census Tract': Int64}

You create a LazyFrame, lazy_car_data, by using pl.scan_csv(). Crucially, the data from the CSV file isn’t stored in memory. Instead, the only thing that lazy_car_data holds about electric_cars.csv is the schema, which you can inspect with lazy_car_data.schema.

This allows you to see the file’s column names and their respective data types, and it also helps Polars optimize queries that you run on this data. In fact, Polars must know the schema before executing any step of a query plan.
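
For example, a query that references a misspelled column fails as soon as Polars checks the plan against the schema, without having to process the full dataset. This is a minimal sketch, and the exact error type and message depend on your Polars version:

Python
>>> try:
...     lazy_car_data.select(pl.col("Modle Year")).collect()
... except Exception:  # Polars raises a column-not-found error here
...     print("Polars rejected the plan before processing the data")
...
Polars rejected the plan before processing the data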

You can now run a query on the data contained in electric_cars.csv using the lazy API. Your queries can have arbitrary complexity, and Polars will only store and process the necessary data. For instance, you could run the following query:

Python
>>> lazy_car_query = (
...     lazy_car_data
...     .filter((pl.col("Model Year") >= 2018))
...     .filter(
...         pl.col("Electric Vehicle Type") == "Battery Electric Vehicle (BEV)"
...     )
...     .groupby(["State", "Make"])
...     .agg(
...         pl.mean("Electric Range").alias("Average Electric Range"),
...         pl.min("Model Year").alias("Oldest Model Year"),
...         pl.count().alias("Number of Cars"),
...     )
...     .filter(pl.col("Average Electric Range") > 0)
...     .filter(pl.col("Number of Cars") > 5)
...     .sort(pl.col("Number of Cars"), descending=True)
... )

>>> lazy_car_query.collect()
shape: (20, 5)
┌───────┬───────────┬────────────────────────┬───────────────────┬────────────────┐
│ State ┆ Make      ┆ Average Electric Range ┆ Oldest Model Year ┆ Number of Cars │
│ ---   ┆ ---       ┆ ---                    ┆ ---               ┆ ---            │
│ str   ┆ str       ┆ f64                    ┆ i64               ┆ u32            │
╞═══════╪═══════════╪════════════════════════╪═══════════════════╪════════════════╡
│ WA    ┆ TESLA     ┆ 89.114509              ┆ 2018              ┆ 55690          │
│ WA    ┆ NISSAN    ┆ 93.115056              ┆ 2018              ┆ 5267           │
│ WA    ┆ CHEVROLET ┆ 111.746651             ┆ 2018              ┆ 5001           │
│ WA    ┆ KIA       ┆ 65.380428              ┆ 2018              ┆ 3178           │
│ …     ┆ …         ┆ …                      ┆ …                 ┆ …              │
│ VA    ┆ TESLA     ┆ 139.133333             ┆ 2018              ┆ 15             │
│ MD    ┆ TESLA     ┆ 50.6                   ┆ 2018              ┆ 10             │
│ TX    ┆ TESLA     ┆ 94.625                 ┆ 2018              ┆ 8              │
│ NC    ┆ TESLA     ┆ 61.428571              ┆ 2018              ┆ 7              │
└───────┴───────────┴────────────────────────┴───────────────────┴────────────────┘

In this query, you filter the data on all cars where the model year is 2018 or later and the electric vehicle type is Battery Electric Vehicle (BEV). You then compute the average electric range, the minimum model year, and the number of cars for each state and make. Lastly, you further filter the data where the average electric range is positive and where the number of cars for the state and make is greater than five.

Because this is a lazy query, no computation is performed until you call lazy_car_query.collect(). After the query is executed, only the data you asked for is stored and returned—nothing more.

Each row in the DataFrame returned from lazy_car_query.collect() tells you the average electric range, oldest model year, and number of cars for each state and make. For example, the first row tells you there are 55,690 Teslas from 2018 or later in Washington State, and their average electric range is around 89.11 miles.

With this example, you saw how Polars uses the lazy API to query data from files in a performant and memory-efficient manner. This powerful API gives Polars a huge leg up over other DataFrame libraries, and you should opt to use the lazy API whenever possible. In the next section, you’ll get a look at how Polars integrates with external data sources and the broader Python ecosystem.

Seamless Integration

Polars can read from most popular data sources, and it integrates well with other commonly used Python libraries. This means, for many use cases, Polars can replace whatever data processing library you’re currently using. In this section, you’ll walk through examples of Polars’ flexibility in working with different data sources and libraries.

Integration With External Data Sources

In the previous section, you saw how Polars performs lazy queries over CSV files with scan_csv(). Polars can also handle data sources like JSON, Parquet, Avro, Excel, and various databases. You can interact with most of these file types the same way you worked with the CSV file:

Python
>>> import polars as pl

>>> data = pl.DataFrame({
...     "A": [1, 2, 3, 4, 5],
...     "B": [6, 7, 8, 9, 10],
... })

>>> data.write_csv("data.csv")
>>> data.write_ndjson("data.json")
>>> data.write_parquet("data.parquet")

In this example, you export the results of your work in Polars with various file formats. You first create a DataFrame with columns A and B. You then write the data to CSV, JSON, and Parquet files in the working path of your Python instance. These files are now ready for you to share and read, and Polars makes this quite straightforward:

Python
>>> data_csv = pl.read_csv("data.csv")
>>> data_csv_lazy = pl.scan_csv("data.csv")
>>> data_csv_lazy.schema
{'A': Int64, 'B': Int64}

>>> data_json = pl.read_ndjson("data.json")
>>> data_json_lazy = pl.scan_ndjson("data.json")
>>> data_json_lazy.schema
{'A': Int64, 'B': Int64}

>>> data_parquet = pl.read_parquet("data.parquet")
>>> data_parquet_lazy = pl.scan_parquet("data.parquet")
>>> data_parquet_lazy.schema
{'A': Int64, 'B': Int64}

In this example, you read and scan each of the three files that you previously created, and you print their schemas to confirm that the column names and data types are correct. Polars’ ability to scan these file types also means that the data can be quite large, and you can execute lazy queries to handle this.

Polars supports other file types, and its file capabilities are constantly improving. Polars can also scan and read multiple files with the same schema as if they were a single file. Lastly, Polars can connect directly to a database and execute SQL queries.
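
For instance, the scan functions accept glob patterns, so you can treat a directory of CSV files that share a schema as one dataset. Here’s a minimal sketch that assumes a hypothetical data/ directory of such files:

Python
>>> lazy_all_files = pl.scan_csv("data/*.csv")  # hypothetical directory of CSVs
>>> total_rows_query = lazy_all_files.select(pl.count())

Only when you call total_rows_query.collect() does Polars actually open the files and return the combined row count as a one-row DataFrame.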

Overall, Polars provides a full suite of tools for interacting with commonly used data sources. Next, you’ll see how Polars integrates with other Python libraries, making the library flexible enough to drop into existing code with minimal overhead.

Integration With the Python Ecosystem

Polars integrates seamlessly with existing Python libraries. This is a crucial feature because it allows you to drop Polars into existing code without having to change your dependencies or do a big refactor. In the following example, you can see how Polars DataFrames seamlessly convert between NumPy arrays and pandas DataFrames:

Python
>>> import numpy as np
>>> import pandas as pd
>>> import polars as pl

>>> polars_data = pl.DataFrame({
...     "A": [1, 2, 3, 4, 5],
...     "B": [6, 7, 8, 9, 10]
... })

>>> pandas_data = pd.DataFrame({
...     "A": [1, 2, 3, 4, 5],
...     "B": [6, 7, 8, 9, 10]
... })

>>> numpy_data = np.array([
...     [1, 2, 3, 4, 5],
...     [6, 7, 8, 9, 10]
... ]).T

After importing numpy, pandas, and polars, you create identical datasets using the three libraries. You’ll use the Polars DataFrame, pandas DataFrame, and NumPy array to see how interoperable these libraries are. For example, you can convert the pandas DataFrame and NumPy array to Polars DataFrames with the following functions:

Python
>>> pl.from_pandas(pandas_data)
shape: (5, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
│ 4   ┆ 9   │
│ 5   ┆ 10  │
└─────┴─────┘

>>> pl.from_numpy(numpy_data, schema={"A": pl.Int64, "B": pl.Int64})
shape: (5, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
│ 4   ┆ 9   │
│ 5   ┆ 10  │
└─────┴─────┘

Here, pl.from_pandas() converts your pandas DataFrame to a Polars DataFrame. Similarly, pl.from_numpy() converts your NumPy array to a Polars DataFrame. If you want your columns to have the right data types and names, then you should specify the schema argument when calling pl.from_numpy(). If you want to convert your Polars DataFrame back to pandas or NumPy, then you can do the following:

Python
>>> polars_data.to_pandas()
   A   B
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10

>>> polars_data.to_numpy()
array([[ 1,  6],
       [ 2,  7],
       [ 3,  8],
       [ 4,  9],
       [ 5, 10]])

You use .to_pandas() and .to_numpy() to convert your Polars DataFrame to a pandas DataFrame and NumPy array. Conveniently, .to_pandas() and .to_numpy() are methods of the Polars DataFrame object. The creators of Polars anticipated that many users would want to integrate their Polars code with existing pandas and NumPy code, so they decided to make the conversion between Polars and pandas or NumPy a native action of the DataFrame object.

Because of their widespread use in the Python community, libraries like pandas and NumPy are here to stay for a while. Polars’ ability to integrate with these libraries means that you can introduce it as a way to improve the performance of an existing workflow. For instance, you could use Polars to do intensive data preprocessing for a machine learning model, and then convert the results to a NumPy array before feeding it to your model.
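
Here’s a minimal sketch of that hand-off. The data and feature engineering are made up for illustration, but the pattern of Polars preprocessing followed by .to_numpy() is the important part:

Python
>>> import polars as pl

>>> raw = pl.DataFrame({
...     "sqft": [700.0, 1200.0, 950.0],
...     "year": [1999, 2015, 2008],
... })
>>> features = (
...     raw
...     .with_columns((2023 - pl.col("year")).alias("age"))  # derive an age feature
...     .select(["sqft", "age"])
...     .to_numpy()
... )
>>> features.shape
(3, 2)

From here, the features array can go straight into scikit-learn or any other library that expects NumPy input.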

Next Steps

Polars is a rapidly evolving library that has come to fame quickly in the Python community. You only scratched the surface in this tutorial, and there are many other features that you can learn about to improve your data processing applications with Polars.

Some other prominent features that you didn’t cover in this tutorial are joins, melts, pivots, time series processing, and integration with cloud computing platforms. You can find information on these features in Polars’ user guide or the API reference.
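
As a small taste of one of these, here’s a minimal sketch of a join. It reuses the buildings DataFrame from earlier along with a hypothetical owners table keyed on building_type:

Python
>>> owners = pl.DataFrame({
...     "building_type": ["A", "B"],
...     "owner": ["Alice", "Bob"],
... })
>>> joined = buildings.join(owners, on="building_type", how="inner")
>>> joined.columns
['sqft', 'year', 'building_type', 'owner']

Rows with building_type "C" drop out of the inner join because they have no match in owners.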

Conclusion

Polars is a lightning-fast and rapidly growing DataFrame library. Polars’ optimized back end, familiar yet efficient syntax, lazy API, and integration with the Python ecosystem make the library stand out among the crowd. You’ve now gotten a broad overview of Polars, and you have the knowledge and resources necessary to get started using Polars in your own projects.

In this tutorial, you’ve learned:

  • Where Polars gets its performance from and why the library is trending
  • How to use DataFrames, expressions, and contexts to manipulate data
  • What the lazy API is and how to use it
  • How to integrate Polars with external data sources and other popular Python libraries

With all of these rich features, Polars is undoubtedly a valuable addition to the data analysis and manipulation ecosystem in Python. Whether you’re working with large datasets, need to optimize performance, or require efficient querying, Polars is a compelling choice for your data processing tasks. How can Polars boost your next data project?
