Reducing Pandas memory usage #3: Reading in chunks

Sometimes your data file is so large you can’t load it into memory at all, even with compression. So how do you process it quickly?

By loading and then processing the data in chunks, you can load only part of the file into memory at any given time. And that means you can process files that don’t fit in memory.

Let’s see how you can do this with Pandas’ chunksize option.

Reading the full file

We’ll start with a program that just loads a full CSV into memory. In particular, we’re going to write a little program that loads a voter registration database and counts how many voters live on each street in the city:

import pandas

voters_street = pandas.read_csv(
    "voters.csv")["Residential Address Street Name "]
print(voters_street.value_counts())

If we run it we get:

$ python voter-by-street-1.py 
MASSACHUSETTS AVE           2441
MEMORIAL DR                 1948
HARVARD ST                  1581
RINDGE AVE                  1551
CAMBRIDGE ST                1248
                            ... 
NEAR 111 MOUNT AUBURN ST       1
SEDGEWICK RD                   1
MAGAZINE BEACH PARK            1
WASHINGTON CT                  1
PEARL ST AND MASS AVE          1
Name: Residential Address Street Name , Length: 743, dtype: int64

Where is memory being spent? As you would expect, the bulk of the memory is allocated by loading the CSV into a DataFrame.

In the following graph of peak memory usage, the width of each section indicates what percentage of the memory it used:

  • The section on the left is the CSV read.
  • The narrower section on the right is memory used importing all the various Python modules, in particular Pandas; unavoidable overhead, basically.
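
If you want a rough way to check this on your own machine, Python’s built-in tracemalloc module can report peak allocations. This is only a sketch, not how the graph above was generated, and tracemalloc only sees allocations that go through Python’s allocators, so it can undercount memory grabbed directly by C extensions:

import tracemalloc
import pandas

# Start tracing Python-level allocations before loading the CSV.
tracemalloc.start()

voters_street = pandas.read_csv(
    "voters.csv")["Residential Address Street Name "]
print(voters_street.value_counts())

# get_traced_memory() returns (current, peak) in bytes.
current, peak = tracemalloc.get_traced_memory()
print(f"Peak traced memory: {peak / 1_000_000:.1f} MB")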

You don’t have to read it all

As an alternative to reading everything into memory, Pandas allows you to read data in chunks. In the case of CSV, we can load only some of the lines into memory at any given time.

In particular, if we use the chunksize argument to pandas.read_csv, we get back an iterator over DataFrames, rather than one single DataFrame. Each DataFrame contains the next 1000 rows of the CSV:

import pandas

result = None
for chunk in pandas.read_csv("voters.csv", chunksize=1000):
    voters_street = chunk[
        "Residential Address Street Name "]
    chunk_result = voters_street.value_counts()
    if result is None:
        result = chunk_result
    else:
        result = result.add(chunk_result, fill_value=0)

result.sort_values(ascending=False, inplace=True)
print(result)

When we run this we get basically the same results:

$ python voter-by-street-2.py 
MASSACHUSETTS AVE           2441.0
MEMORIAL DR                 1948.0
HARVARD ST                  1581.0
RINDGE AVE                  1551.0
CAMBRIDGE ST                1248.0
                             ...

If we look at the memory usage, we’ve reduced it so much that it’s now dominated by importing Pandas; the actual processing code barely uses anything:

The MapReduce idiom

Taking a step back, what we have here is a highly simplified instance of the MapReduce programming model. While MapReduce is typically used in distributed systems, where chunks are processed in parallel and therefore handed out to worker processes or even worker machines, you can still see it at work in this example.

In the simple form we’re using, MapReduce chunk-based processing has just two steps:

  1. For each chunk you load, you map or apply a processing function.
  2. Then, as you accumulate results, you “reduce” them by combining partial results into the final result.

We can re-structure our code to make this simplified MapReduce model more explicit:

import pandas
from functools import reduce

def get_counts(chunk):
    voters_street = chunk[
        "Residential Address Street Name "]
    return voters_street.value_counts()

def add(previous_result, new_result):
    return previous_result.add(new_result, fill_value=0)

# MapReduce structure:
chunks = pandas.read_csv("voters.csv", chunksize=1000)
processed_chunks = map(get_counts, chunks)
result = reduce(add, processed_chunks)

result.sort_values(ascending=False, inplace=True)
print(result)

Both reading chunks and map() are lazy, only doing work when they’re iterated over. As a result, chunks are only loaded into memory on demand when reduce() starts iterating over processed_chunks.
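
One quick way to see this laziness for yourself is to check what read_csv() and map() return before anything iterates over them. Here’s a toy sketch that uses len() as a stand-in for get_counts() and sum() as the reducer:

import pandas

chunks = pandas.read_csv("voters.csv", chunksize=1000)
print(type(chunks))        # a TextFileReader iterator, not a DataFrame

lazy_lengths = map(len, chunks)
print(type(lazy_lengths))  # a map object; no chunks have been read yet

# Rows are only read from disk as this final iteration pulls them in:
print(sum(lazy_lengths))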

Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.

Need to identify the memory and performance bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and in production, on macOS and Linux, and with built-in Jupyter support.

[Image: A memory profile created by Sciagraph, showing a list comprehension is responsible for most memory usage]
[Image: A performance timeline created by Sciagraph, showing both CPU and I/O as bottlenecks]

From full reads to chunked reads

You’ll notice in the code above that get_counts() could just as easily have been used in the original version, which read the whole CSV into memory:

import pandas

def get_counts(chunk):
    voters_street = chunk[
        "Residential Address Street Name "]
    return voters_street.value_counts()

result = get_counts(pandas.read_csv("voters.csv"))

That’s because reading everything at once is a simplified version of reading in chunks: you only have one chunk, and therefore don’t need a reducer function.

So here’s how you can go from code that reads everything at once to code that reads in chunks (a small example applying these steps follows the list):

  1. Separate the code that reads the data from the code that processes the data.
  2. Use the new processing function by mapping it across the results of reading the file chunk by chunk.
  3. Figure out a reducer function that can combine the processed chunks into a final result.
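
Putting those three steps together as a generic sketch, with a deliberately trivial task (counting rows; process_chunk and combine are placeholder names, not from the example above):

import pandas
from functools import reduce

def process_chunk(chunk):
    # Step 1: processing logic lives in its own function, separate from reading.
    return len(chunk)

def combine(a, b):
    # Step 3: a reducer that merges two partial results into one.
    return a + b

# Step 2: map the processing function across the chunked reads, then reduce.
chunks = pandas.read_csv("voters.csv", chunksize=1000)
total_rows = reduce(combine, map(process_chunk, chunks))
print(total_rows)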

Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.