Reducing Pandas memory usage #2: lossy compression

If you want to process a large amount of data with Pandas, there are various techniques you can use to reduce memory usage without changing your data. But what if that isn’t enough? What if you still need to reduce memory usage?

Another technique you can try is lossy compression: drop some of your data in a way that doesn’t impact your final results too much. If parts of your data don’t impact your analysis, no need to waste memory keeping extraneous details around.

In particular, in this article we’ll cover the following techniques:

  1. Changing numeric column representation.
  2. Sampling.

Technique #1: Changing numeric representations

Let’s say you have a DataFrame with a column that represents the likelihood that a registered voter will actually vote. The initial representation is a floating point number between 0 and 1, loaded as float64 by default:

>>> df = pd.read_csv("/tmp/voting.csv")
>>> df["likelihood"].memory_usage()
8000128
>>> df["likelihood"].dtype
dtype('float64')
>>> df.head()
   Unnamed: 0  likelihood
0           0    0.894364
1           1    0.715366
2           2    0.626712
3           3    0.138042
4           4    0.429280
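
As a side note, that Unnamed: 0 column is just a leftover index that was written into the CSV, stored as yet another int64 column. If you don’t need it, one option (a hypothetical variation on the load above, using read_csv’s usecols parameter) is to skip it entirely so it never takes up memory:

# Hypothetical variation: only load the column we actually care about,
# skipping the leftover "Unnamed: 0" index column entirely.
df_trimmed = pd.read_csv("/tmp/voting.csv", usecols=["likelihood"])

The rest of this article keeps the original df, extra column and all, so the memory numbers stay comparable.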

Now, for most purposes this much precision isn’t important. So one thing we can do is change from float64 to float32, which cuts memory usage in half, in this case with only a minimal loss of accuracy:

>>> df["likelihood"].memory_usage()
4000128
>>> df.head()
   Unnamed: 0  likelihood
0           0    0.894364
1           1    0.715366
2           2    0.626712
3           3    0.138042
4           4    0.429280

But we can do even better. If we’re willing to lose some more detail, we can reduce memory usage to 1/8th of the original size.

Instead of representing the values as floating-point numbers, we can represent them as whole percentages between 0 and 100. We’ll be down to two digits of precision, but again, for many use cases that’s sufficient. Plus, if this is output from a model, those last few digits of “accuracy” are likely to be noise; they won’t actually tell us anything useful.

Whole percentages have the nice property that they can fit in a single byte, an int8—as opposed to float64, which uses eight bytes:

>>> likelihood_percentage = numpy.round(
...     df["likelihood"] * 100).astype("int8")
>>> likelihood_percentage.head()
0    89
1    72
2    63
3    14
4    43
Name: likelihood, dtype: int8
>>> likelihood_percentage.memory_usage()
1000128

We can write that data out to a CSV, and then later we can just load the smaller data:

>>> df["likelihood"] = likelihood_percentage
>>> df.to_csv("voting_smaller.csv")
>>> df = pd.read_csv("voting_smaller.csv",
...                  dtype={"likelihood": "int8"})

Technique #2: Sampling

Let’s say you want to do a phone survey of voters in your city. You aren’t going to call all one million people; you’re going to call a sample, say 1,000 of them. This can be considered a form of lossy compression, since you only need a subset of the rows.

How do you load only a subset of the rows?

When you load your data, you can specify a skiprows function that will randomly decide whether to load that row or not:

>>> from random import random
>>> def sample(row_number):
...     if row_number == 0:
...         # Never drop the row with column names:
...         return False
...     # random() returns uniform numbers between 0 and 1:
...     return random() > 0.001
... 
>>> sampled = pd.read_csv("/tmp/voting.csv", skiprows=sample)
>>> len(sampled)
973

In this example, we want ~1,000 rows out of 1,000,000, so we skip 99.9% of the rows at random.
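
Because each row is kept independently with probability 0.001, the exact count will vary from run to run (973 above, rather than exactly 1,000). If you want a reproducible sample, one option (an assumption on my part, not something the original code does) is to seed Python’s random number generator before loading, and you can check the payoff with memory_usage():

import random
import pandas as pd

random.seed(12345)  # hypothetical seed, for a reproducible sample

def sample(row_number):
    if row_number == 0:
        # Never drop the row with column names:
        return False
    # Returning True means "skip this row"; keep roughly 0.1% of rows:
    return random.random() > 0.001

sampled = pd.read_csv("/tmp/voting.csv", skiprows=sample)
# The sample should use roughly a thousandth of the full dataset's memory:
print(sampled.memory_usage(deep=True).sum())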

There are other use cases where sampling is useful: will your graph really look that different if you plot a sample of 10,000 points vs. all one million calculated values?

Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.

Need to identify the memory and performance bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production, on macOS and Linux, and with built-in Jupyter support.

Get creative!

Lossy compression is often about the specific structure of your data, and your own personal understanding of which details matter and which details don’t. So if you’re running low on memory, think about what data you really need, and what alternative representations can make it smaller.

And if compression still isn’t enough, you can also try processing your data in chunks.
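
For reference, chunked processing with read_csv usually looks something like the sketch below (a minimal, hypothetical example: it assumes the same voting.csv file and uses a simple running mean as the per-chunk computation):

import pandas as pd

total = 0.0
count = 0
# chunksize makes read_csv return an iterator of smaller DataFrames,
# so only one chunk needs to fit in memory at a time:
for chunk in pd.read_csv("/tmp/voting.csv", chunksize=100_000):
    total += chunk["likelihood"].sum()
    count += len(chunk)

print("mean likelihood:", total / count)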

Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.