Beyond cProfile: Choosing the right tool for performance optimization

Your Python program is too slow. Maybe your web application can’t keep up, or certain queries are taking a long time. Maybe you have a batch program that takes hours or even days to run.

How do you speed it up?

The basic process you’re likely to follow is:

  1. Choose the right tool to measure speed.
  2. Use the tool to figure out the bottleneck.
  3. Fix the bottleneck.

This article will focus on the first step: choosing the right tool. And in particular, it will cover:

  • cProfile: The Python standard library deterministic profiler.
  • Pyinstrument: A sampling profiler.
  • Eliot: A logging library.

I won’t go into a huge amount of detail about how to use these tools, because the goal is to help you choose the right one. But I will explain what these tools do, and when and why you would choose one over the other.

cProfile: A deterministic profiler

The cProfile profiler is built into Python, so you’ve probably heard of it, and it may be the default tool you use. It works by tracing every function call in a program. That’s why it’s a deterministic profiler: if you run it with the same inputs it’ll give the same output.

By default cProfile measures wallclock time—how much time elapsed during the function run. That means it can measure both CPU slowness and other problems, like sleeping or lock contention. You should always ask what your profiler is measuring, because different measures can detect different problems.

While cProfile is always available in your Python install, it also has some problems—and as you’ll see, most of the time you don’t want to use it.

Note: A previous version of this article (and possibly the talk it was based on) erroneously stated that cProfile measured CPU time. Thanks to Sergio López Montolio for pointing out this error.

Using cProfile

Using cProfile is pretty easy. If you have a script you usually run directly like so:

$ python benchmark.py
7855 messages/sec

Then you can just prefix the command with python -m cProfile to run it under the profiler:

$ python -m cProfile benchmark.py
1444992 function calls (1443814 primitive calls) in 1.208 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    1.208    1.208 benchmark.py:7(<module>)
     1    0.048    0.048    1.151    1.151 benchmark.py:21(run)
 50000    0.031    0.000    0.657    0.000 _output.py:182(write)
 50000    0.045    0.000    0.618    0.000 _output.py:73(send)
 50000    0.064    0.000    0.566    0.000 _output.py:452(__call__)
... a million more lines of output ...

There’s also a Python profiling API, so you can profile particular functions in a Python interpreter prompt or a Jupyter notebook.
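
For example, a minimal sketch of profiling a single function from the interpreter might look like this; do_work() is a placeholder standing in for your own code, not something from this article:

import cProfile

profiler = cProfile.Profile()
profiler.runcall(do_work)   # do_work() is a placeholder for the function you want to profile
profiler.print_stats(sort="cumulative")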

The output format is a table, which isn’t ideal: each row is a function call that ran during the profiled time span, but you don’t know how that function call is related to other function calls. So if you have a function that can be reached from multiple code paths, it can be difficult to figure out which code path was responsible for calling the slow function.

What cProfile can tell you

If you look at the table above, you can see that:

  • _output.py(__call__) was called 50,000 times. It’s such a round number because this is a benchmark script that runs the same code in a loop 10,000 times. If you weren’t deliberately calling a function many times, a high call count like this is useful for spotting busy inner loops.
  • _output.py(send) used 0.618 seconds (including the time spent in functions it called), and 0.045 seconds (not including functions it called).

Using this information you can spot slow functions you can optimize.

How it works

cProfile measures every single function call. In particular, every function call in the run gets wrapped like so:

start = time()
try:
    f()
finally:
    elapsed = time() - start

The profiler records time at the start and finish, and the difference is allocated to that function’s account.
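
cProfile itself is implemented in C, but you can get a feel for the tracing approach with a toy version built on sys.setprofile(), which asks the interpreter to invoke a hook on every function call and return. This is only an illustrative sketch, not how cProfile actually works internally; do_work() is a placeholder:

import sys
import time
from collections import defaultdict

timings = defaultdict(float)   # total elapsed seconds per function name
stack = []                     # (function name, start time) for calls in progress

def hook(frame, event, arg):
    # The interpreter calls this on every Python function call and return.
    if event == "call":
        stack.append((frame.f_code.co_name, time.perf_counter()))
    elif event == "return" and stack:
        name, start = stack.pop()
        timings[name] += time.perf_counter() - start

sys.setprofile(hook)
do_work()                      # placeholder for the code being profiled
sys.setprofile(None)

for name, total in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{total:.4f}s  {name}")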

The problems with cProfile

While cProfile is always available in any Python install, it also suffers from some significant problems.

Problem #1: High overhead and distorted results

As you can imagine, doing extra work for every function call has some costs:

$ python benchmark.py
7855 messages/sec

$ python -m cProfile benchmark.py
5264 messages/sec
... cProfile output ...

Notice how much slower the cProfile run is. And what’s worse, the slowdown isn’t uniform across your program: because it’s tied to number of function calls, parts of your code with more function calls will be slowed down more. So this overhead can distort the results.

Problem #2: Too much information

If you’ll recall the cProfile output we saw above, it includes a row for every single function that was called during the program run. Most of those function calls are irrelevant to our performance problem: they run quickly, and only once or twice.

So when you’re reading cProfile output, you’re dealing with a lot of extra noise masking the signal.

Problem #3: Offline measure of performance

Quite often your program will only be slow when run under real-world conditions, with real-world inputs. Maybe only particular queries from users slow down your web application, and you don’t know which queries. Maybe your batch program is only slow with real data.

But as we saw, cProfile slows down your program quite a bit, so you likely don’t want to run it in your production environment. If the slowness is only reproducible in production, cProfile can only help you in your development environment.

Problem #4: Performance is measured for functions only

cProfile can tell you “slowfunc() is slow”, but that measurement averages over all the inputs to that function. And that’s fine if the function is always slow.

But sometimes you have some algorithmic code that is only slow for specific inputs. It’s quite possible that:

  • slowfunc(100) is fast.
  • slowfunc(0) is slow.

cProfile will not be able to tell you which inputs caused the slowness, which can make it more difficult to diagnose the problem.

cProfile: Usually insufficient

As a result of these problems, cProfile shouldn’t be your performance tool of choice. Instead, next we’ll be talking about two alternatives:

Pyinstrument: a sampling profiler

Pyinstrument solves two of the problems we covered above:

  • It has lower overhead than cProfile, and it doesn’t distort the results.
  • It filters out irrelevant function calls, so there’s less noise.

Pyinstrument measures elapsed wallclock time, not CPU time, so it too can catch slowness caused by network requests, disk writes, lock contention, and so on.

How you use it

Using Pyinstrument is similar to cProfile; just add a prefix to your script:

$ python benchmark.py
7561 messages/sec

$ python -m pyinstrument benchmark.py
6760 messages/sec
... pyinstrument output ...

Notice it has some overhead, but not as much as cProfile—and the overhead is uniform.

Pyinstrument also has a Python API, so you can use it to profile particular pieces of code in a Python interactive interpreter or a Jupyter notebook.
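
A minimal sketch of that API, based on Pyinstrument’s documented Profiler class as of recent versions; do_work() is again a placeholder:

from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
do_work()          # placeholder for the code you want to profile
profiler.stop()
print(profiler.output_text(unicode=True, color=True))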

The output

Pyinstrument’s output is a tree of calls, measuring wallclock time:

1.506 <module>  benchmark.py:7
├─ 1.477 run  benchmark.py:21
│  ├─ 0.604 attach  myapp/_code.py:52
│  │  └─ 0.598 write  myapp/_code.py:130
│  │     ├─ 0.557 final_write  myapp/_foo.py:181
...

Unlike cProfile’s row-per-function, Pyinstrument gives you a tree of function calls, so you can see the context of the slowness. A function might appear multiple times if the slowness is caused by multiple code paths.

As a result, Pyinstrument’s output is much easier to interpret, and gives you much better understanding of your program’s performance structure than cProfile’s default output.

How it works (cat edition)

Imagine you have a cat. You wish to know how that cat spends its time.

You could spy on its every moment, but that would be a lot of work. So instead you decide to take samples: every 5 minutes you stick your head into the room where the cat is, and write down what it’s doing.

For example:

  • 12:00: Sleeping 💤
  • 12:05: Sleeping 💤
  • 12:10: Eating 🍲
  • 12:15: Using the litterbox 💩
  • 12:20: Sleeping 💤
  • 12:25: Sleeping 💤
  • 12:30: Sleeping 💤

A few days later you can summarize your observations:

  • 80%: Sleeping 💤
  • 10%: Eating 🍲
  • 9%: Using the litterbox 💩
  • 1%: Staring longingly through the window at birds 🐦

So how accurate is this summary? Insofar as your goal is to measure where the cat spent most of its time, it’s probably accurate. And the more frequent the observations (==samples) and the more observations you make, the more accurate the summary.

If your cat spends most of its time sleeping, you would expect most sampled observations to show that it’s sleeping. And yes, you’ll miss some quick and rare activities—but for purposes of “what did the cat spend most of its time on” those quick and rare activities are irrelevant.

How it works (software edition)

Much like our cat, Pyinstrument samples a Python program’s behavior at intervals: every 1ms it checks what function is currently running. That means:

  • If a function is cumulatively slow, it’ll show up often.
  • If a function is cumulatively fast, we usually won’t see it at all.

That means our performance summary has less noise: functions that are barely used will mostly be skipped. But overall the summary is pretty accurate in terms of how the program spent its time, so long as we took enough samples.
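
To make the idea concrete, here’s a toy sampling profiler: a background thread that looks at the main thread’s current frame every millisecond via sys._current_frames(). This is only an illustration of the sampling technique, not how Pyinstrument is implemented; do_work() is a placeholder:

import sys
import threading
import time
from collections import Counter

samples = Counter()
stop = threading.Event()

def sample(main_thread_id, interval=0.001):
    # Every `interval` seconds, record which function the main thread is running.
    while not stop.is_set():
        frame = sys._current_frames().get(main_thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

sampler = threading.Thread(target=sample, args=(threading.get_ident(),), daemon=True)
sampler.start()
do_work()          # placeholder for the code being profiled
stop.set()
sampler.join()

total = sum(samples.values()) or 1
for name, count in samples.most_common(10):
    print(f"{100 * count / total:5.1f}%  {name}")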

Eliot: A logging library

The final tool we’ll cover in detail is Eliot, a logging library that I wrote. It solves the other two problems we saw with cProfile:

  • Logging can be used in production.
  • Logging can record arguments to functions.

As you’ll see, Eliot provides some unique capabilities that make it better at recording performance than normal logging libraries. That being said, with some extra work you can get the same benefits from other logging libraries as well.

Adding logging to existing code

Consider the following sketch of a program:

def add(a, b):
    # ... implementation ...
    
def multiply(a, b):
    # ... implementation ...

def multiplysum(a, b, c):
    return multiply(add(a, b), c)

print(multiplysum(1, 2, 4))

We can take this code and add some logging to it:

from eliot import log_call, to_file
to_file(open("out.log", "w"))

@log_call
def add(a, b):
    # ... implementation ...

@log_call
def multiply(a, b):
    # ... implementation ...
# etc.

Specifically, we do two things:

  1. Tell Eliot where to output the log messages (in this case, a file called “out.log”).
  2. Decorate the functions with the @log_call decorator, which logs the fact that the function was called, its arguments, and its return value (or raised exception).

Eliot has other, more fine-grained APIs, but @log_call suffices for demonstrating the benefits of logging.
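
For reference, here’s a runnable version of the sketch above. The function bodies are my own placeholders—the time.sleep() calls just simulate work, with durations picked to roughly match the timings in the output below:

import time
from eliot import log_call, to_file

to_file(open("out.log", "w"))

@log_call
def add(a, b):
    time.sleep(1)      # placeholder for real work
    return a + b

@log_call
def multiply(a, b):
    time.sleep(9)      # placeholder for real (and slower) work
    return a * b

@log_call
def multiplysum(a, b, c):
    return multiply(add(a, b), c)

print(multiplysum(1, 2, 4))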

Eliot’s output

Once we run the program, we can look at the logs using a tool called eliot-tree:

$ python calc.py
12
$ eliot-tree out.log
─── multiplysum (inputs a=1 b=2 c=4) ⧖ 10.0s
    ├── add (inputs a=1 b=2) ⧖ 1.0s
    │   └── result: 3
    ├── multiply (inputs a=3 b=4) ⧖ 9.0s
    │   └── result: 12
    └── result: 12

Notice that, a little like Pyinstrument, we’re looking at a tree of actions. I’ve simplified the output a little—originally so it could fit on a slide I used in the talk version of this article—but even in a prose article it lets us focus on the performance aspect.

In Eliot, each action has a start and a finish, and can start other actions—thus the resulting tree. Since we know when each logged action starts and finishes, we also know how long it took.

In this case, each action maps one-to-one with a function call. And there are some differences from Pyinstrument’s output:

  1. Instead of combining multiple function calls, you see each individual call separately.
  2. You can see the arguments and return result of each call.
  3. You can see the elapsed wallclock time of each action.

For example, you can see that multiplysum() took 10 seconds, but the vast majority of the time was spent in multiply(), with the inputs of 3 and 4. So you immediately know that for performance optimization you want to focus on multiply(), and you have some starting inputs (3 and 4) to play around with.

The limitations of logging

Logging isn’t sufficient on its own as a source of performance information.

First, you only get information from code where you’ve explicitly added logging calls. A profiler can run on any arbitrary piece of code with no advance preparation, but with logging you have to do some work in advance.

If you didn’t add logging code, you won’t get any information. Eliot makes this a bit better, since the tree-of-actions structure gives you some sense of where time was spent, but it’s still not sufficient if logging is too sparse.

Second, you can’t add logging everywhere because that will slow down your program. Logging isn’t cheap—it’s higher overhead than cProfile. So you need to add it judiciously, at key points where it will maximize the information it gives without impacting performance.

Choosing the right tools

So when should you use each tool?

1. Always add logging

Any non-trivial program probably needs some logging, if only to catch bugs and errors. And if you’re already adding logging, you can take the trouble to log the information you need to do performance debugging as well.

Eliot makes it easier, since logging actions inherently gives you elapsed time, but with some extra work you can do this with any logging library.
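
For example, with the standard library’s logging module, a hand-rolled equivalent might look like this sketch (log_duration is a made-up helper, not part of any library):

import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@contextmanager
def log_duration(name, **details):
    # Log when a block starts, when it finishes, and how long it took --
    # a hand-rolled version of what Eliot's actions give you for free.
    logger.info("%s started: %r", name, details)
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s finished in %.3fs", name, time.perf_counter() - start)

with log_duration("process_query", query="SELECT ..."):
    ...  # the work you want to measure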

2. Next, use a sampling profiler like Pyinstrument

Logging might help you spot the specific place your program is slow, and at the very least some inputs causing slowness, but it’s often insufficient. So the next step is to use a profiler, and in particular a sampling profiler like Pyinstrument:

  • It has low overhead, and more importantly doesn’t distort the results.
  • It measures wallclock time, so it doesn’t assume the CPU is the bottleneck.
  • It only outputs the slower functions, omitting the irrelevant fast functions.

3. Use cProfile if you need a custom cost metric

If you ever need to write a custom profiler, cProfile allows you to plug in different cost functions, making it an easy tool for measuring more unusual metrics.

You can measure:

  • CPU only.
  • Non-CPU time only, i.e. all the time spent waiting for non-CPU events.
  • The number of voluntary context switches, i.e. the number of system calls that take a long time.
  • Memory allocations.
  • More broadly, any counter that goes up.
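
For example, cProfile.Profile() accepts a custom timer function as its first argument. Here’s a sketch that measures CPU time only, by plugging in time.process_time; do_work() is a placeholder:

import cProfile
import pstats
import time

# Pass a custom clock: process_time only advances while the CPU is busy,
# so sleeping and waiting won't be counted.
profiler = cProfile.Profile(time.process_time)
profiler.enable()
do_work()          # placeholder for the code being measured
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)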

TL;DR

As a good starting point on performance optimization tools, I suggest that you:

  1. Log key inputs and outputs, and the elapsed time of key actions, using Eliot or some other logging library.
  2. Use Pyinstrument—or another sampling profiler—as your default profiler.
  3. Use cProfile when you need a custom cost metric.