The hidden performance overhead of Python C extensions

Python is slow, and compiled languages like Rust, C, or C++ are fast. So when your application is too slow, rewriting some of your code in a compiled extension can seem like the natural approach to speeding things up.

Unfortunately, compiled extensions are sometimes actually slower than the equivalent Python code. And even when they’re faster, the performance improvement might be far less than you’d imagine, due to hidden overhead caused by two factors:

  1. Function call overhead.
  2. Serialization/deserialization overhead.

Let’s see where these hidden performance overheads come from, and then look at some solutions for getting around them.

Problem #1: Call overhead

The first performance overhead we’re going to face is that of function calls. Let’s write a function in Cython, a Python variant language that compiles to C, and see how long it takes to run it.

Here’s the Cython function:

def do_nothing():
    pass
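A note on setup, in case you want to follow along: Cython code like this lives in a .pyx file that has to be compiled before Python can import it. Here’s a minimal build script sketch, assuming the function above is saved as overhead_cython.pyx:

# setup.py - a minimal sketch of a build script for the Cython module;
# assumes the do_nothing() function above is saved as overhead_cython.pyx.
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("overhead_cython.pyx"))

Running python setup.py build_ext --inplace then produces a compiled module you can import.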

We can use the IPython %timeit magic function to measure performance:

In [1]: from overhead_cython import do_nothing

In [2]: %timeit do_nothing()
30 ns ± 0.0352 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: def py_do_nothing(): pass

In [4]: %timeit py_do_nothing()
62.5 ns ± 0.114 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Calling the Cython function is faster than calling a pure-Python function, it’s true. But even 30 nanoseconds is rather slow by the standards of compiled languages: for comparison, a C function called by another C function might take only 3 nanoseconds, or much less if it gets inlined.
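By the way, you don’t need IPython to reproduce measurements like these; the standard library’s timeit module does the same job. A quick sketch (the numbers will of course vary by machine):

# Timing the call with the standard library's timeit module instead of
# IPython's %timeit; numbers will differ from machine to machine.
import timeit

n = 10_000_000
seconds = timeit.timeit(
    "do_nothing()",
    setup="from overhead_cython import do_nothing",
    number=n,
)
print(f"{seconds / n * 1e9:.1f} ns per call")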

Problem #2: (De)serialization overhead

Beyond the overhead of calling the extension, there is the overhead of getting arguments into the function and getting the result back. Python represents objects differently than C or Rust does, so a conversion process is necessary. And the conversion code also needs error handling in case it fails.
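You can get a feel for the size of that representation gap from within Python; here’s a small sketch (the exact sizes are CPython implementation details and vary by version and platform):

# A Python int is a full heap object, with a type pointer and a
# reference count, not a bare machine word.
import sys

print(sys.getsizeof(12345))  # typically 28 bytes on 64-bit CPython

# The C long an extension actually wants is just 8 bytes, so the
# extension must unpack (and overflow-check) every object on the way
# in, and box results on the way out.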

The result is more overhead. For example, here’s a Cython function that takes a Python list of integers, sums them, and returns the result:

def sum(values):
    cdef long result = 0
    cdef long v
    for v in values:
        result += v
    return result

We can compare performance to Python:

In [1]: from example_cython import sum as cython_sum

In [2]: l = list(range(1000000))

In [3]: sum(l), cython_sum(l)
Out[3]: (499999500000, 499999500000)

In [4]: %timeit sum(l)
7.64 ms ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit cython_sum(l)
6.99 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The compiled Cython code is no faster than Python’s built-in sum(). And that’s not surprising: sum() is written in C, and the actual math is quite fast, as we’ll see below. Essentially all the runtime is spent converting Python objects into C integers.

So how do we solve these two forms of overhead?

Solution #1: Manage data inside the extension

The first solution is to manage all of our data inside the Rust or C extension, which means we avoid most of the (de)serialization overhead.

This is what NumPy does with arrays: in theory, it could have used Python lists, but then it would have suffered from (de)serialization overhead. So instead, NumPy arrays internally store numbers not as Python lists of Python integers, but as C arrays of C integers. Operations on NumPy arrays therefore don’t require deserializing every entry in the array.

For example, we can create a NumPy array containing a range of numbers. That involves creating a C array of C integers: much less work (and much less memory!) than a Python list of Python integers. We can then sum it, with all of that logic running in C, and no deserialization needed except for converting the final sum back into a Python object:

In [1]: import numpy as np

In [2]: l = list(range(1_000_000))

In [3]: arr = np.arange(1_000_000)

In [4]: type(arr)
Out[4]: numpy.ndarray

In [5]: sum(l), arr.sum()
Out[5]: (499999500000, 499999500000)

In [6]: %timeit sum(l)
7.68 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit arr.sum()
620 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The NumPy code takes less than a millisecond: it is 12× faster than Python. Success!
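The “much less memory” claim is easy to check, too. A small sketch (the exact numbers are CPython and NumPy implementation details, and will vary by version and platform):

# Comparing the memory footprint of the two representations.
import sys
import numpy as np

l = list(range(1_000_000))
arr = np.arange(1_000_000)

# The list costs a pointer array plus a million individual int objects:
list_bytes = sys.getsizeof(l) + sum(sys.getsizeof(x) for x in l)
print(list_bytes)   # ~36 MB on 64-bit CPython
print(arr.nbytes)   # 8 MB: one contiguous buffer of C integers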

Solution #2: Move more work into the extension

Another approach we can take is moving more of the calculation into the extension. In all of our examples we’ve been summing a range of numbers, typically 0 to 999,999. If that’s all we need to do, we don’t have to create a whole list of numbers in Python, we can just write a Cython function that sums a range of integers:

def sum_range(long start, long end):
    cdef long i, result
    result = 0
    for i in range(start, end):
        result += i
    return result

We can measure the performance:

In [1]: from example_cython import sum_range

In [2]: sum(list(range(1_000_000))), sum_range(0, 1_000_000)
Out[2]: (499999500000, 499999500000)

In [3]: %timeit sum_range(0, 1_000_000)
306 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That’s pretty good, even faster than NumPy’s implementation, plus we don’t have the overhead of creating a whole array.

But we can do better!

Let’s switch languages, and try out Rust. Unfortunately, the Rust Python bindings provided by the PyO3 library have higher call and (de)serialization overhead than Cython, because PyO3 is a newer project. This is improving over time: PyO3’s v0.14 release noticeably reduced function call overhead, for example. On the plus side, Rust gives you memory safety, thread safety, and a much higher-level language, while still matching the performance of C/C++.

In Rust, we can use a range object that is not that different from Python’s range:

#[pyfunction]
fn sum_range(start: u64, end: u64) -> u64 {
    assert!(start <= end);
    (start..end).sum()
}

We can then measure the performance:

In [1]: from example_rust import sum_range

In [2]: sum(list(range(1_000_000))), sum_range(0, 1_000_000)
Out[2]: (499999500000, 499999500000)

In [3]: %timeit sum_range(0, 1_000_000)
165 ns ± 0.0381 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Notice the result is in nanoseconds. Rust is 1,800× faster than Cython. How is this possible?

The best solution: do less work

It turns out that when we moved more of the work into a high-level construct like Rust’s Ranges, we ended up with a much more clever solution. Summing an arbitrary array of integers requires looping over all the values, and so it’s O(N): the more values in the array, the longer it will take. And summing a consecutive range of integers can be done the same way.

Or we can take advantage of the fact that the range is consecutive, in which case the sum can be computed in a fixed amount of time, using the closed-form formula (start + end) × N / 2 for an inclusive range of N consecutive integers. The LLVM compiler toolchain that Rust uses is smart enough to replace the loop with a (slightly different) fixed-time calculation. Keep in mind that this isn’t really Rust-specific; on macOS, Python C extensions are compiled with the LLVM-based clang by default, so it’s possible the Cython version would also get this optimization.
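Concretely, for the numbers 0 through 999,999 that’s (0 + 999,999) × 1,000,000 / 2 = 499,999,500,000: the same answer the loops above computed, but from a handful of arithmetic operations instead of a million additions.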

Given this optimization, the size of the range barely matters, unlike with the summing-in-a-loop implementations:

In [4]: %timeit sum_range(0, 100)
188 ns ± 0.616 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit sum_range(0, 100_000_000)
189 ns ± 0.132 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

We can implement the faster algorithm in Python too:

In [15]: def py_sum_range(start, end):
    ...:     return (start + end - 1) * (end - start) // 2
    ...: 

In [16]: py_sum_range(0, 1_000_000)
Out[16]: 499999500000

In [17]: %timeit py_sum_range(0, 1_000_000)
252 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
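If you’re worried about off-by-one mistakes in a formula like this, a sanity check against the obvious loop is cheap; a quick sketch:

# Check the closed-form sum against the straightforward loop for a
# few arbitrary half-open ranges.
def py_sum_range(start, end):
    # Same function as above.
    return (start + end - 1) * (end - start) // 2

for start, end in [(0, 100), (5, 17), (123, 4567)]:
    assert py_sum_range(start, end) == sum(range(start, end))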

Our Rust code is faster than this Python implementation, but only by a little. And our Python implementation is still vastly faster than the Cython or NumPy solutions we used above. By constraining the problem, we’ve come up with a fundamentally superior solution, and quite possibly it’s fast enough not to need a compiled extension at all.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.


Reducing extension overhead

What have we learned about optimizing Python code?

Start by trying to redefine the problem in a way that requires less work. You might be able to get a massive speedup using plain old Python.

If you do need to write an extension:

  1. Try to move as much of the computation as possible into the extension, to reduce the overhead of preparing data in Python, serialization costs, and function call overhead.
  2. If you’re dealing with large numbers of objects, reduce serialization overhead by having a custom extension type that can store the data with minimal conversions to/from Python, the way NumPy does with its array objects; there’s a small sketch of this idea below.
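You don’t even need to write an extension to see the principle behind that second point at work: the standard library’s array module stores numbers in a single C buffer, which anything speaking the buffer protocol can consume without per-element conversion. A minimal sketch (assuming a platform where a C long is 8 bytes):

# One contiguous buffer of C longs instead of a million separate
# Python int objects:
from array import array
import numpy as np

data = array("l", range(1_000_000))
print(data.itemsize * len(data))  # 8 MB of raw C longs on most 64-bit platforms

# NumPy can consume the buffer directly, with no per-element conversion:
print(np.frombuffer(data, dtype=np.int64).sum())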