The Parallelism Blues: when faster code is slower

When you’re doing computationally intensive calculations with NumPy, you’ll want to use all your computer’s CPUs. Your computer has 2 or 4 or even more CPU cores, and if you can use them all then your code will run faster.

Except, of course, when parallelism makes your code run slower.

As it turns out, NumPy will transparently parallelize certain operations for you. And if you’re not careful, this can actually slow down your code.

Parallelism makes it faster

Consider the following program:

import numpy as np
a = np.ones((4096, 4096))
a.dot(a)

If we run this under the time utility, we see:

$ time python dot.py 
real    0m1.546s
user    0m4.171s
sys     0m0.537s

As I explain elsewhere in more detail, the real time is the elapsed wall-clock time, while the user time is how much CPU time the process used. Since in this case the user time is higher than the wall-clock time, the operation must have been running on multiple CPUs at once, for a total of 4.17 CPU seconds.

So that’s great! If we’d only used one CPU, this operation would have taken ~4.2 seconds, but thanks to multiple CPUs it took only ~1.5 seconds.
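
If you want to see the same distinction from inside Python, you can compare a wall-clock timer against a process-wide CPU timer. Here’s a quick sketch using only the standard library (not part of the measurements above):

import time
import numpy as np

a = np.ones((4096, 4096))

wall_start = time.perf_counter()  # wall-clock time
cpu_start = time.process_time()   # user + system CPU time, across all threads

a.dot(a)

print("wall-clock seconds:", time.perf_counter() - wall_start)
print("CPU seconds:", time.process_time() - cpu_start)

When BLAS runs dot() on multiple threads, the CPU seconds come out higher than the wall-clock seconds, just as user was higher than real above.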

Parallelism makes it slower

Let’s verify that assumption.

We’ll tell NumPy to use only one CPU, by using the threadpoolctl library:

import numpy as np
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api='blas'):
    a = np.ones((4096, 4096))
    a.dot(a)

And now when we run it:

$ time python dot_onecpu.py 
real    0m3.654s
user    0m3.652s
sys     0m0.403s

When we used multiple CPUs it took ~4.2 CPU seconds, but with a single CPU it took ~3.7 CPU seconds. Measured by CPU time, the code is now faster!

Does this matter? Isn’t a faster wall-clock time what matters?

If you’re only running this one program on your computer, and you don’t have any other parallelism implemented for your program, then yes, this is fine. But if you are implementing some form of parallelism yourself, for example by using multiprocessing, joblib, or my personal favorite Dask, the default parallelism will make your program slower overall.

In this example, every parallelized dot() call eats up roughly 14% more of your overall CPU capacity than it needs to (~4.2 vs. ~3.7 CPU seconds).

BLAS!

You’ll notice that the thread pool limits in the code above referred to BLAS. BLAS is an API for linear algebra that NumPy uses to implement some of its operations, in this case dot().

There are different BLAS implementations available; in the example above I was using OpenBLAS. Another alternative is mkl, which is provided by Intel and therefore optimized for Intel processors; you don’t want to use it on AMD.
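
If you’re not sure which BLAS your NumPy is actually using, threadpoolctl can tell you. Here’s one quick way to check (a sketch; the exact fields may vary between threadpoolctl versions):

import numpy as np  # importing NumPy loads its BLAS library
from threadpoolctl import threadpool_info

# List the BLAS (and OpenMP) libraries loaded into this process,
# along with how many threads each is currently set to use.
for lib in threadpool_info():
    print(lib["internal_api"], lib["num_threads"])

In the OpenBLAS setup above this would report openblas; with an MKL-based install it would report mkl instead.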

For this operation, at least, mkl seems to have the same issue: it runs faster on a single CPU than it does when parallelized to multiple CPUs. In general it’s worth seeing if switching will give you some performance improvement.

If you’re using Conda Forge, you can do that by having either the package blas=*=openblas or the package blas=*=mkl in your environment.yml.
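
For example, a minimal environment.yml pinning OpenBLAS might look like this (the environment name and version numbers are just placeholders):

# environment.yml
name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - blas=*=openblas  # or blas=*=mkl for Intel MKL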

What if it’s just this benchmark?

One could argue that this is just one benchmark, and there are a variety of ways I could have screwed it up. And while that’s true, this is in the end just an example of a more general point: parallelizing a computation always adds coordination and communication overhead, and the parallel threads end up competing for shared resources like memory bandwidth and CPU caches.

It would be extremely surprising, then, if running with N threads actually gave ×N performance.

So, yes, you often do want parallelism, but you need to think about where and how and when you use it.


Reduce parallelism to get more parallelism

If you’re implementing a high-level form of parallelism with Dask or the like, you might want to disable the multi-threading in NumPy. Individual operations will use less CPU time overall, and your own parallelism will ensure utilization of multiple CPUs.
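
Here’s a rough sketch of what that can look like using the standard library’s multiprocessing together with threadpoolctl; the matrix sizes and number of tasks are just illustrative:

import numpy as np
from multiprocessing import Pool
from threadpoolctl import threadpool_limits

def multiply(seed):
    # Inside each worker, limit BLAS to a single thread: the process
    # pool, not BLAS, is responsible for keeping all the CPUs busy.
    with threadpool_limits(limits=1, user_api='blas'):
        rng = np.random.default_rng(seed)
        a = rng.standard_normal((2048, 2048))
        return a.dot(a).sum()

if __name__ == '__main__':
    # By default, Pool() starts one worker process per CPU core.
    with Pool() as pool:
        results = pool.map(multiply, range(8))
    print(results)

The same idea applies with joblib or Dask: make one layer of the stack responsible for parallelism, rather than stacking BLAS threads on top of your own workers.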

In addition, be careful when profiling: if you’re using automatic parallelization, what you’re profiling might not match the behavior on a different computer with a different number of CPUs.