
Where exactly does Python 3.11 get its ~25% Speedup? Part 1: General Optimizations

28 October, 2022

Python 3.11 was released a few days ago and, as usual, it comes with more than a few new features that make it ever more interesting, from exception groups and fine-grained error locations in tracebacks, to TOML parsing support in the standard library, and of course the much-awaited speedup delivered as part of the Faster CPython project.

CPython 3.11 is on average 25% faster than CPython 3.10, as measured with the pyperformance benchmark suite. The changes are part of what Guido van Rossum dubbed the "Shannon Plan". For the 3.11 release, the plan focused on optimizations in two main areas: startup and runtime. The release also includes additional optimizations that are not part of the Faster CPython project.

In this first part I will go through the details of the general optimizations (i.e. those not part of the Faster CPython project) that made it into the stable 3.11.0 release.

Optimized printf-style % formatting on some format codes

Using formatted string literals is the fastest way to format strings.

A simple benchmark in Python 3.10:

$ python -m pyperf timeit -s \
  'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 187 ns +- 8 ns

But using f-strings seems to be ~ 42% faster:

$ python -m pyperf timeit -s \
  'k = "foo"; v = "bar"' -- 'f"{k!s} = {v!r}"'
.....................
Mean +- std dev: 131 ns +- 9 ns

The optimization works by having the compiler convert simple C-style formatting into an equivalent f-string. In 3.11.0, only the %s, %r and %a format codes are converted, but there's an ongoing pull request to also support %d, %i, %u, %o, %x, %X, %f, %e, %g, %F, %E and %G.
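Since this is a compiler-level rewrite, the two forms must stay semantically identical. A quick sanity check of the equivalence (the f-string conversions `!s`, `!r` and `!a` mirror `%s`, `%r` and `%a`):

```python
# The %-to-f-string rewrite must preserve semantics: each converted
# format code produces exactly the same string as its f-string twin.
k, v = "foo", "bar"
assert "%s = %r" % (k, v) == f"{k!s} = {v!r}" == "foo = 'bar'"

# %a maps to the !a (ascii()) conversion:
assert "%a" % ("café",) == f"{'café'!a}" == "'caf\\xe9'"
```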

For example, here's the same benchmark in Python 3.11:

$ python -m pyperf timeit -s \
  'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 100 ns +- 5 ns

That's about 87% faster! Of course, other 3.11 runtime optimizations are likely at play here as well.


Faster int division for Python's bignums

In Python 3.10:

python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 1.18 us +- 0.02 us

In Python 3.11:

python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 995 ns +- 15 ns

That's about 18% faster.

The optimization evolved from an observation by Mark Dickinson that the compiler always generated a 128:64 divide instruction, despite CPython's bignum digits only holding 30-bit values.

Python divisions are somewhat crippled even on x64. Assuming 30-bit digits, the basic building block that's needed for multi-precision division is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and ideally also a 32-bit remainder). And there's an x86/x64 instruction that does exactly that, namely DIVL. But without using inline assembly, current versions of GCC and Clang apparently can't be persuaded to emit that instruction from the longobject.c source - they'll use DIVQ (a 128-bit-by-64-bit division, albeit with the top 64 bits of the dividend set to zero) on x64, and the __udivti3 or __udivti4 intrinsic on x86.
- Mark Dickinson (full context)
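For context, the digit size CPython uses for its arbitrary-precision ints can be inspected at runtime:

```python
import sys

# CPython stores big ints as arrays of "digits"; on typical 64-bit
# builds each digit holds 30 bits in a 4-byte slot (some 32-bit
# builds use 15-bit digits in 2-byte slots).
print(sys.int_info.bits_per_digit, sys.int_info.sizeof_digit)
```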


Faster sum of single digit PyLongs

This issue started with the realization that sum is much faster in Python 2.7 than in Python 3. Unfortunately, that still seems to be the case in 3.11.0 under certain conditions.

Python 2.7:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 37.4 us +- 1.1 us

Python 3.10:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 52.7 us +- 1.3 us

Python 3.11:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 39.0 us +- 1.0 us

The difference between Python 3.10 and 3.11 is that the performance of calling sum on single-digit PyLongs was improved by inlining the unpacking of single-digit PyLongs in the fast-addition branch of the sum function. This avoids a call to PyLong_AsLongAndOverflow for each element.

What's interesting to note is that Python 3.11 is still significantly slower than Python 2.7 summing integers in some cases. We'll hopefully see additional improvements with a more efficient implementation of integers in Python.
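What "single digit" means here: on typical 64-bit builds a PyLong digit holds 30 bits, so ints below 2**30 in absolute value fit in one digit and take the fast path. The digit boundary is visible through sys.getsizeof:

```python
import sys

one_digit = 2**30 - 1   # largest int fitting in a single 30-bit digit
two_digits = 2**30      # needs an extra digit

# Each extra digit adds sys.int_info.sizeof_digit bytes to the object
# (digit widths differ on 15-bit-digit builds, but the size still grows).
print(sys.getsizeof(one_digit), sys.getsizeof(two_digits))
```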


Faster list.append by streamlining list resize

list.append is noticeably faster in Python 3.11 (about 54% in the benchmark below).

List append - Python 3.10:

$ python -m pyperf timeit -s \
  'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 605 us +- 20 us

List append - Python 3.11:

$ python -m pyperf timeit -s \
  'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 392 us +- 14 us
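The resize path matters because list.append overallocates: CPython grows the underlying buffer in chunks, so most appends reuse spare capacity and never reallocate. The growth pattern can be observed with sys.getsizeof (an illustrative sketch; exact sizes vary by version and platform):

```python
import sys

xs = []
resize_points = []
prev = sys.getsizeof(xs)
for i in range(64):
    xs.append(i)
    size = sys.getsizeof(xs)
    if size != prev:
        # the buffer was reallocated on this append
        resize_points.append((i, size))
        prev = size

# Far fewer resizes than appends -- most appends hit spare capacity.
print(len(resize_points), resize_points[:4])
```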

There's also some minor improvement with simple list comprehensions:

Python 3.10:

$ python -m pyperf timeit -s \
  '' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 553 us +- 19 us

Python 3.11:

$ python -m pyperf timeit -s \
  '' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 516 us +- 16 us


Smaller dictionary size when keys are all Unicode

This optimization makes Python more cache efficient when using dictionaries whose keys are all Unicode strings. Memory usage drops because such dictionaries no longer store a cached hash per entry; Unicode objects already cache their own hash.

For example, on a 64-bit platform, in Python 3.10:

>>> sys.getsizeof(dict(foo="bar", bar="foo"))
232

While in Python 3.11:

>>> sys.getsizeof(dict(foo="bar", bar="foo"))
184
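The savings only apply when every key is a Unicode string; a single non-string key forces the general table layout that stores a per-entry hash. A quick check (on Python 3.11+ the all-Unicode dict is smaller; earlier versions report the same size for both):

```python
import sys

d_unicode = {"foo": "bar", "bar": "foo"}   # all keys are str
d_mixed = {"foo": "bar", 1: "foo"}         # non-str key -> general layout

print(sys.getsizeof(d_unicode), sys.getsizeof(d_mixed))
```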


Faster file transfer for large files when using asyncio.DatagramProtocol

asyncio.DatagramProtocol provides a base class for implementing datagram (UDP) protocols. With this optimization, transferring large files (e.g. ~60 MiB) over UDP with asyncio is over 100 times faster than in Python 3.10. The fix computes the size of the buffer once and stores it in a property instead of recomputing it on every access.

The author of the PR, msoxzw, provided a test script demonstrating the improvement.
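The PR's script isn't reproduced here; below is a minimal, self-contained sketch of the DatagramProtocol API the optimization targets (the EchoServer/EchoClient names are mine, not from the PR):

```python
import asyncio

class EchoServer(asyncio.DatagramProtocol):
    """Echoes every received datagram back to its sender."""
    def connection_made(self, transport):
        self.transport = transport

    def datagram_received(self, data, addr):
        self.transport.sendto(data, addr)

class EchoClient(asyncio.DatagramProtocol):
    """Sends one payload and resolves a future with the echoed reply."""
    def __init__(self, payload, done):
        self.payload, self.done = payload, done

    def connection_made(self, transport):
        transport.sendto(self.payload)

    def datagram_received(self, data, addr):
        self.done.set_result(data)

async def main():
    loop = asyncio.get_running_loop()
    server, _ = await loop.create_datagram_endpoint(
        EchoServer, local_addr=("127.0.0.1", 0))
    addr = server.get_extra_info("sockname")

    done = loop.create_future()
    client, _ = await loop.create_datagram_endpoint(
        lambda: EchoClient(b"x" * 1024, done), remote_addr=addr)

    data = await asyncio.wait_for(done, timeout=5)
    client.close()
    server.close()
    return data

print(len(asyncio.run(main())))
```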


In math lib: Faster comb(n, k) and perm(n, k=None)

Python 3.8 added the comb(n, k) and perm(n, k=None) functions to the math module. Both count the number of ways to choose k items from n items without repetition; comb ignores order, while perm takes order into account. The speedup is achieved through multiple smaller improvements, like using a divide-and-conquer algorithm to benefit from Karatsuba multiplication of large numbers, and doing the calculations for comb in the C unsigned long long type instead of Python integers when possible (*). Another improvement applies for smaller values (for 0 <= k <= n <= 67):

For 0 <= k <= n <= 67, comb(n, k) always fits into a uint64_t. We compute it as comb_odd_part << shift where 2 ** shift is the largest power of two dividing comb(n, k) and comb_odd_part is comb(n, k) >> shift. comb_odd_part can be calculated efficiently via arithmetic modulo 2 ** 64, using three lookups and two uint64_t multiplications, while the necessary shift can be computed via Kummer's theorem: it's the number of carries when adding k to n - k in binary, which in turn is the number of set bits of n ^ k ^ (n - k). *

One more improvement is that the previous popcount-based code for computing the largest power of two dividing math.comb(n, k) (for small n) got replaced with a more direct method based on counting trailing zeros of the factorials involved. (*).

Quick pyperf example, in Python 3.10:

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 3.72 us +- 0.07 us

# ---

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 11.9 ms +- 0.1 ms

Python 3.11:

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 476 ns +- 20 ns

# ---

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 2.28 ms +- 0.10 ms


In statistics lib: Faster mean(data), variance(data, xbar=None) and stdev(data, xbar=None)

The optimization improves the mean, variance, and stdev functions in the statistics module. If the input is an iterator, it is now consumed in a single pass rather than first being converted to a list. The single-pass algorithm is twice as fast as the previous two-pass code. *
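As an illustration of the single-pass idea, here is Welford's classic algorithm for mean and variance (this is not the statistics module's actual implementation, which uses exact Fraction arithmetic internally):

```python
import statistics

def one_pass_mean_variance(iterable):
    """Welford's algorithm: mean and sample variance in one pass,
    without materializing the data into a list first."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / (n - 1)

data = [0.1 * x for x in range(10)]
m, v = one_pass_mean_variance(iter(data))   # consumes the iterator
assert abs(m - statistics.mean(data)) < 1e-9
assert abs(v - statistics.variance(data)) < 1e-9
```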

Python 3.10:

# Mean
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 255 us +- 11 us

# Variance
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 77.0 us +- 2.9 us

# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 78.0 us +- 2.2 us

Python 3.11:

# Mean
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 193 us +- 7 us

# Variance
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 56.1 us +- 2.3 us

# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 59.4 us +- 2.6 us


Constant time normalization of pure-ASCII strings in unicodedata.normalize()

This optimization speeds up unicodedata.normalize() by returning early from the Unicode quick-check algorithm when the input is a pure-ASCII string. The check is done using PyUnicode_IS_ASCII.

Python 3.10:

$ python -m pyperf timeit -s \
  'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 83.3 ns +- 4.3 ns

Python 3.11:

$ python -m pyperf timeit -s \
  'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 34.2 ns +- 1.2 ns
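The fast path rests on the fact that a pure-ASCII string is already in every normalization form, which can be observed with unicodedata.is_normalized:

```python
import unicodedata

# Pure-ASCII input is already normalized, so the quick-check
# can return immediately.
assert "python".isascii()
assert unicodedata.is_normalized("NFC", "python")
assert unicodedata.normalize("NFC", "python") == "python"

# Non-ASCII input still goes through full normalization:
decomposed = "cafe\u0301"            # 'e' + combining acute accent
assert unicodedata.normalize("NFC", decomposed) == "caf\u00e9"
```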


Final notes: