
Faster CPython at PyCon, part one


By Jake Edge
May 9, 2023
PyCon

Two members of the Faster CPython team, which was put together at Microsoft at the behest of Guido van Rossum to work on major performance improvements for CPython, came to PyCon 2023 to report on what the team has been working on—and its plans for the future. PEP 659 ("Specializing Adaptive Interpreter") describes the foundation of the current work, some of which has already been released as part of Python 3.11. Brandt Bucher, who gave a popular talk on structural pattern matching at last year's PyCon, was up first, with a talk on what "adaptive" and "specializing" mean in the context of Python, which we cover here in part one. Mark Shannon, whose proposed plan for performance improvements in 2020 was a major impetus for this work, presented on the past, present, and future of the Python performance enhancements, which will be covered in part two.

Bucher started out by talking a bit about Python bytecode, but said that developers do not need to know anything about it to get faster code: "just upgrade to 3.11 and you'll probably see a performance improvement". In order to show how the team sped up the interpreter, though, it helps to look inside the Python code. He put up an example Point class that has two floating-point attributes for its position (x and y) and a single shifted() method that takes two offsets to apply to the position of an instance and returns a new instance of the class at the shifted position. He focused on two lines from the method:

    y = self.y + dy
    cls = type(self)
The first line applies the offset for the y axis, while the second gets a reference to the Point class (in preparation for returning a new instance by calling cls(x, y)). He used the dis module to disassemble the method; the relevant piece of bytecode is as follows:
    # part of the output of dis.dis(Point.shifted)
    LOAD_FAST      (dy)
    LOAD_FAST      (self)
    LOAD_ATTR      (y)
    BINARY_OP      (+)
    STORE_FAST     (y)

    LOAD_GLOBAL    (type)
    LOAD_FAST      (self)
    PRECALL        (1)
    CALL           (1)
    STORE_FAST     (cls)
Some of the binary opcodes (and some operations, such as method calls) have changed from those in Python 3.10 and earlier, so your output may differ from what he showed. The bytecode is a set of instructions for Python's stack-based virtual machine. In the first part above, the dy and self values are pushed onto the stack; the LOAD_ATTR instruction then pops self, retrieves its y attribute, and pushes that value. Next, the two values on top of the stack (self.y and dy) are added, the result is pushed, and that result is popped to be stored in the variable y. In the second part, type and self are pushed, then PRECALL and CALL are the two separate steps that perform the type(self) call; its result is popped to be stored in cls.
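
For readers who want to reproduce the disassembly, a minimal version of the Point class as Bucher described it might look like the following; this is a reconstruction based on the description above, not his exact code:

    import dis

    class Point:
        """A point with two floating-point coordinates."""
        def __init__(self, x, y):
            self.x = x
            self.y = y

        def shifted(self, dx, dy):
            # Apply the offsets and return a new instance of the class
            x = self.x + dx
            y = self.y + dy
            cls = type(self)
            return cls(x, y)

    dis.dis(Point.shifted)   # prints the full bytecode; the excerpt above shows part of it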

Adaptive instructions

If that code is run more than a handful of times, Bucher said, it becomes a candidate for optimization in 3.11. The first step is something called "quickening", which the PEP calls "the process of replacing slow instructions with faster variants". "Superinstructions" that combine two related instructions into a single instruction are substituted into the code. For example, the first two LOAD_FAST instructions can be replaced with a single one:

    LOAD_FAST__LOAD_FAST   (dy, self)
It is a simple change that results in a real performance boost, he said. The more interesting change is to replace some instructions with their adaptive counterparts. During the quickening process, five bytecodes are replaced with adaptive versions:
    # part of the output of dis.dis(Point.shifted, adaptive=True)
    LOAD_FAST__LOAD_FAST   (dy, self)
    LOAD_ATTR_ADAPTIVE     (y)
    BINARY_OP_ADAPTIVE     (+)
    STORE_FAST             (y)

    LOAD_GLOBAL_ADAPTIVE   (type)
    LOAD_FAST              (self)
    PRECALL_ADAPTIVE       (1)
    CALL_ADAPTIVE          (1)
    STORE_FAST             (cls)

The adaptive versions perform the same operation as their counterpart except that they can specialize themselves depending on how they are being used. For example, loading an attribute is "actually surprisingly complex", because the attribute could come from a number of places. It could be a name from a module, a class variable from a class, a method of a class, and so on, so the standard load-attribute code needs to be prepared for those possibilities. The simplest case is getting an attribute from an instance dictionary, which is exactly what is being done here.
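A small example (with hypothetical names) shows why: the same obj.attr syntax can resolve to very different places.

    class Widget:
        kind = "generic"            # class variable

        def describe(self):         # method (a descriptor found on the class)
            return self.kind

    w = Widget()
    w.color = "red"                 # instance attribute, stored in w.__dict__

    print(w.color)      # the simple case: found directly in the instance __dict__
    print(w.kind)       # not in w.__dict__, found in the class dictionary
    print(w.describe)   # found on the class; the descriptor protocol builds a bound method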

[Brandt Bucher]

The LOAD_ATTR_ADAPTIVE operation can recognize that it is in the simple case and can ignore all of the other possibilities, so the adaptive instruction changes itself to LOAD_ATTR_INSTANCE_VALUE, a specialized instruction that only contains the fast-path code for this common case.

The specialized instruction will then check to see if the class is unchanged from the last time this attribute was accessed and if the keys in the object's __dict__ are the same. Those much faster checks can be done in lieu of two dictionary lookups (on the class dictionary for a descriptor and on the object's __dict__ for the name); "dictionary lookups are fast, but they still cost something", while the other two checks are trivial. If the conditions hold true, which is the normal situation, the code can simply return the entry from the __dict__ at the same offset that was used previously; it does not need to do any hashing or collision resolution that comes when doing a dictionary lookup.
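
The actual checks live in C and work on CPython's dictionary internals, but the shape of the idea can be sketched in plain Python; the following is a toy simulation with invented names, not the interpreter's code:

    class AttrCache:
        """Toy stand-in for the inline cache attached to the specialized instruction."""
        def __init__(self):
            self.klass = None    # class seen when the instruction specialized
            self.keys = None     # __dict__ key layout seen at that time
            self.index = None    # position of the attribute within that layout

    def cached_load_attr(obj, name, cache):
        d = vars(obj)
        keys = tuple(d)
        # The two guards: same class and same __dict__ key layout as last time?
        if type(obj) is cache.klass and keys == cache.keys:
            # Fast path: reuse the remembered slot; no hashing or collision handling
            return list(d.values())[cache.index]
        # Slow path: do the normal lookup, then remember where the value was found
        value = getattr(obj, name)
        if name in d:
            cache.klass, cache.keys, cache.index = type(obj), keys, keys.index(name)
        return value

The real interpreter compares cheap version tags rather than whole key tuples, which is why the guards cost almost nothing, but the structure is the same: two quick checks guarding a remembered location.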

If either of the checks fails, the code falls back to the regular load-attribute operation. Python is a dynamic language and the new interpreter needs to respect that, but the dynamic features are not being used all of the time. The idea is to avoid paying the cost of dynamism when it is not being used, which, in many Python programs, is much of the time.

Similarly, the BINARY_OP_ADAPTIVE instruction specializes to BINARY_OP_ADD_FLOAT because floating-point values are being used. That operation checks that both operands are of type float and falls back to the (also surprisingly complex) machinery for an add operation if they are not. Otherwise, it can just add the raw floating point values together in C.
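
A rough Python-level sketch of that guard-then-fall-back pattern might look like the following; this is illustrative only, since the real fast path adds the raw C doubles stored inside the float objects:

    import operator

    def binary_op_add_float(left, right):
        # Guard: both operands must be exactly float (subclasses do not count)
        if type(left) is float and type(right) is float:
            return left + right                # fast path
        # Fall back to the general binary-op machinery (__add__, __radd__, and so on)
        return operator.add(left, right)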

Normally, when a global name is being loaded, two dictionary lookups are required; for example, when type() is being loaded, the module's global dictionary must be checked first in case the builtin has been shadowed and, if it has not (which is likely), the name must then be looked up in the builtins dictionary. So LOAD_GLOBAL_ADAPTIVE takes a page from the attribute-loading specialization: it checks whether the global dictionary or the builtins dictionary has changed and, if neither has, simply grabs the value at the same index it used before.
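
The double lookup is easy to observe from Python itself; shadowing a builtin, as in this small illustration, is exactly the kind of change that would invalidate the cached index:

    import builtins

    print("type" in globals())      # False in a normal module: the name is not shadowed
    print(builtins.type)            # ...so the lookup falls through to the builtins dict

    type = lambda obj: "shadowed"   # shadowing the builtin changes the globals dict
    print(type(42))                 # now the first lookup succeeds instead
    del type                        # restore normal behavior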

It turns out that type() is called often enough that it gets its own specialized bytecode. It checks that the code for type() has not changed (by way of a monkey patch or similar) and, if not, simply returns the argument's class. There is a call in the C API to do so and "it's much much cheaper than making the call".
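
In Python terms, the specialized call just hands back the argument's class, the same object available as its __class__ attribute; this is a trivial illustration, not the C implementation:

    x = 1.5
    assert type(x) is x.__class__   # the specialization returns this class directly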

If the specialized instructions fail their tests enough times, they will revert back to the adaptive versions in order to change course. For example, if the Point class starts being used with integer values, BINARY_OP_ADD_FLOAT will revert to BINARY_OP_ADAPTIVE, which would replace itself with BINARY_OP_ADD_INT after a few uses. "This is what we mean by specializing adaptive interpreter".

It may seem like a "bunch of small localized changes"—and it is—but they add up to something substantial. For example, the shifted() method is nearly twice as fast in 3.11 versus 3.10. He obviously chose the example because it specializes well; a 2x performance increase is probably at the upper end of what can be expected. But it does show how replacing the generalized versions of these Python bytecode operations with their more specialized counterparts can lead to large improvements.

Bucher said that the various pieces of information that are being stored (e.g. the index of a dictionary entry) are actually being placed into the bytecode itself. He called those "inline caches" and they can be seen using the show_caches=True parameter to dis.dis(). But Python programmers should not really need to look at the caches, or even at the adaptive instructions, because all of that should not matter. The idea is that the interpreter will still do what it always did, but it can now do it faster in many cases.
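
Both parameters can be tried directly on Python 3.11, for instance with the Point sketch from earlier in this article; the output will vary between CPython versions:

    # Assumes the Point class and "import dis" from the earlier sketch
    p = Point(1.0, 2.0)
    for _ in range(1000):       # warm the method up so that it specializes
        p = p.shifted(0.5, 0.5)

    dis.dis(Point.shifted, adaptive=True, show_caches=True)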

For those who do want to dig under the covers more, though, Bucher recommended his specialist tool. Running some code with specialist will generate a web page with the source code of the program color-coded to show where the interpreter is optimizing well—and where it is not.

Coming soon

He then shifted gears to looking at things that will be coming in CPython 3.12. To start with, the dedicated quickening step has been eliminated and the superinstructions are simply emitted at compile time. In addition, there will be only a single call instruction rather than the two-step dance in 3.11.

Instead of having separate adaptive versions of the operations, the standard bytecodes implement the adaptivity. A bytecode only needs to be executed twice in order to become fully specialized. That is measured on individual bytecodes, rather than a function as a whole, so a loop will specialize immediately even if the surrounding code only gets executed once.

The team has also added more specialized instructions. "We have gotten better at specializing dynamic attribute accesses", so there are specializations for calling an object's __getattribute__() and for loading properties specified using the property() builtin. There are specializations for iterating over the four most-common objects in a for loop: list, tuple, range, and generator. There is also a single specialization for yield from in generators and await in coroutines. There are two others in the works that may make it into CPython 3.12.
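
The loop specializations can be watched with the same tools; here is a small demo assuming a CPython 3.12 interpreter, with the caveat that the instruction names shown will vary between versions:

    import dis

    def total(nums):
        s = 0
        for n in nums:          # this FOR_ITER can specialize for list iteration
            s += n
        return s

    total(list(range(200)))         # run it so the loop executes enough times
    dis.dis(total, adaptive=True)   # look for a specialized FOR_ITER variant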

Meanwhile, the inline caches are being reduced. Having more cached information makes the bytecode longer, which means longer jumps and more memory use, so shrinking the amount of cached data is welcome. The team has been able to eliminate some cache entries or reorganize the caches to make some significant reductions. All of that is coming in Python 3.12—and beyond.

Stay tuned for our coverage of Shannon's talk, which came on the second day of the conference. It will be the subject of part two, which is coming soon to an LWN near you.

[I would like to thank LWN subscribers for supporting my travel to Salt Lake City for PyCon.]


Index entries for this article
Conference: PyCon/2023
Python: CPython
Python: Performance



Faster CPython at PyCon, part one

Posted May 10, 2023 6:37 UTC (Wed) by tchernobog (subscriber, #73595) [Link]

Does anybody know whether specifying type annotations, such as float, allows the interpreter to skip the adaptive instructions and go to their specialized counterparts directly?

Faster CPython at PyCon, part one

Posted May 10, 2023 6:43 UTC (Wed) by dureuill (subscriber, #164965) [Link]

To my knowledge it doesn't: type annotations are not used at runtime. Plus, nothing at runtime prevents a user from calling such an annotated function with the wrong types anyway.

Faster CPython at PyCon, part one

Posted May 10, 2023 8:50 UTC (Wed) by jpfrancois (subscriber, #65948) [Link]

Are there any benchmark numbers available that would illustrate these performance improvements for 3.11 and 3.12?

Faster CPython at PyCon, part one

Posted May 10, 2023 13:04 UTC (Wed) by Twirrim (subscriber, #122505) [Link]

Not precisely yet, but https://speed.python.org/comparison/ can help you compare against the latest master branch.

https://speed.python.org/comparison/?exe=12%2BL%2B3.11%2C... is my best attempt to do comparisons. That gives you the current "master" compared to 3.11, normalised against the latter.

Faster CPython at PyCon, part one

Posted May 10, 2023 13:22 UTC (Wed) by Wol (subscriber, #4433) [Link]

> That gives you current "master" compared to 3.11

Is that "Python for Workgroups"?

(G, d & r)

Cheers,
Wol

Faster CPython at PyCon, part one

Posted May 10, 2023 20:26 UTC (Wed) by pebolle (subscriber, #35204) [Link]

> [I would like to thank LWN subscribers for supporting my travel to Salt Lake City for PyCon.]

You're welcome. That's why we support LWN, to cover its expenses and provide its staff with an income. That is all pretty obvious. So why did you feel the need to add this to this article?

Faster CPython at PyCon, part one

Posted May 10, 2023 21:12 UTC (Wed) by amacater (subscriber, #790) [Link]

Possibly this is written in this way so that it can be shown that this wasn't a cost-free attendance funded by the conference giver in return for a favourable review.

Faster CPython at PyCon, part one

Posted May 10, 2023 21:43 UTC (Wed) by pebolle (subscriber, #35204) [Link]

> Possibly this is written in this way so that it can be shown that this wasn't a cost-free attendance funded by the conference giver in return for a favourable review.

Yes, that's a murky area. I'm not sure what to think here. Is it OK for the press to accept free entrance to expensive conferences?

But the disclaimer only mentioned travel, which I think is a pretty obvious expense for LWN. Actually, I'm totally OK with whatever LWN decides to spend our contributions on. Free champagne and caviar for its staff? Fine with me.

Faster CPython at PyCon, part one

Posted May 12, 2023 5:26 UTC (Fri) by ebwb (guest, #144520) [Link]

I always feel delighted when I see it. Sometimes it's nice to have a bit of feedback from a different, perhaps atypical channel as to where my funding helps support LWN's work.

Maybe they literally mean to thank you, me, and all of us! Surely there are some conferences the individual LWN staff members wouldn't be able to attend were it not for their work, so they are expressing gratitude for the fact that we're all here making it happen for those individuals.

Faster CPython at PyCon, part one

Posted May 10, 2023 21:15 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link]

> You're welcome. That's why we support LWN, to cover its expenses and provide its staff with an income. That is all pretty obvious. So why did you feel the need to add this to this article?

It looks like it is done to distinguish it from articles written where some other entity is sponsoring the travel. As an example, https://lwn.net/Articles/637211/ has this note instead:

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]

IIRC this was prompted by questions from readers before on whether sponsorship is involved and in some cases, the answer is yes.

Faster CPython at PyCon, part one

Posted May 10, 2023 21:31 UTC (Wed) by pebolle (subscriber, #35204) [Link]

> IIRC this was prompted by questions from readers before on whether sponsorship is involved and in some cases, the answer is yes.

But apparently no sponsorship was involved. So why bother with this message here?
Otherwise every article might end with:

[I would like to thank LWN subscribers for funding the device I've written this article on and the purchase of literature I consulted while researching this article.]

Yes, we know. That's why we support you.

Faster CPython at PyCon, part one

Posted May 11, 2023 0:15 UTC (Thu) by jake (editor, #205) [Link]

> You're welcome. That's why we support LWN, to cover its expenses and provide its staff with an
> income. That is all pretty obvious. So why did you feel the need to add this to this article?

We put those on articles in part to remind others, who may not be subscribers, that it does cost money to do all of this stuff. And of course to thank folks like you.

FWIW, we do generally get free tickets to conferences, either because we are speaking or because we ask for a press pass (as with PyCon). Press passes are pretty normal for tech (and other) journalism. We presumably could not go to as many conferences without them.

jake

Faster CPython at PyCon, part one

Posted May 10, 2023 21:48 UTC (Wed) by kleptog (subscriber, #1183) [Link]

I first heard of this trick of specialising opcodes with respect to the Erlang VM. The interesting thing is that this is a step in the direction of a JIT. Namely, once you can recognise that all the instructions in a loop are specialised versions, it's but a small step to emitting asm code directly.

Faster CPython at PyCon, part one

Posted May 11, 2023 11:56 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

Except in Python the 'emitted asm' still has to check that the necessary invariants have not changed; dynamic languages make this much more difficult than otherwise. At some point if the 'invariant checking' code is large enough, inlining it in every location where the JIT has turned the code into 'emitted asm' will have its own costs (memory pressure, caching, etc.)

Faster CPython at PyCon, part one

Posted May 11, 2023 17:24 UTC (Thu) by anton (subscriber, #25547) [Link]

One step between interpreted virtual-machine (VM) code and native code ("emitting asm") is dynamic superinstructions (aka selective inlining). Combining this with quickening is non-trivial, but possible. Having multiple alternative quickened superinstructions, and switching back to the slow variant, and then to a different quickened superinstruction, is not covered in that paper, but I think it's possible.

Faster CPython at PyCon, part one

Posted May 12, 2023 7:23 UTC (Fri) by Wol (subscriber, #4433) [Link]

I doubt they'll go down this road, but ...

Interpreted VM code is also known as p-code, no? And I know of at least two CPUs that were designed (Pascal) or modified (50-series) to execute p-code directly :-)

Isn't that also pretty much a description of CRISC systems?

Cheers,
Wol

Faster CPython at PyCon, part one

Posted May 15, 2023 10:07 UTC (Mon) by massimiliano (subscriber, #3048) [Link]

If you want to see state-of-the-art dynamic opcode specialization look at what V8 (the Chrome Javascript VM) has been doing for more than one decade :-)

Faster CPython at PyCon, part one

Posted May 12, 2023 19:03 UTC (Fri) by gps (subscriber, #45638) [Link]

When continual invariant checking would be overwhelming, another approach is commonly used: assume invariants are true but record all state changes in a way that can be undone. So you proceed with some execution and do the invariant check before any point that would force you to commit and expose your state change to the broader system in a way that cannot be rolled back. In the rare case those invariant checks fail, you back out all uncommitted state changes from that piece of execution and take the slow invariant handling path. If it fails often, you flag that bit of code and don't run such a state assuming aggressive JIT translation on that part in the future.

I don't know how feasible this approach is for a high level language, but it has been implemented time and time again in JIT/translation-like systems over the past several decades.

We did this in Transmeta's CMS (code morphing software) that implemented x86, but that was obviously much lower level and on hardware designed for the purpose.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds