
Faster CPython at PyCon, part two


By Jake Edge
May 12, 2023
PyCon

In part one of the tale, Brandt Bucher looked specifically at the CPython optimizations that went into Python 3.11 as part of the Faster CPython project. More of that work will be appearing in future Python versions, but on day two of PyCon 2023 in Salt Lake City, Utah, Mark Shannon provided an overall picture of CPython optimizations, including efforts made over the last decade or more, with an eye toward the other areas that have been optimized, such as the memory layout for the internal C data structures of the interpreter. He also described some additional optimization techniques that will be used in Python 3.12 and beyond.

Background

Shannon said that he had been thinking about ideas for speeding up CPython for quite some time; he showed a picture of himself giving a presentation on the subject at EuroPython 2011. He has been researching virtual machines and how to improve their performance since around 2006. He wanted to think in terms of time spent, rather than speed, however: a 5x speedup means the program runs in one-fifth of the time, which is an 80% reduction in execution time.

[Mark Shannon]

In order to achieve these performance increases, it is important to consider the performance of the entire runtime. If you speed up 90% of a program by nine times, but at the cost of slowing down the remaining 10% by nine times as well, there is no difference in the overall execution time. Even making 80% of the program 10x faster at the cost of a 3x slowdown for the remainder only reduces the execution time to 68% of what it originally was.
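
The arithmetic behind those figures is easy to check; here is a small sketch (not something shown in the talk):

    # New execution time, as a fraction of the original, after speeding up
    # part of a program while slowing down the rest.
    def new_time(fraction_sped_up, speedup, slowdown_of_rest):
        rest = 1.0 - fraction_sped_up
        return fraction_sped_up / speedup + rest * slowdown_of_rest

    print(new_time(0.9, 9, 9))    # ~1.0  -- no overall improvement at all
    print(new_time(0.8, 10, 3))   # ~0.68 -- execution time drops to 68%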

People focus on the just-in-time (JIT) compiler for the V8 JavaScript engine, but that is only part of what allows that engine to be so fast. It has "incredibly sophisticated garbage collection", optimized object layouts, and "all sorts of clever things to speed up many aspects of what it has to do". For example, it does its garbage collection incrementally every 1/60th of a second, so as not to disturb animations. "Yes it has a just-in-time compiler, but there's many other parts to it".

There are some guiding principles that the Faster CPython project is following in order to improve the performance of the language. The first is that "nothing is faster than nothing"; the "best way to make something faster is to not do it at all". The project bends the rule a bit by, say, only doing something once ahead of time, rather than doing it over and over as the program executes.

Another principle involves speculation, which is making guesses about future behavior based on past behavior. A CPU's hardware branch predictor does the same kind of thing; it speculates on which branch will be taken based on what has come before (though, of course, we know now that hardware speculation comes with some dangers). The interpreter can take advantage of speculation; if the previous 99 times it added two things together they were always integers, it is pretty likely they will be integers the 100th time as well.
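
A toy sketch of the idea (nothing like CPython's actual machinery, which does its counting and guarding at the bytecode level) might look like:

    # Toy illustration of speculation: profile the operand types seen at a
    # site and, once confident, take a guarded "fast" path.  In CPython the
    # payoff comes from skipping generic dispatch in C, which a pure Python
    # sketch like this cannot show.
    class AddSite:
        def __init__(self):
            self.int_hits = 0

        def execute(self, x, y):
            if self.int_hits >= 99 and type(x) is int and type(y) is int:
                return x + y            # speculated fast path: both are ints
            if type(x) is int and type(y) is int:
                self.int_hits += 1      # keep counting until we are confident
            return x + y                # generic path for anything else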

Memory layout

Efficient data structures are another important part of what the project is working on; by that he means "how we lay stuff out in memory". The goal is to have "more compact and more efficient data structures", which will require fewer memory reads, so that more of the data the program needs lives in the caches. To start with, he wanted to talk about reductions in the size of the Python object, which has mostly already been done at this point. He gave an example do-nothing class:

    class C:
        def __init__(self, a, b, c, d):
            self.a = a
            self.b = b
            self.c = c
            self.d = d

If you look at an instance, it has a simple, obvious instance dictionary:

    >>> C(1, 2, 3, 4).__dict__
    {'a': 1, 'b': 2, 'c': 3, 'd': 4}

Back in the "olden days" of Python 2.7—"maybe some of you are lucky enough and young enough to not remember 2.7, but most of us do"—and even up through Python 3.2, the Python objects used to represent an instance were complicated, weighing in at 352 bytes (on a 64-bit machine). The object itself is relatively small, but it points to two other objects: a reference to the class object (i.e. C) and another for the instance __dict__. The class reference is shared by all of the instances; for 1000 instances, the price of that object is amortized, so he was ignoring it. Data that can be shared between instances can similarly be ignored, which is why such sharing is desirable.

But the __dict__ is specific to each instance and contains a hash table with keys and their hashes that are identical for each instance, which is redundant. So in Python 3.3, the keys and hashes were moved into a shared structure, which reduced the size to 208 bytes per instance. The values were still stored in a table with space for additional keys, but that went away in Python 3.6 with the addition of compact dictionaries, which had the side effect of causing dictionaries to maintain their insertion order. The compact dictionaries dropped the size to 192 bytes.
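
That ordering side effect is easy to observe (a trivial check, not from the talk):

    # Since CPython 3.6, dicts preserve insertion order as a side effect of the
    # compact layout; Python 3.7 made that a language guarantee.
    d = {}
    d['b'] = 1
    d['a'] = 2
    print(list(d))    # ['b', 'a'] -- keys come back in insertion order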

There were still some small inefficiencies in the object header because there were three word-sized garbage-collection header fields, which meant another word was added for alignment purposes. In Python 3.8 one of those garbage-collection fields was removed, so the alignment padding could be as well. That reduced the cost of each instance to 160 bytes, "which is already less than half of where we started".

But, in truth, the dictionary object itself is actually redundant. Nearly all of the information that the object has can be gotten from elsewhere or is not needed. It has a class reference, but that is already known: it is a dict. The keys can be accessed from the shared C class object and the table of values can be moved into the instance object itself. So that stuff was eliminated in Python 3.11, reducing the size per instance to 112 bytes.

Python 3.12 will rearrange things a bit to get rid of another padding word; it also shares the reference to the values or __dict__ by using a tag in the low-order bit. The __dict__ is only used if more attributes are added to the instance than the four initial ones. That results in 96 bytes per instance. There are some more things that could be done to perhaps get the size down to 80 bytes in the future, but he is not sure when that will happen (maybe 3.14 or 3.15).

So, from Python 2.7/3.2 to the hypothetical future in a few years, the size of an instance of this object has dropped from 352 to 80 bytes, while the number of memory accesses needed to access a value dropped from five to two. That is still roughly twice as much work (and memory) as Java or C++ need, but it was five times as much work and memory at the start. There is still a price for the dynamism that Python provides, but to him (and he hopes the audience agrees) it has been reduced to a "reasonable price to pay".
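
The byte counts above refer to the interpreter's internal C structures and vary by version; a rough way to poke at some of that overhead from Python itself (only approximate, since sys.getsizeof() neither follows references nor accounts for shared key tables) is:

    import sys

    class C:
        def __init__(self, a, b, c, d):
            self.a, self.b, self.c, self.d = a, b, c, d

    obj = C(1, 2, 3, 4)
    print(sys.getsizeof(obj))            # the instance object itself
    print(sys.getsizeof(obj.__dict__))   # the __dict__, materialized on access
                                         # in recent versions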

Interpreter speedups

He switched over to looking at what has been done on speeding up the interpreter over the years as an introduction to what is coming on that front in the future. Unlike reducing object sizes, not much work has gone into interpreter speedups until quite recently. In 3.7, method calls were optimized so that the common obj.method() pattern did not require creating a temporary callable object for the method (after the attribute lookup) before calling it. In addition, the values for global names started to be cached in Python 3.8, so instead of looking up, say, int() in the Python builtins every time it was needed, the cache could be consulted; global variables were treated similarly. Looking up builtins was somewhat costly since it required checking the module dictionary first to see if the name had been shadowed; now the code checks to see if the module dictionary has changed and short-circuits both lookups if it has not.
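
The exact opcode names differ between versions, but the dis module makes it easy to see how such call sites are compiled (a quick experiment, not something from the talk):

    import dis

    def call_it(obj):
        return obj.method()    # the common obj.method() pattern

    # On Python 3.7 through 3.11 this disassembles to LOAD_METHOD rather than a
    # LOAD_ATTR that would build a temporary bound-method object before the call.
    dis.dis(call_it)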

The PEP 659 ("Specializing Adaptive Interpreter") work went into Python 3.11; it is focused on optimizing single bytecode operations. But he would not be covering that since Bucher had given his talk the previous day. In fact, Shannon suggested that people watching his talk online pause it to go watch Bucher's talk; since Bucher had done much the same thing in his talk, it made for a bit of mutually recursive fun that the two had obviously worked out in advance.

The future work will be focused on optimizing larger regions of code; "obviously one bytecode is as small a region as we can possibly optimize", Shannon said, but optimizers "like bigger chunks of code" because it gives them more flexibility and opportunities for improving things. Some of this work would likely appear in 3.13, but he was not sure how much of it would.

He used a simple function add() that just adds its two arguments; it is a somewhat silly example, but larger examples do not fit on slides, he said. If a particular use of the function needs optimization, because it is done frequently, the bytecode for add() can effectively be inlined into a use of it. But, because of Python's dynamic nature, there must be a check to determine if the function has changed since the inlining was done; if so, the original path needs to be taken. Then, the specialization mechanism (which Bucher covered) can be used to check that both operands are integers (assuming the profiling has observed that is what is normally seen here) and perform the operation as a "considerably faster" integer-addition bytecode.
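
Reconstructed from the description (the parameter names are guesses), the example function is simply:

    def add(x, y):
        return x + y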

That specialization enables a more powerful optimization, partial evaluation, which is a huge area of research that he said he could only scratch the surface of in the talk. The idea is to evaluate things ahead of time so that they do not have to be recomputed each time. His add() example had optimized the following use of the function:

    a = add(b, 1)

But there are parts of even the optimized version that can be removed based on some analysis of what is actually required to produce the correct result. The first inlined and specialized version of that statement required 13 bytecode operations, some of which are rather expensive.

Doing a kind of virtual execution of that code, and tracking what is needed in order to produce the correct result, reduced that to five bytecode instructions. It effectively only needs to check that b is an integer, then does the integer addition of the two values and stores the result. "What I've clearly shown here is that with a suitably contrived example you can prove anything," he said with a grin, but "this is a real thing" that can be used for Python. When the video becomes available, which should hopefully be soon, it will be worth watching that part for those interested in understanding how that analysis works. [Update: Video link]
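
In loose Python-level terms (a paraphrase of the idea; the real optimization operates on bytecode, not on source like this), the surviving work amounts to something like:

    def add(x, y):
        return x + y

    b = 41
    if type(b) is int:        # guard: the speculated common case
        a = b + 1             # specialized integer addition; store the result
    else:
        a = add(b, 1)         # deoptimize: fall back to the original call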

Combining

The optimization techniques that he has been talking about can be combined to apply to different problem areas for Python's execution speed, he said. As described, the bytecode interpreter benefits from partial evaluation, which depends on specialization. Once the bytecode sequences have been optimized with those techniques, it will be worth looking at converting them directly to machine code via a JIT compiler. Meanwhile, the cost of Python's dynamic features can be drastically reduced using specialization for the places where that dynamic nature is not being used.

The better memory layout for objects helps with Python's memory-management performance, which can also be augmented with partial evaluation. Another technique is to "unbox" numeric types so that they are no longer handled as Python objects and are simply used as regular numeric values.

While much of Python's garbage collection uses reference counting, that is not sufficient for dealing with cyclic references from objects that are no longer being used. Python has a cyclic garbage collector, but it can be a performance problem; that area can be improved with better memory layout. It may also make sense to do the cyclic collection on an incremental basis, so that different parts of the heap are handled by successive runs, reducing the amount of time spent in any given invocation of it.
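
The cycles in question are ones that reference counting alone can never reclaim; a small demonstration of the existing cyclic collector at work (unrelated to the incremental plans) using the gc module:

    import gc

    class Node:
        pass

    a, b = Node(), Node()
    a.other, b.other = b, a    # a reference cycle: refcounts never drop to zero
    del a, b                   # now garbage, but only the cyclic collector sees it
    print(gc.collect())        # runs a full collection and reports how many
                               # unreachable objects it found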

The C extensions are another area that needs attention; specialization and unboxing will help reduce the overhead in moving between the two languages. A new C API could help with that as well.

So there are various aspects of running Python programs that need to be addressed and multiple techniques to do so, but there is "quite a lot of synergy here". The techniques help each other and build on each other. "Python is getting faster and we expect it to keep getting faster". The upshot is to upgrade to the latest Python, he concluded, to save energy—and money.

After the applause, he did put in his normal plea for benchmarks; the team has a standard set it uses to guide its work. "If your workloads are not represented in that set, your workloads are not necessarily getting any faster". He had time for a question, which was about the full-program memory savings from the reductions in the object size, but Shannon answered with a common refrain of his: "it depends". The savings seen will be workload-dependent but also dependent on how much Python data is being handled; in truth, he said, the layout optimizations were mostly done for the purposes of performance improvement, with memory savings as a nice added benefit.

[I would like to thank LWN subscribers for supporting my travel to Salt Lake City for PyCon.]


Index entries for this article
Conference: PyCon/2023
Python: CPython
Python: Performance



Faster CPython at PyCon, part two

Posted May 12, 2023 23:00 UTC (Fri) by ejr (subscriber, #51652) [Link]

ObLisp: Anyone consider the optimizations from CMUCL/SBCL's online compiler named... ... ... Python?

Faster CPython at PyCon, part two

Posted May 15, 2023 15:48 UTC (Mon) by Trou.fr (subscriber, #26289) [Link]

Thanks for the article. I'm very surprised that the optimizations described in the two articles are only being implemented now, as they seem basic (like caching lookups). Is there any simple/logical explanation as to why nobody implemented them before?
I'd be genuinely interested in an article covering the Python ecosystem, a bit like the kernel contributors analysis. For such a widely used language, the lack of a team focused on optimizing the runtime is extremely strange.

Faster CPython at PyCon, part two

Posted May 18, 2023 14:44 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

AFAIK there wasn't any organization (company or other) willing to put resources into funding such a team until recently. There has been plenty of interest in improving performance and reducing memory consumption, but those aren't tasks that can reasonably be done by people who are not working on the interpreter full-time. The presenter of this talk (Mark Shannon) published a whitepaper and a roadmap, but it took a sponsor (in this case Microsoft, and then others) to take up the challenge and fund a group of developers to focus on it.

Faster CPython at PyCon, part two

Posted May 21, 2023 21:06 UTC (Sun) by anton (subscriber, #25547) [Link]

Looking at the Unladen Swallow Retrospective, there was someone who was willing to do the work, and Google did initial funding, but eventually stopped funding it, because apparently nobody at Google uses Python for performance-critical work.

Still, as you write, there is interest in improving Python performance, maybe not because there is a product-related need for it, but because it is something that appeals to Python people; so the topic comes up again and again.

Faster CPython at PyCon, part two

Posted May 24, 2023 23:07 UTC (Wed) by 0xdky (guest, #141261) [Link]

Could it be to reduce the cost of running a lot of analytics and ML jobs written in Python for these large cloud providers? If they can save on compute, they either get a bigger profit or can reduce the price and get more customers.

With the AI/ML wave, there could be an uptick in Python-based jobs running on public clouds.

Overall, I am very happy to see these improvements and technical content on LWN. Keep it coming!

Faster CPython at PyCon, part two

Posted May 27, 2023 12:21 UTC (Sat) by Rudd-O (guest, #61155) [Link]

Google internally uses Python. But not as much. And almost all highly-performant code is C++ or Go or Java these days.

It's a matter of software quality constraints, mostly; back then, when we did not have mypy, it was comparatively harder to be certain that a Python program was gonna do what the programmer intended it to do after TAP was done TAPping, RAPID was done RAPIDing, and your PAR file was ready for your MPM (lengthy processes prior to deploy).

Modern Python with types (and a disciplined programmer who understands the type system) is almost as good as Go. Rust is still quite superior. Obviously these are only my opinions.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds