
A viable solution for Python concurrency


By Jonathan Corbet
October 14, 2021
Concerns over the performance of programs written in Python are often overstated — for some use cases, at least. But there is no getting around the problem imposed by the infamous global interpreter lock (GIL), which severely limits the concurrency of multi-threaded Python code. Various efforts to remove the GIL have been made over the years, but none have come anywhere near the point where they would be considered for inclusion into the CPython interpreter. Now, though, Sam Gross has entered the arena with a proof-of-concept implementation that may solve the problem for real.

The concurrency restrictions in the CPython interpreter are driven by its garbage-collection approach, which uses reference counts on objects to determine when they are no longer in use. These counts are busy; many types of access to a Python object require a reference-count increment and (eventually) decrement. In a multi-threaded program, reference-count operations must be performed in a thread-safe manner; the alternative is to risk corrupted counts on objects. Given the frequency of these operations, corruption in multi-threaded programs would be just a matter of time, and perhaps not much time at that. To avoid such problems, the GIL only allows one thread to be running in the interpreter (i.e. to actually be running Python code) at a time; that takes away almost all of the advantage of using threads in any sort of compute-intensive code.
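To make that concrete, here is a minimal sketch of classic CPython-style reference counting (illustrative only, not the real headers); the GIL is what keeps these plain read-modify-write operations safe:

#include <stdlib.h>

typedef struct object {
    long ob_refcnt;              /* a plain field, protected only by the GIL */
    /* ... type pointer and object data follow in the real struct ... */
} object;

static void incref(object *op)
{
    op->ob_refcnt++;             /* load, add, store: racy without the GIL */
}

static void decref(object *op)
{
    if (--op->ob_refcnt == 0)    /* two racing decrefs can double-free */
        free(op);                /* stands in for the type's deallocator */
}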

The reference-count problem can be trivially solved (for a relatively advanced value of "trivially") by using atomic operations to increment and decrement the counts. There is just one little problem with this solution, as outlined in this design document posted by Gross:

The simplest change would be to replace non-atomic reference count operations with their atomic equivalents. However, atomic instructions are more expensive than their non-atomic counterparts. Replacing Py_INCREF and Py_DECREF with atomic variants would result in a 60% average slowdown on the pyperformance benchmark suite.

Given that the vast majority of Python programs are single-threaded (and likely to remain so), it is not surprising that there has never been much appetite for solutions that impose this sort of cost.

Biases, immortals, and deferrals

Gross has taken three different approaches to the CPython reference-count problem, the first of which is called "biased reference counts" and is described in this paper by Jiho Choi et al. With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads. Since the owner has exclusive access to its count, increments and decrements can be done with fast, non-atomic instructions. Any other thread accessing the object will use atomic operations on the shared reference count.

Whenever the owning thread drops a reference to an object, it checks both reference counts against zero. If both the local and the shared count are zero, the object can be freed, since no other references exist. If the local count is zero but the shared count is not, a special bit is set to indicate that the owning thread has dropped the object; any subsequent decrements of the shared count will then free the object if that count goes to zero.
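As a hedged sketch of that owner-side bookkeeping (field and helper names are invented, and the reserved flag bits described below are ignored):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    unsigned long  tid;            /* id of the owning thread */
    unsigned long  local_refcnt;   /* owner only: plain operations */
    _Atomic long   shared_refcnt;  /* all other threads: atomic operations */
} biased_object;

/* Called when the owning thread drops a reference. Returns true if the
   object can be freed immediately. */
static bool owner_decref(biased_object *op)
{
    if (--op->local_refcnt != 0)       /* fast, non-atomic decrement */
        return false;
    if (atomic_load(&op->shared_refcnt) == 0)
        return true;                   /* no references anywhere: free it */
    /* Shared references remain: mark "owner is done" so that the last
       shared decrement frees the object. (The real code performs this
       merge atomically; set_queued_bit() is an invented helper.) */
    set_queued_bit(op);
    return false;
}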

This algorithm improves reference-count performance because, of all the objects that any thread will create, few will be shared with other threads. So, most of the time, the shared reference count will be unused and the cost of using atomic operations to manipulate that count will be avoided.

There are, naturally, some subtleties in how the reference counts are handled. One of those is that, for reasons to be described next, the two least-significant bits of the local reference count are reserved. An increment to the local reference count, thus, adds four to that count. These details are hidden in the Py_INCREF() and Py_DECREF() macros, so most code need not be aware of them.

Some objects are heavily shared between threads, though; these include singletons like True, False, and None, as well as small integer values, some type objects, and more. These objects will also never go away during the execution of the program — they are "immortal" objects for which reference counting is a waste. Gross's CPython interpreter marks these objects by setting the least-significant bit in the local reference count. If that bit is set, the interpreter doesn't bother tracking references for the relevant object at all. That avoids contention (and cache-line bouncing) for the reference counts in these heavily-used objects. This "optimization" actually slows single-threaded accesses down slightly, according to the design document, but that penalty becomes worthwhile once multi-threaded execution becomes possible.

Other objects in Python programs may not be immortal, but they are still long-lived; functions and modules fall into this category. Here, too, it can make sense to avoid the cost of reference counting. The idea makes even more sense when one realizes that many function and module objects, by virtue of appearing in the globals dictionary, essentially form reference-count cycles anyway and their counts will never go to zero. For these objects, a technique called "deferred reference counting" is used; the second-least-significant bit in the local reference count is set, and (most) reference counting is skipped. Instead, a garbage-collection pass is used to find and free unused objects.

"Most" reference counting is skipped because the CPython interpreter does not, on its own, have a complete picture of whether an object using deferred reference counting is truly unused. Specifically, extension code written in C could be holding references that the interpreter cannot see. For this reason, reference counting is only skipped within the interpreter itself; any other code will manipulate the reference counts as usual.

Other changes and results

The reference-counting changes are a key part of Gross's work, but not all of it. The interpreter's memory allocator has been replaced with mimalloc, which is thread-safe, fast, and is able to easily support garbage-collection operations. The garbage collector itself has been modified to take advantage of mimalloc, but is still "a single-threaded, stop-the-world implementation". A lot of work has gone into the list and dict implementations to make them thread-safe. And so on.

Gross has also put some significant work into improving the performance of the CPython interpreter in general. This was done to address the concern that has blocked GIL-removal work in the past: the performance impact on single-threaded code. The end result is that the new interpreter is 10% faster than CPython 3.9 for single-threaded programs. That will certainly sweeten the pot when it comes to acceptance of this work, though, as Guido van Rossum noted, the Python developers could always just take the performance improvements without the concurrency work and be even faster yet.

That seems like an unlikely outcome, though, if this work stands up to closer scrutiny. When pointed to the "ccbench" benchmark, Gross reported speedups of 18-20x when running with 20 threads. That is the kind of concurrency speedup that Python developers have been wanting for a long time, so it is unsurprising that this work has seen an enthusiastic reception. As an added bonus, almost all Python programs will run on the modified interpreter without changes. The biggest source of problems might be multi-threaded programs with concurrency-related bugs that have been masked by the GIL until now. Extensions written in C will need to be recompiled, but most of them will not need to be changed unless they (as some evidently do) access reference counts directly rather than using the provided macros.

The end result thus appears to be a GIL-removal effort that has a rather better-than-average chance of making it into the CPython interpreter. That would be cause for a lot of rejoicing among Python developers. That said, a change this fundamental is unlikely to be rushed into the CPython mainline; it will take a lot of testing to convince the community that it is ready for production use. Interested developers may be able to hasten that process by testing this work with their programs and reporting the results.

Index entries for this article
Python: Global interpreter lock (GIL)



A viable solution for Python concurrency

Posted Oct 14, 2021 15:57 UTC (Thu) by chris_se (subscriber, #99706) [Link]

Having programmed extensions for CPython, I believe the main issue is going to be the underlying assumptions these extensions make about the immutability of Python objects while they work with them.

For example, say you have an extension that has a function foo() that takes a parameter that's a dict. Most extension functions don't drop the GIL by themselves, which means that while they are being called, no other thread can alter the dict that has been passed in to the function. However, if the GIL disappears, it may be possible that the dict is altered from another thread while the function is still accessing it. And even if all the PyDict_* functions are changed to be thread-safe themselves, people will typically call various PyDict functions in succession, and if the object changes in between calls, this can easily cause C extensions to misbehave.
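For instance, a hypothetical helper built from real PyDict calls shows the pattern (safe under the GIL, racy without it):

#include <Python.h>

/* Returns a new reference to d[key], or to dflt if the key is absent. */
static PyObject *get_or_default(PyObject *d, PyObject *key, PyObject *dflt)
{
    if (PyDict_Contains(d, key) == 1) {
        /* Without the GIL, another thread can delete key right here... */
        PyObject *v = PyDict_GetItemWithError(d, key);  /* borrowed; now maybe NULL */
        Py_XINCREF(v);
        return v;   /* a sketch: real code must handle the NULL case */
    }
    Py_INCREF(dflt);
    return dflt;
}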

This type of problem will be even more prominent when it comes to custom classes that extensions create, because no extension is currently performing any locking on their own objects right now (why would they?). On the other hand, certain types of concurrent access to objects from different threads might even be desirable -- if the structure doesn't change, I think different threads should be able to simultaneously access numpy arrays, for example, so the CPython API should provide some means of solving this.

I still think it's going to be worthwhile to get rid of the GIL, making Python much more useful on current CPUs (where core counts have increased much more in recent years than single-core processing speeds). And I welcome the efforts made here to walk in that direction. But it's clear that extension compatibility will go beyond just recompiling them -- the Python project will have to issue clear guidance for extension developers as to how to properly handle this (maybe by locking individual objects -- but if multiple objects are involved, e.g. multiple parameters passed to a function, you need to start thinking about deadlocks).

A viable solution for Python concurrency

Posted Oct 14, 2021 17:50 UTC (Thu) by azumanga (subscriber, #90158) [Link]

I agree with this. I worked on another similar parallelization attempt (for the GAP language, www.gap-system.org). Once the parallelization "core" was done, it was clear every library was going to misbehave, or crash, forever.

We fixed many of the problems by having objects owned by a single thread at a time, but then all code needs to know about transferring objects around, which is very painful for highly recursive objects.

A viable solution for Python concurrency

Posted Oct 14, 2021 19:13 UTC (Thu) by iabervon (subscriber, #722) [Link]

I think that just means that running present-day extension code requires the same "stop the world" operation that GC does in this design, so it would be possible to provide the same behavior as today, at the cost of having C extension parallelism stay as bad as Python parallelism is currently. Of course, you'd want to allow a C method to declare that it's okay if Python code can run between its library calls, so that you can make extensions that do parallelize well.

A viable solution for Python concurrency

Posted Oct 14, 2021 19:21 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

This is not limited to C extensions. Consider for example:

>>> def foo(l, x):
...   l += [x]
... 
>>> dis.dis(foo)
  2           0 LOAD_FAST                0 (l)
              2 LOAD_FAST                1 (x)
              4 BUILD_LIST               1
              6 INPLACE_ADD
              8 STORE_FAST               0 (l)
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE

Assuming that l is a list, this function atomically appends x to it. We know that it must be atomic, because INPLACE_ADD is a single bytecode instruction, the GIL must be held for the entire execution of that instruction, and lists are implemented in C, so this instruction does not create an additional stackframe of Python bytecode. The STORE_FAST is a red herring; it just rebinds a local variable in a scope which we immediately destroy (so the interpreter could have optimized it out without altering the semantics of the function).

The problem is that I don't see a reasonable way to preserve that atomicity:

  • Automatic fine-grained locking would slow down the single-threaded case.
  • Automatic smart fine-grained locking, where the list is only locked if it's shared by multiple threads, is less likely to help than you might expect, because Python has a lot of shared mutable dictionaries in its internal implementation (basically, every scope is a dict or dict-like-thing), which are subject to the same problem. So you still end up with a lot of unnecessary (or "probably" unnecessary) locking in the multi-threaded case.
  • You could do some kind of lock-free data structure, but this is a general problem affecting dicts, lists, sets, and probably other data structures as well. It would require a lot of work and complexity, and I'm not even sure if every instance of this problem is solvable that way.

A viable solution for Python concurrency

Posted Oct 15, 2021 16:25 UTC (Fri) by colesbury (subscriber, #137476) [Link]

> The problem is that I don't see a reasonable way to preserve that atomicity...

The design uses "automatic fine-grained locking". The locks are acquired around mutating operations, but not read-only operations. I estimate that the locks add about 1.5% overhead. (See the sections "Collection thread-safety" and "Performance")

https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfz...
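A toy illustration of that split (the lock helpers are invented; the real branch uses WebKit-style locks, as noted elsewhere in this thread): mutating operations take a per-object lock, while read-only operations skip it entirely.

static int list_append_locked(PyObject *self, PyObject *item)
{
    object_lock(self);                        /* invented per-object lock */
    int ret = list_append_impl(self, item);   /* invented: does the real work */
    object_unlock(self);
    return ret;
}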

A viable solution for Python concurrency

Posted Oct 16, 2021 2:40 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

Hmm. The design document lists a seven-step procedure for what was previously a three-step process to read an item in a collection (a rough C sketch follows the list):

1. Load the version counter from the collection
2. Load the “backing array” from the collection
3. Load the address of the item (from the “backing array”)
4. Increment the reference count of the item, if it is non-zero (otherwise retry)
5. Verify that the item still exists at the same location in the collection (otherwise retry)
6. Verify that the version counter did not change (otherwise retry)
7. Return the address of the item
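A rough C rendering of that procedure (all type and helper names invented):

#include <Python.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t version;                  /* bumped by every mutation */
    PyObject *_Atomic *_Atomic backing_array;  /* slots may be swapped out */
} collection;

int try_incref(PyObject *item);   /* invented: increments only if non-zero */

static PyObject *optimistic_getitem(collection *c, size_t i)
{
    for (;;) {
        uint64_t ver = atomic_load(&c->version);                  /* step 1 */
        PyObject *_Atomic *arr = atomic_load(&c->backing_array);  /* step 2 */
        PyObject *item = atomic_load(&arr[i]);                    /* step 3 */
        if (item == NULL)
            return NULL;                    /* empty slot: no item here */
        if (!try_incref(item))                                    /* step 4 */
            continue;                       /* object was being freed */
        if (atomic_load(&arr[i]) == item &&                       /* step 5 */
            atomic_load(&c->version) == ver)                      /* step 6 */
            return item;                                          /* step 7 */
        Py_DECREF(item);                    /* lost a race: drop and retry */
    }
}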

Suppose you've got a shared dict foo, and thread A is constantly fiddling with it, adding stuff, removing stuff, changing stuff. But your thread B just reads foo['myValue'] which you're sure thread A has no reason to change. Today under the GIL this seems fine, but the program is slow because the GIL forbids threads A and B from both getting work done. With the 7 step procedure, often reading foo['myValue'] will fail at step 6 because thread A meanwhile incremented the version counter. I believe you would take a lock on the container then, to avoid starvation even though you're a "read-only" operation (you actually write to the reference counter)?

Also, all this bother is to allow for concurrent writes to a shared data structure you're reading, which seems like a bad idea?

A viable solution for Python concurrency

Posted Oct 17, 2021 2:35 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> Also, all this bother is to allow for concurrent writes to a shared data structure you're reading, which seems like a bad idea?

Well, it depends on how you look at it. If you're, say, an application developer, it's easy enough to say "that seems like a bad idea, let's not do it." And in that case (i.e. the case where there's no actual mutation of shared state), the performance penalty seems pretty minor to me (if I'm understanding the process you describe correctly, and in comparison to the existing CPython overhead, which is quite large to start with). On the other hand, if you're a language developer, you don't really have the luxury of saying "this seems like a bad idea, let's not support it." Somebody may have already deployed a real-world application which depends on this behavior, so you can't unilaterally break it unless you want another backcompat flag day a la Python 3. Nobody wants to do that again.

A viable solution for Python concurrency

Posted Oct 19, 2021 11:12 UTC (Tue) by njs (subscriber, #40338) [Link]

I suspect the big win here is for data structures that have heavy read traffic from multiple threads, but few or no writes.

In this case the extra reads are pretty cheap, because the relevant cache lines can be kept loaded in "shared" state (in the MESI sense), and the branches are predictable. OTOH if reads had to acquire a lock, that would trigger cache line bouncing and be *very* expensive.

Python modules and classes are mutable data structures that are almost always read, rather than mutated. This pattern of heavy cross-thread read traffic to a mostly-immutable object is thus ubiquitous in Python, and it makes sense to optimize for it.

A viable solution for Python concurrency

Posted Oct 14, 2021 21:07 UTC (Thu) by fw (subscriber, #26023) [Link]

One potential way out is to preserve the single-threaded-ness of individual interpreters, but allow multiple interpreters with independent locks in a single process. Sharing objects between interpreters will need special object types, though.

At least a few years ago, the GIL was not specific to sub-interpreters. People have actually patched glibc to be able to load libpython as many times as they like to get more independent interpreters, each with its own global data structures and locks. CPython is very different from (for example) the Lua reference implementation in this regard.

A viable solution for Python concurrency

Posted Oct 14, 2021 21:36 UTC (Thu) by atnot (subscriber, #124910) [Link]

I personally believe much more in this approach than GIL removal. Efficient message-passing and subinterpreters would let python easily use many threads without the many concurrency issues that are bound to show up in python code with fully shared memory.

A viable solution for Python concurrency

Posted Oct 14, 2021 23:56 UTC (Thu) by ms-tg (subscriber, #89231) [Link]

Please note that this is how Ruby is approaching this problem, see “Ractors”

https://github.com/ruby/ruby/blob/master/doc/ractor.md

I wonder if there should be a more socially encouraged expectation of cross pollination of ideas like this between the language communities?

A viable solution for Python concurrency

Posted Oct 15, 2021 16:52 UTC (Fri) by colesbury (subscriber, #137476) [Link]

> I wonder if there should be a more socially encouraged expectation of cross pollination of ideas like this between the language communities?

Nathaniel already wrote below about the similarity between Ractors and sub-interpreters, but I'll give a few more examples specific to this project of ideas (or code) taken from other communities:

- Biased reference counting (originally implemented for Swift)
- mimalloc (originally developed for Koka and Lean)
- The design of the internal locks is taken from WebKit (https://webkit.org/blog/6161/locking-in-webkit/)
- The collection thread-safety adapts some code from FreeBSD (https://github.com/colesbury/nogil/blob/nogil/Python/qsbr.c)
- The interpreter took ideas from LuaJIT and V8's ignition interpreter (the register-accumulator model from ignition, fast function calls and other perf ideas from LuaJIT)
- The stop-the-world implementation is influenced by Go's design (https://github.com/golang/go/blob/fad4a16fd43f6a72b6917ef...)

A viable solution for Python concurrency

Posted Oct 15, 2021 1:18 UTC (Fri) by njs (subscriber, #40338) [Link]

This has been a major goal of a lot of cpython contributors for maybe 5? years now. Lots of internal refactoring, etc. See PEP 554.

I don't really see the point – subinterpreters aren't meaningfully more efficient than subprocesses, and they're way more complicated and fragile. But some people are more hopeful.

A viable solution for Python concurrency

Posted Oct 15, 2021 1:35 UTC (Fri) by njs (subscriber, #40338) [Link]

Also, the biggest challenge for GIL removal is legacy C extensions. And C extensions are inherently shared across the whole process.

For python code, the VM itself can take care of isolating subinterpreters from each other – for python code this is "free".

But for C code, each C extension has to manually implement subinterpreter isolation for its own internal state. So here, the subinterpreter model doesn't really simplify anything – in fact it makes it more complicated.

A viable solution for Python concurrency

Posted Oct 15, 2021 3:19 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Well, it depends on how those C extensions have been implemented. If they have large amounts of global, mutable state, then this is going to be a nightmare because you will need to make separate copies of that state for each subinterpreter, and then you have to plumb context pointers into everything, which is probably not easy.

As far as I can tell, the most likely problem for "well designed" C extensions (those which do not have substantial global mutable state) is the PyGILState_* API, which doesn't (currently) support sub-interpreters (and by my read, it is unlikely to gain such support without an API break, because there's no plausible way for it to infer which PyInterpreterState* you want it to use). You mostly only need to use those functions if you're calling into the Python interpreter from foreign threads (i.e. threads which were not created by Python). The simplest solution is probably for the Python people to provide a version of those functions which accepts an explicit PyInterpreterState* argument. Currently, for example, PyGILState_Ensure() just grabs "the" interpreter state out of a global variable.[1] This still would require some plumbing on the C extension side, of course, because extensions would still need to actually pass those arguments. In the meantime, you can use PyThreadState_New() and friends to do the same job by hand, but it's a lower-level API and more cumbersome to work with (especially if you want to tackle the reentrancy issues which PyGILState_* is intended to solve for you).

[1]: https://github.com/python/cpython/blob/main/Python/pystat...
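For reference, the usual PyGILState pattern under discussion (real API calls; note that PyGILState_Ensure() always attaches to the main interpreter) looks like this:

#include <Python.h>

/* Call a Python object from a thread that Python did not create. */
static void call_from_foreign_thread(PyObject *callable)
{
    PyGILState_STATE gstate = PyGILState_Ensure();  /* attach thread, take GIL */
    PyObject *result = PyObject_CallNoArgs(callable);
    if (result == NULL)
        PyErr_Print();                              /* report the exception */
    Py_XDECREF(result);
    PyGILState_Release(gstate);                     /* restore prior state */
}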

A viable solution for Python concurrency

Posted Oct 14, 2021 22:42 UTC (Thu) by Paf (subscriber, #91811) [Link]

This is really remarkable work - it will be interesting to see if it survives the wood chipper of real world C extensions, but my lord it would be nice to get real parallelism in Python.

A viable solution for Python concurrency

Posted Oct 15, 2021 7:47 UTC (Fri) by tchernobog (subscriber, #73595) [Link]

Python 4, here it comes? I say this only half-jokingly.

A viable solution for Python concurrency

Posted Oct 15, 2021 19:49 UTC (Fri) by renejsum (guest, #124634) [Link]

My thoughts exactly :-)

Python 4 - No new Python features, but a fast concurrent Python VM for the next 30-40 years.

A viable solution for Python concurrency

Posted Oct 15, 2021 10:43 UTC (Fri) by t-v (guest, #112111) [Link]

Having seen Sam's stellar work on PyTorch, it would seem that he looked at quite a few of the larger libraries in the data science corner - PyTorch obviously (which uses PyBind11), but also numpy, scipy, pandas, scikit-learn, scikit-image, and JAX - to inform the discussion he offers on extensions.

I'm sure there will be corner cases that need to be fixed in a lot of places. But people will only hit those corner cases once they are already using multiple threads, and given the current state of Python multithreading, that should be quite a filter.

P.S.: Regarding alternative approaches: In the PyTorch context, people have been exploring quite a few aspects of "how to get Python-based DL models to run efficiently" from the TorchScript JIT/interpreter for a subset of Python, to multiprocessing to subinterpreter-style multithreading, so to me this looks like part of this larger picture.

Java and Go

Posted Oct 14, 2021 23:55 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

The design document says:

> We also aim for similar safety guarantees to Java and Go -- the language doesn’t prevent data races, but data races in user code do not corrupt the VM state.

This is confusing. Java and Go have, as I think I may have written on LWN before, quite different safety guarantees under concurrency.

Java doesn't just promise that your data race won't blow up the VM, it constrains the results of the race to just causing values touched by the race to have unpredictable but still legal values. Your Java program with a race may have strange behaviour, perhaps too strange to reasonably debug, but it's still well defined.

Go is quite different; I don't even know if you can say it won't blow up the VM. If your race touches a complex Go object such as a container, all bets are off: the program behaviour is undefined. That seems to me like it could very much include "corrupt VM state".

Java and Go

Posted Oct 15, 2021 16:14 UTC (Fri) by colesbury (subscriber, #137476) [Link]

> This is confusing. Java and Go have, as I think I may have written on LWN before, quite different safety guarantees under concurrency.

Yes, the sentence is confusing. It's difficult to write about memory models precisely in a short amount of space. Go doesn't have "undefined behavior", but you're right that races (on multiword structures) can lead to memory corruption. I'll work on updating this section.

Russ Cox has a great article on Go's memory model: https://research.swtch.com/gomm#overview

Russ describes Go's approach to data races as a middle ground between Java's and C++'s approaches. The no-GIL CPython also occupies a middle ground, but it's a different one from Go. The motivation is similar: to make "errant programs more reliable and easier to debug."

The no-GIL CPython needs some behavior that is stronger than Java's guarantees. For example, "dict" and "list" need to behave reasonably even with racy accesses. (In Java, racy accesses to HashMap can lead to infinite loops.) Other behavior is weaker. For example, CPython doesn't have the sandbox security of the JVM.

Java and Go

Posted Oct 16, 2021 1:26 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

It seems to me that _arbitrary memory corruption_ (Russ' words) is in fact undefined behaviour. If it just trashed the affected objects, you could imagine some actual constraints, albeit looser than Java's, for what will happen next, but there are no limits at all to what gets smashed, including in the virtual machine infrastructure and surrounding userspace data such as control structures. That's surely undefined behaviour.

I'm also dubious about this "need to behave reasonably" with an infinite loop given as an example of Java weakness. An infinite loop is a completely safe and reasonable outcome from a race, even though of course it's undesirable and a bug in your program.

My misfortunate::Always type in Rust (a type which claims to be totally ordered and yet whose instances stubbornly insist on always giving the same result for every comparison, even to themselves) can induce infinite loops in some completely reasonable algorithms, and that's totally safe; it's just annoying, or it would be if you did it by mistake.

Anyway as I also concluded previously for Java, "easier to debug" turned out to be a foolish hope. I believe Ocaml has an improvement on Java's approach (bounding the inconsistency in time) but I don't hold out much more hope for that either, and clearly non-GIL Python is not going to be better. Assume that programs with data races are just broken and unsalvageable and the reality won't disappoint you.

Java and Go

Posted Oct 16, 2021 6:05 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

Undefined behavior is behavior "for which the standard imposes no requirements." Since Go is not a formally standardized language, the concept of undefined behavior does not really apply to it as such, because there is no standard imposing any requirements in the first place.

You could certainly call it "unpredictable behavior," but it's likely more precise and informative to just call it "heap corruption" and be done with it.

A viable solution for Python concurrency

Posted Oct 15, 2021 6:05 UTC (Fri) by eru (subscriber, #2753) [Link]

>With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads. Since the owner has exclusive access to its count, increments and decrements can be done with fast, non-atomic instructions. Any other thread accessing the object will use atomic operations on the shared reference count.

This left me wondering how a thread can efficiently determine if it is the creator of the object. Won't this require an additional field for the thread id, and checking it?

A viable solution for Python concurrency

Posted Oct 15, 2021 10:42 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Simple escape analysis?

A viable solution for Python concurrency

Posted Oct 15, 2021 15:38 UTC (Fri) by colesbury (subscriber, #137476) [Link]

> This left me wondering how a thread can efficiently determine if it is the creator of the object. Won't this require an additional field for the thread id, and checking it?

Yes, there's an additional field in the object header.

https://github.com/colesbury/nogil/blob/84c1b14af40d406a7...

The comparison is inexpensive (but not free). In the interpreter, the current thread id is kept in a local variable so the comparison compiles to something like:

cmp %r10, (%rax)   # owner thread id (at %rax) vs. current thread id (%r10)
jne .L316          # not the owner: fall back to the atomic shared count

A viable solution for Python concurrency

Posted Oct 18, 2021 11:06 UTC (Mon) by winden (subscriber, #60389) [Link]

If the result of the check is almost always the same (eg: 99% of such checks are against a local object) then the comparison + possible jump will cost nothing because its outcome can be predicted by the CPU branch predictors.

A viable solution for Python concurrency

Posted Oct 18, 2021 14:07 UTC (Mon) by jreiser (subscriber, #11027) [Link]

The (amortized) cost is never "free". Some microarchitectures can meld the decoded CMP with adjacent Jxx into the same cycle, but many separate them into two cycles. Superscalar execution might share cycles with other operations which have independent data flow, but the sharing for CMP and/or Jxx reduces opportunities for still other operations. The CMP in the example fetches an operand from memory, which has a delay of 3 or 4 cycles even from the closest data cache. Static scheduling (by the compiler) to overlap all of those cycles is rare (and inhibits opcode melding), and dynamic scheduling of decoded micro-operations cannot always achieve 100% utilization. Regardless of scheduling, execution requires electrical energy; even nanoWatts add up.

A viable solution for Python concurrency

Posted Oct 26, 2021 3:56 UTC (Tue) by dancol (guest, #142293) [Link]

> predicted by the CPU branch predictors.

The amount of state maintained by a CPU branch predictor is finite. If the CPU is predicting this branch, it's *not* predicting some other branch.

A viable solution for Python concurrency

Posted Oct 17, 2021 3:22 UTC (Sun) by ericonr (guest, #151527) [Link]

While it looks like a fine project, seeing mimalloc used here is certainly worrying. Python is a project written mainly in portable C and reasonably easy to port, adding mimalloc into the mix wouldn't be great... For now they only seem to require arch specific asm for a memory barrier (so simple stuff), but I'd be concerned about them reaching for greater performance and potentially increasing those requirements.

And the same thing goes for CPython; it seems most atomic operations are using compiler intrinsics, which is nice to see, but there's at least one spot in https://github.com/colesbury/nogil/blob/nogil/Include/pya... that's using inline asm to implement _Py_atomic_uintptr_is_zero. Even if only as an optimization, it's harder to read and understand, and it's not a good thing to add into python code base, IMO.

And since we are talking memory allocators,

> The interpreter's memory allocator has been replaced with mimalloc, which is thread-safe, fast, and is able to easily support garbage-collection operations.

Any malloc worth its salt (and usable in the real world) is at least thread-safe.

Furthermore, one concern that seems to permeate the mimalloc issue tracker is incorrect overloading of the libc malloc, making applications crash because of invalid pointers. Is there a chance memory allocated by python could be freed by a C extension or something like that, or vice versa? If python ended up using the version of mimalloc's free that doesn't crash on invalid pointers because of that, it'd be a huge loss in terms of security/hardening.

A viable solution for Python concurrency

Posted Oct 18, 2021 8:38 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

The Python/C API explicitly bans the use of malloc() and free() for PyObject* pointers. It is legal to use malloc and free for your own private C (non-PyObject*) objects, or for C objects which the Python/C API will only temporarily borrow (e.g. to make a copy), but extension authors are encouraged to use the PyMem_Raw* functions for that purpose instead. See https://docs.python.org/3/c-api/memory.html

In general, Python objects have their allocation and deallocation behavior determined by their associated type object, which is itself a Python object (at the REPL, you can retrieve the type of an object with type(foo) - at the C level, it's a struct field on the object). Type objects have struct fields which point to allocation and deallocation functions, which are called when instances are allocated and deallocated. See https://docs.python.org/3/c-api/typeobj.html#pytypeobject.... Normally, these are called automatically by PyObject_New and PyObject_Del (or Py_DECREF() when the reference count hits zero), so a C extension running around calling malloc and free explicitly would be very weird and well outside how the API is meant to be used.
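As a hedged example of those rules (the API names are real, the helper itself is invented): private scratch memory goes through PyMem_Raw*, while PyObject allocation is left to the type's slots:

#include <Python.h>

static PyObject *make_thing(void)
{
    double *scratch = PyMem_RawMalloc(64 * sizeof(double)); /* private C data */
    if (scratch == NULL)
        return PyErr_NoMemory();
    /* ... fill scratch and compute something from it ... */
    PyObject *result = PyLong_FromLong(42);  /* allocated via the type, not malloc() */
    PyMem_RawFree(scratch);                  /* matching Raw free */
    return result;                           /* later freed via Py_DECREF, never free() */
}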

A viable solution for Python concurrency

Posted Oct 19, 2021 4:47 UTC (Tue) by ericonr (guest, #151527) [Link]

Thanks for the explanation, that's great to know. Anyone whose extension broke with mimalloc would already be broken with current cpython.

A viable solution for Python concurrency

Posted Oct 22, 2021 17:40 UTC (Fri) by flussence (subscriber, #85566) [Link]

I've got to second this reaction. Never thought I'd see the day!

A viable solution for Python concurrency

Posted Oct 26, 2021 18:27 UTC (Tue) by t-v (guest, #112111) [Link]

On his blog, Łukasz Langa (Python developer in residence) reports on discussions between Python devs and Sam:
https://lukasz.langa.pl/5d044f91-49c1-4170-aed1-62b6763e6...

A viable solution for Python concurrency

Posted Nov 1, 2021 0:39 UTC (Mon) by SomeOtherGuy (guest, #151918) [Link]

It may seriously be easier to do a Python 4 and just change the API for C extensions... this pains me (especially as I hate how 2.7 was handled), but we're used to 2.7 and 3.x being around... why not just go for 4?

It can't be more painful surely?

BTW I get a choice (and it's one that must be made) between:
Python 3.4.3
and
Python 2.7.6

I am aware these are both ancient, this means I can't use half the new stuff anyway, so maybe that'd cause pain?

But yeah I'm used to python/python3 - why not python4....

It really couldn't be worse and this is worth getting right


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds