An easy way to concurrency and parallelism with Python stdlib (bitecode.dev)
105 points by olsgaarddk 8 months ago | 68 comments



I've recently been doing what should be straightforward subprocess work in Python, and the experience is infuriatingly bad. There are so many options for launching subprocesses and communicating with them, and each one has different caveats and undocumented limitations, especially around edge cases: processes crashing, timing out, needing to be killed, getting stuck in native code outside of the VM, etc.

For example, some high-level options include Popen, multiprocessing.Process, multiprocessing.Pool, futures.ProcessPoolExecutor, and huge frameworks like Ray.

multiprocessing.Process includes some pickling magic, and you can pick from multiprocessing.Pipe and multiprocessing.Queue, but you need to use either multiprocessing.connection.wait() or select.select() to watch the process sentinel simultaneously in case the process crashes. Which one? Well, connection.wait() will not be interrupted by an OS signal. It's unclear why I would ever use connection.wait() then. Is there some tradeoff I don't know about?
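(For reference, the sentinel pattern looks roughly like this. A minimal sketch, where worker() is a stand-in for the real child function:)

    import multiprocessing as mp
    from multiprocessing.connection import wait

    def worker(conn):
        conn.send("result")
        conn.close()

    if __name__ == "__main__":
        parent_conn, child_conn = mp.Pipe()
        proc = mp.Process(target=worker, args=(child_conn,))
        proc.start()
        # Wait on both the pipe and the sentinel, so a crashed
        # child doesn't leave the parent blocked on recv() forever.
        ready = wait([parent_conn, proc.sentinel], timeout=5)
        if parent_conn in ready:
            print(parent_conn.recv())
        elif proc.sentinel in ready:
            print("child exited without sending anything")
        proc.join()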

For my use cases, process reuse would have been nice, to be able to reuse network connections and such (useful even for a single process). Then you're looking at either multiprocessing.Pool or futures.ProcessPoolExecutor. They're very similar, except some bug fixes have gone into futures.ProcessPoolExecutor but not multiprocessing.Pool because...??? For example, if your subprocess exits uncleanly, multiprocessing.Pool will just hang, whereas futures.ProcessPoolExecutor will raise a BrokenProcessPool and the pool will refuse to do any more work (both of these are unreasonable behaviors IMO). Timing out and forcibly killing the subprocess is its own adventure for each of these too. I don't care about a result anymore after some time period passes, and the worker may be stuck in C code, so I just want to whack the process and move on, but that is not trivial with these.
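(A minimal sketch of the ProcessPoolExecutor side of this, with crashy_task as a stand-in for a worker that dies uncleanly:)

    import os
    from concurrent.futures import ProcessPoolExecutor
    from concurrent.futures.process import BrokenProcessPool

    def crashy_task():
        os._exit(1)  # simulate an unclean worker death

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=1) as executor:
            future = executor.submit(crashy_task)
            try:
                future.result(timeout=5)
            except BrokenProcessPool:
                # The pool is now permanently broken and must be recreated.
                print("worker died; pool refuses further work")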

What a nightmarish mess! So much for "There should be one--and preferably only one--obvious way to do it"...my God.

(I probably got some details wrong in the above rant, because there are so many to keep track of...)

My takeaway: there is no "easy way to [process] parallelism" in Python. There are many different ways to do it, and you need to know the nuances of each, and how they address your requirements, to know whether you can reuse an existing high-level implementation or need to write your own low-level one.


To be clear, Popen is very different from all the other options. That's for running other programs.

Process is low-level and is almost never what you want. Pool is "mid-level" and usually isn't what you want either. ProcessPoolExecutor is usually what you want; it is the "one obvious way to do it". That's not at all clear from the docs, though.

The one obvious way to do it, in general, is: subprocess.run for running external processes, subprocess.Popen for async interaction with external processes, and concurrent.futures.ProcessPoolExecutor for Python multiprocessing.
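In sketch form (work_fn is a placeholder, and the shell commands assume a Unix-like system):

    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def work_fn(x):  # placeholder for your own function
        return x * x

    if __name__ == "__main__":
        # 1. Run an external program and wait for it:
        subprocess.run(["echo", "hello"], check=True)

        # 2. Interact with an external program while it runs:
        proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate(input=b"hello")

        # 3. Run Python functions across multiple processes:
        with ProcessPoolExecutor() as executor:
            print(list(executor.map(work_fn, range(10))))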

Your other complaints about actually using the multiprocessing stuff are 100% valid. Error handling, cancellation, etc. is all very difficult. Passing data back and forth between the main process and subprocesses is not trivial.

But I do want to emphasize that there is a somewhat-well-defined gradient of lower- and higher-level tools in the standard library, and your "obvious way to do it" should usually start at the higher end of that gradient.

You might also want to look into the third-party Joblib library, which makes process parallelism a lot less painful for the straightforward use case of "run a function on a large amount of data, using multiple OS processes."
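(The basic Joblib pattern looks like this; a sketch, with slow_fn standing in for your own function:)

    from joblib import Parallel, delayed

    def slow_fn(x):
        return x * x

    if __name__ == "__main__":
        # n_jobs controls the number of worker processes.
        results = Parallel(n_jobs=4)(delayed(slow_fn)(x) for x in range(100))
        print(results[:5])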


You're saying ProcessPoolExecutor is the "one obvious way to do it" but mention how the docs don't make this clear... That makes it not obvious. And since Python has built-in async/await keywords for asyncio now, shouldn't that be the one obvious correct way of doing concurrency?

Imagining I'm a newbie to Python concurrency, I Googled "concurrency in Python" and picked the first result from the official docs. https://docs.python.org/3/library/concurrency.html It's a list of everything except asyncio, and the first item on the list is the low-level `threading` :S At least that page mentions ThreadPoolExecutor, queue, and asyncio as alternatives, but I'm still lost on what is the correct way.


I would say that criticizing the documentation is distinct from criticizing the language itself. The Python standard library has had documentation problems for a while now, but realistically so does pretty much every other programming language. If you want to learn how to do things, you need a book.

If you're still interested in the topic, async/await is intended to be single-threaded by default, but it has some support for pushing jobs off to threads or processes, using a concurrent.futures Executor internally. Normally, if I want process parallelism, I don't bother with async/await; I go for the more explicit solution.
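(That hand-off looks roughly like this. A sketch, with blocking_fn as a placeholder:)

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def blocking_fn(x):
        return x * x

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as executor:
            # Run the blocking function in the pool without blocking the loop.
            result = await loop.run_in_executor(executor, blocking_fn, 21)
            print(result)

    if __name__ == "__main__":
        asyncio.run(main())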

Again, I think there is a very clear sense of the one obvious way to do it in the minds of many python programmers, but it might not be expressed well in the official documentation. This would be a great opportunity to write a book, for example.


The language itself has the issue of there being many separate ways to do equivalent things here. And async/await wasn't in the language until recently, so people got used to the old ways.

I didn't need a book to deal with JavaScript concurrency, for example. JS had its event loop as far back as I can remember, and users were getting concurrency via that without really understanding it anyway. It got promises a while back. Async/await is just syntactic sugar on top of promises. There's hardly any other way to do things. NodeJS has extensions for subprocesses and worker threads, but you don't end up there unless you're looking for a way to do parallelism, and even then you can get by with small Stack Overflow examples.


Coming from C#, I honestly HATE Python's multiprocessing and multithreading. Hell, I hate its async/await. I learned recently that in one mode, it pipes the values between processes, which made it impossible to use when passing along large pandas dataframes. I'm sure half of it is just my own lack of knowledge of Python's abilities, but C# sure made it easier. lol


For Pandas, I recommend the third-party Joblib library: https://joblib.readthedocs.io/en/latest/


That looks a bit low-level. I would look at Dask and Polars. Dask scales to multiple processes on a single machine and to multiple machines, and its dataframe looks pretty close to pandas's. Polars uses multiple cores on the same machine better than pandas (not sure about Dask), but has a significantly different dataframe API than pandas. Polars, primarily through lazy frames, enables much higher single-core performance too.


Yeah, it's "low level" in the sense that you still have to manually chunk up your data. I agree that Dask, Polars, etc are better if you want a more transparent distributed computing experience. Joblib is great for if you already have working single-process code and you just want to parallelize it. It's what Scikit Learn uses internally, for example.

But as it pertains to the original thread topic, it's still fairly high-level. I'd consider it a bit higher-level than concurrent.futures, for example.


The mess mostly reflects the difficulty of supporting a programmatic interface to processes in a cross-platform manner, coupled with the actual complexity of parallel processing.

You didn't mention the recommended high-level option for subprocess: subprocess.run.


Sure, that exists too, but it blocks until the process exits. I suppose I could run it in a separate thread, but now I've got another dimension of complexity to deal with, and it's unclear whether I can stream output from the subprocess.

There are other things I didn't mention that get thrown around too such as os.system() and os.fork().
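(On the streaming question: you can iterate a Popen pipe line by line, for what it's worth. A rough sketch, using ping as an example long-running command on a Unix-like system:)

    import subprocess

    proc = subprocess.Popen(["ping", "-c", "3", "localhost"],
                            stdout=subprocess.PIPE, text=True)
    # Iterating the pipe yields output lines as the child produces them.
    for line in proc.stdout:
        print("got:", line, end="")
    proc.wait()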


For my use cases the asyncio wrapper makes it really easy to stack up a bunch of tasks, let the OS do its thing, and then collect the results when they're ready.


Other high-level languages do a better job with this.


This off topic rant is the top comment? Really?


Title: "An easy way to concurrency and parallelism with Python"

Content: basically how to use ThreadPoolExecutor

Comment: Concurrency and parallelism aren't easy in Python.

How is this off-topic?


It's mostly about communicating with subprocess and Popen, which has little to do with this article, other than being Python modules you can use with concurrent futures. It's also long-winded and beside the point. Shouldn't be the top comment.


Subprocessing is the only way to do full parallelism in Python. The title includes parallelism, and the article explains how threading can achieve it specifically when your CPU-bound portion is inside C modules (which release the GIL), but it's relevant to mention how you do parallelism in the general case.


Problems communicating with a crashy subprocess are not what I came to the thread for. I certainly haven't had issues like that myself.

If you wrote the subprocess, add quality and some communication hooks. If you didn't, get a better one or kill -9 it regularly.


Python manages to combine the worst parts of high-level and low-level programming when it comes to multithreading. It uses multiple OS-level threads with the associated overhead (no green threading like in JS), yet the GIL negates actual parallelism; you still have to use mutexes about as much as in C (no event loop like JS), and the whole API feels low-level and convoluted. It's like they tried to abstract things but gave up halfway through.

I like Python in general, but I avoid it for any kind of concurrent programming other than simple fan-out-fan-in.


JS doesn't have green threads, just a single-threaded event loop and context switching via promises or async/await. Green threads imply parallelism implemented in user space (à la Go's goroutines or JVM virtual threads). JS is not parallel, only concurrent.


Greenthreading implies concurrency, not parallelism, implemented in userspace rather than the OS. Two Java/whatever greenthreads atop a single OS thread cannot run in parallel. It's switching contexts (as managed in userspace) during I/O waits, just like the JS event loop. Call goroutines greenthreading and some Golang users will disagree, but it is that too.

Some environments support "M:N" greenthreading, mapping multiple userspace threads to multiple (but fewer) OS threads that are running in parallel, but that's not a required feature of greenthreading. In this case, the OS is still doing the parallelism.

And Python is not greenthreading because the concurrency comes from the OS, since each Py thread maps 1:1 to an OS thread.


> Two Java/whatever greenthreads atop a single OS thread cannot run in parallel

Well.. yes. Actually that makes sense!

I guess I just never thought of them as green threads in JS because you don't interact with them as an object like you can in other languages.


"Greenthreading" is a weird term because it often refers to a very old Java implementation that was removed in 2000. And the Wikipedia article on the term is plain wrong in some ways.


JS has green threads.

Green threads imply only concurrency, not parallelism.

(JS also has parallelism, via worker threads, FYI)


I know this article is all about the stdlib, but having built multiple multiprocess applications with Python, I eventually built a library, QuasiQueue, to simplify the process. I've written a few applications with it already.

https://github.com/tedivm/quasiqueue


Thank you for the article.

I use multiprocessing and I am looking forward to the GIL removal.

I would really like library writers and parallelism experts to think about modelling computation in such a way that arbitrary programs written in this notation can be sped up without async, parallelism, or low-level synchronization primitives spreading throughout the codebase and increasing its cognitive load for everybody.

If you're doing business programming and you're using Python Threads or Processes directly, I think we're operating at the wrong level of abstraction, because our tools are not sufficiently abstract. (It's not your error; it's just not ideal where our industry is at.)

I am not an expert, but parallelism, coroutines, and async are a hobby I journal about all the time. I think a good approach to parallelism is to split your program into a tree dataflow and never synchronize. Shard everything.

If I have a single integer value and I want to scale the throughput of updates to it by the number of hardware threads in my multicore, SMT CPU, I can split the integer by that number and apply updates in parallel. (If you have £1000 in a bank account and 8 hardware threads, you split the account into 8 bank accounts each storing £125; then you can serve 8 transactions simultaneously.) Periodically, those threads post their values to another buffer (a ringbuffer), and a thread servicing that ringbuffer sums them all for a global view. This provides an eventually consistent view of an integer without slowing down throughput.
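(A toy sketch of that sharding idea, using threads and one counter slot per worker:)

    import threading

    NUM_SHARDS = 8
    shards = [0] * NUM_SHARDS  # one slot per worker, no shared lock

    def worker(shard_id, updates):
        for _ in range(updates):
            shards[shard_id] += 1  # only this thread touches this slot

    threads = [threading.Thread(target=worker, args=(i, 100_000))
               for i in range(NUM_SHARDS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # A periodic reader would sum the shards for an eventually
    # consistent global view; here we just sum at the end.
    print("total:", sum(shards))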

Unfortunately multithreading becomes a distributed system and then you need consensus.

I am working on barriers inspired by bulk synchronous parallel, where you have parallel phases and synchronization phases, and on an async pipeline syntax (see my previous HN comments for notes on this async syntax).

My goal would be that business logic can be parallelised without you needing to worry about synchronization.


At least in what I do, I find 80% of my parallelism needs covered by pool.map/pool.imap_unordered. Of the remaining 20%, 80% can mostly be solved by communicating through queues or channels (though admittedly this is smoother in Erlang or Rust than in Python).

Of course that's not true for everything, and depending on the domain, tree dataflows can also be great. I remember them being very popular in GPGPU tasks because synchronization is very costly there.


If I need concurrency these days, I just write it in Golang. My primary use for Python was one off scripts for cloud management / automation tasks. Today I write maybe 70% Golang and 30% Python.


I agree. The team behind Go has thought a lot about concurrency right from the start, and it really shows.


Concurrency in Go is just so easy and powerful.


Does not seem exactly like an easy way to me. Not super hard, surely, but not "easy". More like "moderately easy to do and a bit annoying to implement".

Probably 20% of the effort shown in this post could have been expended to just write something very similar in Golang, and it would have taken less time, too. Because the way I see it, this is trying to emulate futures / promises (and it looks like it's succeeding, at least on the surface). That can spiral out of comfortable, maintainable code territory pretty quickly.

But especially for something as trivial as a crawler, I don't see the appeal of Python. You've got a good number of languages with lower friction for doing parallel stuff nowadays (Golang, Elixir, Rust if you want to cry a bit; hell, even Lua has some parallel libraries nowadays; Zig, Nim...).


If you already know Python, the advice in this article is certainly a lot easier and more actionable than "just learn Go or Rust or Zig instead".


Certainly. My point is that if you need to write that much code and/or do that much research, at some point the effort of doing it in another language becomes less than that of insisting on a tool that's not designed for the job.

It happened with me and many other former colleagues.

Though obviously, everyone decides for themselves when that point comes -- or if it comes at all.


The point of the article is a handful of lines. The rest is accoutrement like the URL list and timing code. But sure, if

    tasks = {}
    for url in URLs:
        future = executor.submit(fetch_url, url)
        tasks[future] = url
bothers you, this is perfectly Pythonic (some would say even more so than the original):

    tasks = {executor.submit(fetch_url, url): url for url in URLs}


I have found another way in the documentation for `concurrent.futures`. You can use `Executor.map` (https://docs.python.org/3/library/concurrent.futures.html#co...). It eliminates the need to wait on the futures explicitly.

  from concurrent.futures import ThreadPoolExecutor

  # fetch_url and URLs are as defined in the article.
  def main():
      with ThreadPoolExecutor(max_workers=len(URLs)) as executor:
          # map() yields results in input order, so zip() pairs them back up.
          for url, title in zip(URLs, executor.map(fetch_url, URLs)):
              print(f"URL: {url}\nTitle: {title}")
The default value of `max_workers` since Python 3.8 has been

  min(32, os.cpu_count() + 4)
You should probably avoid

  max_workers=len(items_to_process)
It will not save memory or CPU time when you have few items (workers are created as necessary) and may waste memory when you have many.


As a side note, using a future as a map key struck me as a bit weird, though perfectly valid. It'd be more natural IMO to use a list for the futures and have the fetch_url function return a (url, result) tuple. Or use the url as the map key and just iterate over the map items instead of using as_completed on the keys.


What “much research” are you talking about?

The amusing part is that the article calls out two groups of people into which your advice falls.

It's not that much code; it's about 4 lines, creating a "pool" and calling a wait on future objects.

This is a perfect solution for Python developers who have been perfectly happy using Django for years, and just need to scrape some API or download multiple files.

No, they shouldn’t switch to a different language the moment they need to optimize something embarrassingly parallel, they can see whether a simple solution in stdlib is enough, and probably move on.


If this is too much research for you, wait until you have to deal with the many problems of Go channels in the real world. (Reasonably well-known though controversial article: [1]) Don't even get me started on Rust. Concurrency and parallelism is hard.

Yes, I've written a shit ton of code in all aforementioned languages.

[1] https://www.jtolio.com/2016/03/go-channels-are-bad-and-you-s...


> and/or do that much research,

Is reading the official docs section on concurrency lots of research?


Python is surprisingly bad at parallelism for a data workhorse.

What TFA doesn't say is that process pools are quite fragile, certainly on Mac and Windows, but on Linux too. They rely on pickling, which is also fragile.

That said, asyncio works surprisingly well if what you want is non-blocking execution and you are happy with 1 CPU. But no parallel speedup.


After learning Clojure, I found Python's approach to concurrency terrible at best. Clojure is extremely easy to understand: it has basically three solutions, each for clear and defined use cases. It's much easier to judge what you should implement for a particular problem and how to do it.

I wish Python had similar solutions.


This is a really nice little guide. Much thanks to the author. Sometimes you just need to hit a bunch of APIs independently and don't want to switch your entire architecture around to do so.


Awesome article; I use this approach a lot in a Python project at work, and it's quite nice how simple it is. I'm trying to replicate the Python code in Rust and it's slightly slower, though that's more than likely my fault, as I'm new to Rust.


Is there a way to add tasks with independent timeouts using only the Python stdlib? I was reading a piece of code yesterday that had `pebble` as dependency and it looked like it was only needed for the `pool.schedule(..., timeout=1)`.
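(With just the stdlib, the closest thing is a timeout on waiting for the result, not on the task itself; the worker keeps running after the timeout, which is exactly the gap pebble fills. A sketch, with slow_task as a placeholder:)

    from concurrent.futures import ProcessPoolExecutor, TimeoutError

    def slow_task(n):
        return n * n

    if __name__ == "__main__":
        with ProcessPoolExecutor() as executor:
            future = executor.submit(slow_task, 3)
            try:
                print(future.result(timeout=1))
            except TimeoutError:
                # Gives up waiting, but does NOT kill the worker.
                print("timed out")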


The article shows how to use ThreadPoolExecutor, but that's not fully parallel. For that, you need multiprocessing.Pool, which is slightly easier to use anyway, unless your data happens to be non-pickle-able.


When dinking around in IPython, you need to use a fork of the "multiprocessing" library called "multiprocess".

Parallelism in a Notebook isn't for everyone, but how would these changes affect it?


> For those, Python actually comes with pretty decent tools: the pool executors.

Delusion level: max.

You have to be in a very, very bad place when this marginal improvement over the absolute horror show that bare Process offers seems "pretty decent".

Python doesn't have good tools for parallelism / concurrency. It doesn't have average tools. It doesn't have even bad tools. It has the worst. Though, unfortunately, it's not the only language in this category :(


> It doesn't have even bad tools. It has the worst.

> It's not the only language in this category

Soo....not the worst? :) Or tied for it?

What do you find difficult/wrong with pool executors?

Also, you reference "Process", but FYI the article talks about multiple threads, not multiple processes.


Pool executors only solve one kind of use case. They aren't a general solution to concurrency+parallelism.

And they're still the worst version of this pattern, because despite using multiple OS-level threads with all the associated overhead, the GIL prevents most of the real parallelism from happening. And if you want full parallelism, you have to use multiprocessing.Pool, which adds pickling overhead and incompatibility.


> Soo....not the worst? :)

Yeah... I know, it's hard to imagine that there could be more than one worst. But, as I have to practice these things with my 4-year-old, I've become more patient with adults who don't get the concept, too.

Imagine you are in a class and the teacher gives everyone a pencil and a sheet of paper. Now you want to find out who has the shortest pencil. All the students compare their pencils, and it turns out that several pencils are exactly the same length, and those are at the same time the shortest ones. So, more than one student has the shortest pencil.

But it doesn't end there. Not all sets which define a "greater than" relationship are totally ordered. In such sets it's possible to have multiple different smallest elements. Trivially, in a set that's not ordered, every element is the smallest.

> What do you find difficult/wrong with pool executors?

Difficult? -- I don't know.

Wrong? -- Well, it's pretty worthless... does it make it wrong? -- That's up to you to decide.

The idea of threads is bad for many reasons; one in particular is how exceptions in threads are handled. But this isn't unique to Python. Python just made a bad decision to use threads in a language that's supposed to be "safe". Python's thread implementation craps its pants when dealing with many aspects of threads, for example thread-local variables. Since threads are objects in Python, you'd expect local variables to be properties on those objects... well, the mechanism for using them is just idiotic and nothing like what you would expect. When it comes to interacting with "native" code from Python, you'd expect some interaction with Python's scheduler so that the native code can portion its own execution, allow Python to interrupt it, etc., but there's nothing of the kind.

Even though we haven't even gotten to the pools yet: pools, obviously, don't address any of the thread-related problems. If anything, they only amplify them. Specifically, the pool from the concurrent package is worse than its relative from the multiprocessing package because it uses "futures". The whole idea of "futures" is somehow broken in Python because of the never-ending bugs related to deadlocking. It's been repeatedly "fixed", but every now and then deadlocks still happen. Here's the latest one I know of: https://bugs.python.org/issue46464

I've gone down the rabbit hole once of trying to make a native module work with Python threads... there's no good way to do it, and pools, be it from concurrent.futures or from multiprocessing, are both very bad for many reasons. I was hoping to give users the ability to control how parallel my native code is through the tools Python already exposes, but that turned out to be such a disaster that I've given up on the idea. Python's thread wrappers are worthless for native code that wants to actually execute concurrently: they are only designed to execute Python code, non-concurrently. Like I already mentioned, Python has no infrastructure to communicate its scheduling decisions to native code, no thread-safety in memory allocation, and the code is overall poorly written (as in missing const, other imprecise typing, memory-inefficient data structures)... there are no benefits to using that vs. rolling your own. Only struggle with bad decisions.


If it's the worst, how is it not the only language in this category?

How do you rank C, Perl, JavaScript, PHP, ... parallelism compared to execution pool + futures here? The absolute MAX WORST?


It's possible to have more than one worst. In a totally ordered collection, this happens if you have two or more equal elements which happen to be worse than or equal to every other element. In a partially ordered collection, there can be groups of elements that are not comparable with each other, and so you can have multiple distinct worst elements.

Trivially, in a collection that has no "worse than" relation you can define one that doesn't compare them at all, and declares them all "incomparable" -- which, again, would make them all worst.

Bonus question: can you imagine a collection where there is no worst element?

> How do you rank C, Perl, JavaScript, PHP

Well, none of these languages have their own parallelism / concurrency aspect. (Except Perl 5, maybe? I'm not really familiar with the language.) They all rely on the system running them to do the parallelism.

So... all of these will go roughly into the same bin as Python?

Some languages have libraries that would allow them to do better (eg. you have PThreads in C), but that's not the function of the language.


Well, execution pools don't even really do parallelism, just concurrency for the most part (thanks, GIL). And JavaScript handles concurrency far better than Python; its event loop is designed for just that. JS and Python can also use subprocesses for true parallelism.

C and Java threads are better than Python because, uh, they can actually run in parallel. Rust adds convenience and safety on top, plus its own event loops. Golang has Goroutines. Erlang has some very powerful solution that I don't remember.

IDK about PHP and Perl, barely touched them. Maybe they're worse than Python for this. Everything else isn't. Python was not originally built with these use cases in mind, which is totally fine, but I'm not going to pick Python if I'm doing complex concurrency/parallelism. For simple process pools, Python is good enough.


While I'm painting with a broad brush, I'll guess that the parent divides languages into two categories: "Go" and "the worst".


Not really.

There are languages which put at least some effort into parallelism / concurrency (and Go would be one of those, along with Java, Erlang, Ada, Clojure, even C++ to some extent).

Then there are languages which outsource everything to the system: eg. Lua, Ruby. They have a way in the language to make a system call, and so if the system can create multiple processes or multiple threads, they can use that.

There are languages that have no way to do even that. For example JavaScript, XSLT or SQL. Surprisingly, a lot of these handle concurrency very well in their runtimes due to automatic parallelization performed by the runtime (not the language).

Python is the language that has neither design nor discernible goals. It has some parallelism in the language, but it's lacking important components, which are then either outsourced to the system or aren't there at all. Because of the randomness of the "design decisions", Python also cannot be reliably automatically parallelized, nor do developers have reliable tools for building parallel applications, especially not in a modular way, because different modules may not agree on the way to go about parallelization.

Python has always been a language where you need to be really knowledgeable about things outside of Python and about Python's own implementation details to get ahead. If all you knew was Python, you'd do very poorly. This is in contrast to languages like Java, which put a great deal of attention towards making sure that even the dumbest programmer will not screw up too much.

Now the people who know how to use Python well are gone, and the language is gradually transforming into Java. But it still has a very long road ahead before it can do enough hand-holding for the losers. Parallelism is one of those things where the goals are very far and so far, mostly, unattainable.


Maybe I missed it, but how do the threads circumvent the GIL?

> When a request is waiting on the network, another thread is executing.

I'm guessing this is the meat, but what controls that? What other operations allow the GIL to switch to another thread?


Python functions implemented in C can release the GIL when they're doing something that doesn't directly involve manipulating Python objects, and then re-acquire it when they're done: https://docs.python.org/3/c-api/init.html#thread-state-and-t...

All I/O functions in the standard library do this when blocked.
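(You can see the effect from pure Python with a toy timing test; time.sleep releases the GIL the same way blocking I/O does:)

    import time
    from concurrent.futures import ThreadPoolExecutor

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as executor:
        for _ in range(4):
            executor.submit(time.sleep, 1)
    # The four 1-second sleeps overlap, so this prints ~1s, not ~4s.
    print(f"elapsed: {time.perf_counter() - start:.2f}s")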


This is a far better explanation than the usual opaque "it's concurrent but not parallel", which I'd argue isn't even correct (because two C calls on separate threads are running in parallel if they don't hold the GIL). Or "it's multithreading but not multiprocessing", which misses the point.


My understanding is that the GIL is typically released around blocking operations. Aside from allowing actual concurrency for I/O-heavy programs, it would be a trivial way to deadlock if it weren't.


So what is the consensus view on how to do parallelism in Python if you just have something that is embarrassingly parallel, with no communication between processes necessary?


People here mention Pool, and I've seen it many times. It's this: https://docs.python.org/3/library/multiprocessing.html#intro...

  from multiprocessing import Pool

  def f(x):
      return x*x

  if __name__ == '__main__':
      with Pool(5) as p:
          print(p.map(f, [1, 2, 3]))
This forks off 5 worker processes; f(x) runs fully in parallel for each input. The inputs and outputs are sent between processes via pickling.


If you have a task that is easy to split, make a Python script that runs on a subset of the task, split the work into N subsets, and write one output per process. Once they all complete, join the outputs. Maybe https://docs.dask.org/en/stable/ is a good start if you want a framework. I don't think there's a consensus; it depends on the problem.


Don't see MPI. Can skip this article.


MPI is not in the "standard" library, or am I behind on the moving-fast-and-breaking-things?


"Stdlib" is the keyword.


The easiest and most modern way is simply to use asyncio...


Which is concurrent but not parallel.


Asyncio is about reading from multiple sockets. It is not a general tool for dealing with concurrency in programming.



