Async Python is not faster (calpaterson.com)
472 points by haybanusa on June 12, 2020 | 353 comments



How is this result surprising? The point of coroutines isn't to make your code execute faster, it's to prevent your process sitting idle while it waits for I/O.

When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting. Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.
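A minimal sketch of what I mean, with asyncio.sleep standing in for the slow external API (the one-second delay and the three calls are made up for illustration):

    import asyncio
    import time

    async def call_slow_api(name):
        await asyncio.sleep(1.0)  # stand-in for an external REST call that takes ~1s
        return f"{name}: done"

    async def main():
        start = time.perf_counter()
        # all three "requests" wait concurrently on one event loop,
        # so total wall-clock time is ~1s rather than ~3s
        results = await asyncio.gather(*(call_slow_api(f"api-{i}") for i in range(3)))
        print(results, f"{time.perf_counter() - start:.2f}s")

    asyncio.run(main())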


I think it is surprising to a lot of people who do take it as read that async will be faster.

As I describe in the first line of my article, I don't think that people who think async is faster have unreasonable expectations. It seems very intuitive to assume that greater concurrency would mean greater performance - at least on some measure.

> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting.

I'm afraid I also don't think you have this right conceptually. An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.

Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and IMO the proper domain of) operating systems.


> Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and IMO the proper domain of) operating systems.

Sure, the operating system can find other things to do with the CPU cycles when a program is IO-locked, but that doesn't help the program you're currently trying to run.

> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.

You're right. "Arbitrary programs will run faster" is not the promise of Python async.

Python async does help a program work faster in the situation that phodge just described (waiting for web requests, or waiting for a slow hardware device), since the program can do other things while waiting for the locked IO (unlike a Python program that does not use async and could only proceed linearly through its instructions). That's the problem that Python asyncio purports to solve. It is still subject to the Global Interpreter Lock, meaning it's still bound to one thread. (Python's multiprocessing library is needed to overcome the GIL and break a program out into multiple processes, at the cost that cross-process communication now becomes expensive.)


> unlike a Python program that does not use async and could only proceed linearly through its instructions

This isn't how it works. While Python is blocked in I/O calls, it releases the GIL so other threads can proceed. (If the GIL were never released then I'm sure they wouldn't have put threading in the Python standard library.)

> Python's multiprocessing library is needed to overcome the GIL

This is technically true, in that if you are running up against the GIL then the only way to overcome it is to use multiprocessing. But blocking IO isn't one of those situations, so you can just use threads.

The comparison here is not async vs just doing one thing. It's async vs threads. I believe that's what the performance comparison in the article is about, and if threads were as broken as you say then obviously they wouldn't have performed better than asyncio.
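To make that concrete, here's a small sketch showing that blocking calls release the GIL, so a plain thread pool overlaps the waits (time.sleep stands in for a blocking socket read; the numbers are illustrative):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def blocking_io(n):
        time.sleep(1.0)  # releases the GIL while waiting, like a blocking read would
        return n

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(blocking_io, range(10)))
    print(results, f"{time.perf_counter() - start:.2f}s")  # ~1s total, not ~10s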

--------

As an aside, many C-based extensions (e.g. numpy and scipy) also release the GIL when performing CPU-bound computations. So the GIL doesn't even prevent you from using multithreading in CPU-heavy applications, so long as they are relatively large operations (e.g. a few calls to multiply huge matrices together would parallelise well, but many calls to multiply tiny matrices together would heavily contend the GIL).
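And a hedged sketch of the "few large operations" case (the matrix sizes are arbitrary, and whether you see a real multi-core speedup depends on your BLAS build):

    import threading
    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    def multiply():
        # the BLAS call behind np.dot runs in C with the GIL released,
        # so these threads can actually execute on separate cores
        np.dot(a, b)

    threads = [threading.Thread(target=multiply) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()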


> > Python's multiprocessing library is needed to overcome the GIL

> No it's not, just use threads.

I just wanted to expand on this a little to describe some of the downsides to threads in Python.

Multi-threaded logic can be (and often is) slower than single-threaded logic because threading introduces overhead of lock contention and context switching. David Beazley did a talk illustrating this in 2010:

https://www.youtube.com/watch?v=Obt-vMVdM8s

He also did a great talk about coroutines in 2015 where he explores threading and coroutines a bit more:

https://www.youtube.com/watch?v=MCs5OvhV9S4&t=525s

In workloads that are often "blocked", like network calls or I/O-bound workloads, threads can provide similar benefits to coroutines but with overhead. Coroutines seek to provide the same benefit without as much overhead (no lock contention, fewer context switches by the kernel).

These probably aren't the right guidelines for everyone, but I generally use them when thinking about concurrency (and pseudo-concurrency) in Python:

- Coroutines where I can.

- Multi-processing where I need real concurrency.

- Never threads.


Ah ha! Now we have finally reached the beginning of the conversation :-)

The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief. There seems to be some debate about whether the article is really representative, and I'm very curious about that. But then the parent comment to mine took us on an unproductive detour based on the misconception that Python threads don't work at all. Now your comment has brought up that original belief again, but you haven't referenced the article at all.


I didn't reference the article because I provided more detailed references which explore the difference between threads and coroutines in Python to a much greater depth.

The point of my comment is to say that neither threads nor coroutines will make Python _faster_ in and of themselves. Quite the opposite, in fact: threading adds overhead, so unless the benefit is greater than the overhead (e.g. lock contention and context switching) your code will actually be net slower.

I can't recommend the videos I shared enough, David Beazley is a great presenter. One of the few people who can do talks centered around live coding that keep me engaged throughout.

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief.

The disconnect here is that this article isn't claiming that asyncio is not faster than threads. In fact the article only claims that asyncio is not a silver bullet guaranteed to increase the performance of any Python logic. The misconception it is trying to clear up, in its own words, is:

> Sadly async is not go-faster-stripes for the Python interpreter.

What I, and many others are questioning is:

A) Is this actually as widespread a belief as the article claims it to be? None of the results are surprising to me (or apparently some others).

B) Is the article accurate in its analysis and conclusion?

As an example, take this paragraph:

> Why is this? In async Python, the multi-threading is co-operative, which simply means that threads are not interrupted by a central governor (such as the kernel) but instead have to voluntarily yield their execution time to others. In asyncio, the execution is yielded upon three language keywords: await, async for and async with.

This is a really confusing paragraph because it seems to mix terminology. A short list of problems in this quote alone:

- Async Python != multi-threading.

- Multi-threading is not co-operatively scheduled; threads are indeed interrupted by the kernel (context switches between threads in Python do actually happen).

- Asyncio is co-operatively scheduled and pieces of logic have to yield to allow other logic to proceed. This is a key difference between asyncio (coroutines) and multi-threading (threads); see the sketch after this list.

- Asynchronous Python can be implemented using coroutines, multi-threading, or multi-processing; it's a common noun but the quote uses it as a proper noun leaving us guessing what the author intended to refer to.
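To make the co-operative scheduling point concrete, a small sketch (timings illustrative): a coroutine only gives the event loop a chance to run other work when it hits an await, so a blocking call stalls everything.

    import asyncio
    import time

    async def well_behaved():
        await asyncio.sleep(1.0)   # yields: other coroutines run during the wait

    async def badly_behaved():
        time.sleep(1.0)            # never yields: the whole event loop stalls here

    async def main():
        start = time.perf_counter()
        await asyncio.gather(well_behaved(), well_behaved())
        print(f"co-operative: {time.perf_counter() - start:.2f}s")  # ~1s

        start = time.perf_counter()
        await asyncio.gather(badly_behaved(), badly_behaved())
        print(f"blocking: {time.perf_counter() - start:.2f}s")      # ~2s

    asyncio.run(main())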

Additionally, there are concepts and interactions missing from the article, such as the GIL's scheduling behavior. In the second video I shared, David Beazley actually shows how the GIL gives compute-intensive tasks higher priority, which is the opposite of typical scheduling priorities (e.g. kernel scheduling) and leads to adverse latency behavior.

So looking at the article as a whole, I don't think the underlying intent of the article is wrong, but the reasoning and analysis presented is at best misguided. Asyncio is not a performance silver bullet; it's not even real parallelism. Multi-processing and use of C extensions are the bigger bang for the buck when it comes to performance. But none of this is surprising; it's expected if you really think about the underlying interactions.

To rephrase what you think I thought:

> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO.

Is actually more like:

> Asyncio is more efficient than multi-threading in Python. It is also comparatively more variable than multi-processing, particularly when dealing with workloads that saturate a single event loop. Neither multi-threading nor asyncio is actually parallel in Python; for that you have to use multi-processing to escape the GIL (or some C extension which you trust to safely execute outside of GIL control).

---

Regarding your aside example, it's true some C extensions can escape the GIL, but oftentimes it's with caveats and careful consideration of where/when you can escape the GIL successfully. Take, for example, this scipy cookbook regarding parallelization:

https://scipy-cookbook.readthedocs.io/items/ParallelProgramm...

It's not often the case that using a C extension will give you truly concurrent multi-threading without significant and careful code refactoring.


For single processes you’re right, but this article (and a lot of the activity around asyncio in Python) is about backend webdev, where you’re already running multiple app servers. In this context, asyncio is almost always slower.


> But blocking IO isn't one of those situations, so you can just use threads.

Threads and async are not mutually exclusive. If your system resources aren't heavily loaded, it doesn't matter, just choose the library you find most appropriate. But threads require more system overhead, and eventually adding more threads will reduce performance. So if it's critical to thoroughly maximize system resources, and your system cannot handle more threads, you need async (and threads).
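For illustration, a sketch of the "async and threads" combination via the standard run_in_executor bridge (the pool size and the fake blocking call are placeholders):

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    def blocking_call(n):
        time.sleep(1.0)  # e.g. a library that only offers a blocking API
        return n * 2

    async def main():
        loop = asyncio.get_running_loop()
        with ThreadPoolExecutor(max_workers=4) as pool:
            # the event loop stays free for other coroutines while the
            # blocking work runs on the small thread pool
            results = await asyncio.gather(
                *(loop.run_in_executor(pool, blocking_call, n) for n in range(4))
            )
        print(results)

    asyncio.run(main())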


> But threads require more system overhead, and eventually adding more threads will reduce performance.

Absolutely false. OS threads are orders of magnitude lighter than any Python coroutine implementation.


> OS threads are orders of magnitude lighter than any Python coroutine implementation.

But Python threads, which add extra weight on top of a cross-platform abstraction layer over the underlying OS threads, are not lighter than Python coroutines.

You aren't choosing between Python threads and unadorned OS threads when writing Python code.


You're absolutely right.

I'm pointing out that this is a Python problem, not a threads problem, a fact which people don't understand.


Everyone has been discussing the relative performance of different techniques within Python; there is neither a basis to suggest that people don't understand which aspects of that are Python-specific, nor a reason to think that distinction is even particularly relevant to the discussion.


Okay, then let's do a bakeoff! You outfit a Python webserver that only uses threads, and I'll outfit an identical webserver that also implements async. The server that handles the most requests/sec wins. I get to pick the workload.


FWIW, I have a real world Python3 application that does the following:

- receives an HTTP POST multipart/form-data that contains three file parts. The first part is JSON.

- parses the form.

- parses the JSON.

- depending upon the JSON accepts/rejects the POST.

- for accepted POSTs, writes the three parts as three separate files to S3.

It runs behind nginx + uwsgi, using the Falcon framework. For parsing the form I use streaming-form-data which is cython accelerated. (Falcon is also cython accelerated.)

I tested various deployment options. cpython, pypy, threads, gevent. Concurrency was more important than latency (within reason). I ended up with the best performance (measured as highest RPS while remaining within tolerable latency) using cpython+gevent.

It's been a while since I benchmarked and I'm typing this up from memory, so I don't have any numbers to add to this comment.


Each Linux thread has at least an 8MB virtual memory overhead. I just tested it, and was able to create one million coroutines in a few seconds and with a few hundred megabytes of overhead in Python. If I created just one thousand threads, it would take possibly 8 gigs of memory.


Virtual memory is not memory. You're effectively just bumping an offset, there's no actual allocations involved.

> ...it would take possibly 8 gigs of memory.

No. Nothing is 'taken' when virtual memory is requested.


But have you tried creating one thousand OS threads and measuring the actual memory usage? If I recall correctly, I read an article explaining that threads on Linux don't actually claim their 8MB each quite so literally. I need to recheck that later.


You're right, I've read the same. Using Python 3.8, creating 12,000 threads with `time.sleep` as the target clocks in at 200MB of resident memory.
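Roughly how you might reproduce that kind of measurement (Linux-only via the resource module; you may need to raise ulimits, and the exact figure depends on platform and thread stack size):

    import resource
    import threading
    import time

    threads = [threading.Thread(target=time.sleep, args=(60,), daemon=True)
               for _ in range(12_000)]
    for t in threads:
        t.start()

    # ru_maxrss is reported in kilobytes on Linux; the 8MB per thread mentioned
    # above is a *virtual* stack reservation, which is why resident memory stays low
    print(f"peak RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.0f} MB")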


People seem to keep misunderstanding the GIL. It's the Global Interpreter Lock, and it's effectively the lock around all Python objects and structures. This is necessary because Python objects have no thread ownership model, and the development team does not want per-object locks.

During any operation that does not need to modify Python objects, it is safe to unlock the GIL. Yielding control to the OS to wait on I/O is one such example, but doing heavy computation work in C (e.g. numpy) can be another.


To clarify that the CPython devs aren't being arbitrary here: There have been attempts at per-object or other fine-grained locking, and they appear to be less performant than a GIL, particularly for the single-threaded case.

Single-threaded performance is a major issue as that's most Python code.


Yes. I expect generic fine-grained locking, especially per-object locks, to be less performant for multi-threaded code too, as locks aren't cheap, and even with the GIL, lock overhead could still be worse than a good scheduler.

Any solution which wants to consider per-object locking has to consider removing refcounting, or locking the refcount bits separately, as locking/unlocking objects to twiddle their refcounts is going to be ridiculously expensive.

Ultimately, the Python ownership and object model is not conducive to proper threading, as most objects are global state and can be mutated by any thread.


Instead of disagreeing with some of your vague assertions I'll just make my own points for people that want to consider using async.

Workers (which usually live in a new process) are not efficient. Processes are extremely expensive and subjectively harder for exception handling. Threads are lighter weight, and even better are async implementations that use a much more scalable FSM to handle this.

Offloading work to things not subject to the GIL is the reason async Python got so much traction. It works really well.


This is often a point of confusion for people when looking at Erlang, Elixir or Go code. Concurrency beyond leveraging available CPUs doesn't really add any advantage.

On the web, where the bulk of your application code's time is spent waiting on APIs, database queries, external caches or disk I/O, it creates a dramatic increase in the capacity of your server if you can do that waiting with minimal RAM overhead.

It's one of the big reasons I've always wanted to see Techempower create a test version that continues to increase concurrency beyond 512 (as high as maybe 10k). I think it would be interesting.


> On the web, where the bulk of your application code's time is spent waiting on APIs, database queries, external caches or disk I/O, it creates a dramatic increase in the capacity of your server if you can do that waiting with minimal RAM overhead.

Python doesn't block on I/O.


Of course it does.


It releases the GIL.

Edit: sorry I can do better.

If you're using async/await to not block on I/O while handling a request, you still have to wait for that I/O to finish before you return a response. Async adds overhead because you schedule the coroutine and then resume execution.

The OS is better at scheduling these things because it can do it in kernel space in C. Async/await pushes that scheduling into user space, sometimes in interpreted code. Sometimes you need that, but very often you don't. This is in conflict with "async the world", which effectively bakes that overhead into everything. This explains the lower throughput, higher latency, and higher memory usage.

So effectively this means "run more processes/threads". If you can only have 1 process/thread and cannot afford to block, then yes async is your only option. But again that case is pretty rare.


From my understanding the primary use of concurrency in Erlang/Elixir is for isolation and operational consistency. Do you believe that not to be the case?


The primary use of concurrency in Erlang is modelling a world that is concurrent.

If you go back to the origins of Erlang, the intent was to build a language that would make it easier to write software for telecom (voice) switches; what comes out of that is one process for each line, waiting for someone to pick up the line and dial or for an incoming call to make the line ring (and then connecting the call if the line is answered). Having this run as an isolated process allows for better system stability --- if someone crashes the process attached to their line, the switch doesn't lose any of the state for the other lines.

It turns out that a 1980s design for operational excellence works really well for (some) applications today. Because the processes are isolated, it's not very tricky to run them in parallel. If you've got a lot of concurrent event streams (like users connected via XMPP or HTTP), assigning each a process makes it easy to write programs for them, and because Erlang processes are significantly lighter weight than OS processes or threads, you can have millions of connections to a machine, each with its own process.

You can absolutely manage millions of connections in other languages, but I think Erlang's approach to concurrency makes it simpler to write programs to address that case.


That's a big topic. The shortest way I can summarize it though:

Immutable data, per-process heap isolation and a lack of shared state, combined with supervision trees made possible by extremely low-overhead concurrency, and preemptive scheduling to prevent any one process from taking over the CPU... create that operational consistency.

It's a combination of factors that have gone into the language design that make it all possible though. Very big and interesting topic.

But it does create a significant capacity increase. Here's a simple example with websockets.

https://dockyard.com/blog/2016/08/09/phoenix-channels-vs-rai...


this is true for compiled languages like the ones you mention, but generally does not apply to Python, which as an interpreted language tends to add CPU overhead for even the smallest tasks.


A CPU can do billions of operations every second. When you have 200ms for every request, that overhead is not that large; you're still blocked by I/O.


for local services like databases, real world benchmarks disagree.


You should add that you mean just databases. I've just looked at your profile and as I understand it's your focus.

I built a service that was making a lot of requests. Enough that at some point we ran into the 65k connection limit for basic Linux polling (we needed to switch to kpoll). Some time after that we ran out of other resources, and switching from threads to threads+greenlets really solved our problem.


>... is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things.

This is very true, especially when actual work is involved.

Remember, the kernel uses the exact same mechanism to have a process wait on a synchronous read/write as it does for a process issuing epoll_wait. Furthermore, isolating tasks into their own processes (or, sigh, threads) allows the kernel scheduler to make much better decisions, such as scheduling fairness and QoS to keep the system responsive under load surges.

Now, async might be more efficient if you serve extreme numbers of concurrent requests from a single thread if your request processing is so simple that the scheduling cost becomes a significant portion of the processing time.

... but if your request processing happens in Python, that's not the case. Your own scheduler implementation (your event loop) will likely also end up eating some resources (remember, you're not bypassing anything, just duplicating functionality), and is very unlikely to be as smart or as fair as that of the kernel. It's probably also entirely unable to do parallel processing.

And this is all before we get into the details of how you easily end up fighting against the scheduler...


Yeah except nodejs will beat flask in this same exact benchmark. Explain that.


CPython doesn't have a JIT, while node.js does. If you want to compare apples to apples, try looking at Flask running on PyPy.


Ed: after reading the article, I guess it's safe to say that everything below is false :)

---

I'd guess the c++ event loop is more important than the jit?

Maybe a better comparison is quart (with eg uvicorn)

https://pgjones.gitlab.io/quart/

https://www.uvicorn.org/

Or Sanic / uvloop?

https://sanicframework.org/

https://github.com/MagicStack/uvloop


Plain sanic runs much faster than the uvicorn-ASGI-sanic stack used in the benchmark, and the ASGI API in the middle is probably degrading other async frameworks' performance too. But then this benchmark also has other major issues, like using HTTP/1.0 without keep-alive in its Nginx proxy_pass config (keep-alive again has a huge effect on performance, and would be enabled on real performance-critical servers). https://sanic.readthedocs.io/en/latest/sanic/nginx.html


Interesting, thank you. I wasn't aware nginx was so conservative by default.

https://nginx.org/en/docs/http/ngx_http_proxy_module.html#pr...


You're not completely off. There might be issues with async/await overhead that would be solved by a JIT, but also if you're using asyncio, the first _sensible_ choice to make would be to swap out the default event loop with one actually explicitly designed to be performant, such as uvloop's one, because asyncio.SelectorEventLoop is designed to be straightforward, not fast.

There's also the major issue of backpressure handling, but that's a whole other story, and not unique to Python.

My major issue with the post I replied to is that there are a bunch of confounding issues that make the comparison given meaningless.


The database is the bottleneck. JIT or even C++ shouldn't even be a factor here. Something is wrong with the python implementation of async await.


If I/O-bound tasks are the problem, that would tend to indicate an issue with the I/O event loop, not with Python and its async/await implementation. If the default asyncio.SelectorEventLoop is too slow for you, you can subclass asyncio.AbstractEventLoop and implement your own, such as building one on top of uvloop. And somebody's already done that: https://github.com/MagicStack/uvloop
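For reference, swapping in uvloop is roughly this much code (a sketch; application code using async/await is unchanged, and any speedup depends entirely on the workload):

    import asyncio
    import uvloop  # pip install uvloop

    # replace the default pure-Python SelectorEventLoop with the libuv-based loop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    async def main():
        await asyncio.sleep(0)

    asyncio.run(main())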

Moreover, even if there's _still_ a discrepancy, unless you're profiling things, the discussion is moot. This isn't to say that there aren't problems (there almost certainly are), but that you should get as close as possible to an apples-to-apples comparison first.


When I talk about async await I'm talking about everything that encompasses supporting that syntax. This includes the I/O event loop.

So really we're in agreement. You're talking about reimplementing python specific things to make it more performant, and that is exactly another way of saying that the problem is python specific.


No, we're not in agreement. You're confounding a bunch of independent things, and that is what I object to.

It's neither fair nor correct to mush together CPython's async/await implementation with the implementation of asyncio.SelectorEventLoop. They are two different things and entirely independent of one another.

Moreover, it's neither fair nor correct to compare asyncio.SelectorEventLoop with the event loop of node.js, because the former is written in pure Python (with performance only tangentially in mind) whereas the latter is written in C (libuv). That's why I pointed you to uvloop, which is an implementation of asyncio.AbstractEventLoop built on top of libuv. If you want to even start with a comparison, you need to eliminate that confounding variable.

Finally, the implementation matters. node.js uses a JIT, while CPython does not, giving them _much_ different performance characteristics. If you want to eliminate that confounding variable, you need to use a Python implementation with a JIT, such as PyPy.

Do those two things, and then you'll be able to do a fair comparison between Python and node.js.


Except the problem here is that those tests were bottlenecked by IO. Whether you're testing C++, pypy, libuv, or whatever it doesn't matter.

All that matters is the concurrency model, because the application he's running is barely doing anything except IO; anything outside of IO becomes negligible, because after enough requests those sync worker processes will all be spending the majority of their time blocked on an IO request.

The basic essence of the original claim is that sync is not necessarily better than async for all cases of high IO tasks. I bring up node as a counter example because that async model IS Faster for THIS same case. And bringing up node is 100% relevant because IO is the bottleneck, so it doesn't really matter how much faster node is executing as IO should be taking most of the time.

Clearly and logically the async concurrency model is better for these types of tasks so IF tests indicate otherwise for PYTHON then there's something up with python specifically.

You're right, we are in disagreement. I didn't realize you completely failed to understand what's going on and felt the need to do an apples to apples comparison when such a comparison is not Needed at all.


No, I understand. I just think that your comparison with _node.js_ when there are a bunch of confounding variables is nonsense. Get rid of those and then we can look at why "nodejs will beat flask in this same exact benchmark".


> I just think that your comparison with _node.js_ when there are a bunch of confounding variables is nonsense

And I'm saying all those confounding variables you're talking about are negligible and irrelevant.

Why? Because the benchmark test in the article is a test where every single task is 99% bound by IO.

What each task does is make a database call AND NOTHING ELSE. Therefore you can safely say that for either a python or Node request, less than 1% of a single task will be spent on processing while 99% of the task is spent on IO.

You're talking about scales on the order of 0.01% vs. 0.0001%. Sure maybe node is 100x faster, but it's STILL NEGLIGIBLE compared to IO.

It is _NOT_ nonsense.

You Do not need an apples to apples comparison to come to the conclusion that the problem is Specific to the python implementation. There ARE NO confounding variables.


> And I'm saying all those confounding variables you're talking about are negligible and irrelevant.

No, you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent. You're assuming the issue lies in one place (Python's async/await implementation) when there are a bunch of possible contributing factors _which have not been ruled out_.

Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.

Show me actual numbers. Prove there are no confounding variables. You made an assertion that demands evidence and provided none.


>Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.

It's data science that is causing this data-driven attitude to invade people's minds. Do you not realize that logic and assumptions play a big role in drawing conclusions WITHOUT data? In fact, if you're a developer you know about a way to DERIVE performance WITHOUT a single data point or benchmark or profile. You know about this method, you just haven't been able to see the connections, and your model of how this world works (data-driven conclusions only) is highly flawed.

I can look at two algorithms and I can derive with logic alone which one is O(N) and which one is O(N^2). There is ZERO need to run a benchmark. The entire theory of complexity is a mathematical theory used to assist us in arriving AT PERFORMANCE conclusions WITHOUT EVIDENCE/BENCHMARKS.
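For instance, by inspection alone you can classify the two duplicate checks below as O(N^2) and O(N) (a made-up illustration):

    def has_duplicate_quadratic(items):
        # compares every pair: O(N^2)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if items[i] == items[j]:
                    return True
        return False

    def has_duplicate_linear(items):
        # single pass with a set: O(N)
        seen = set()
        for item in items:
            if item in seen:
                return True
            seen.add(item)
        return False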

Another thing you have to realize is the importance of assumptions. Things like: 1 + 1 = 2 will always remain true, and a profile or benchmark run on a specific task is an accurate observation of THAT task. These are both reasonable assumptions to make about the universe. They are also the same assumptions YOU are making every time you ask for EVIDENCE and benchmarks.

What you aren't seeing is this: The assumptions I AM making ARE EXACTLY THE SAME: reasonable.

>you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent

Let's take it from the top shall we.

I am making the assumption that tasks done in parallel ARE Faster than tasks done sequentially.

The author specifically stated he made a server where each request fetches a row from the database. And he is saying that his benchmark consisted of thousands of concurrent requests.

I am also making the assumption that for thousands of requests and thousands of database requests MOST of the time is spent on IO. It's similar to deriving O(N) from a for loop. I observe the type of test the author is running and I make a logical conclusion on WHAT SHOULD be happening. Now you may ask: why is it a reasonable assumption that IO specifically takes up most of the time of a single request? Because all of web development is predicated on this assumption. It's the entire reason why we use inefficient languages like python, node or Java to run our web apps instead of C++, because the database is the bottleneck. It doesn't matter if you use python or ruby or C++, the server will always be waiting on the db. It's also a reasonable assumption given my experience working with python and node and databases. Databases are the bottleneck.

Given this highly reasonable assumption, and in the same vein as using complexity theory to derive performance speed, it is highly reasonable for me to say that the problem IS PYTHON SPECIFIC. No evidence NEEDED. 1 + 1 = 2. I don't need to put that into my calculator 100 times to get 100 data points for some type of data driven conclusion. It's assumed and it's a highly reasonable assumption. So reasonable that only an idiot would try to verify 1 + 1 = 2 using statistics and experiments.

Look, you want data and no assumptions? First you need to get rid of the assumption that a profiler and a benchmark are accurate and truthful. Profile the profiler itself. But then you're making another assumption: the profiler that profiled the profiler is accurate. So you need to get me data on that as well. You see where this is going?

There is ZERO way to make any conclusion about anything without making an assumption. And Even with an assumption, the scientific method HAS NO way of proving anything to be true. Science functions on the assumption that probability theory is an accurate description of events that happen in the real world AND even under this assumption there is no way to sample all possible EVENTS for a given experiment so we can only verify causality and correlations to a certain degree.

The truth is blurry and humans navigate through the world using assumptions, logic and data. To intelligently navigate the world you need to know when to make assumptions and when to use logic and when data driven tests are most appropriate. Don't be an idiot and think that everything on the face of the earth needs to be verified with statistics, data and A/B tests. That type of thinking is pure garbage and it is the same misguided logic that is driving your argument with me.


Buddy, you can make all the "logical arguments" you want, but if you can't back them up with evidence, you're just making guesses.


Nodejs is faster than Python as a general rule, anyway. As I understand, Nodejs compiles Javascript, Python interprets Python code.

I do a lot of Django and Nodejs and Django is great to sketch an app out, but I've noticed rewriting endpoints in Nodejs directly accessing postgres gets much better performance.

Just my 2c


CPython, the reference implementation, interprets Python. PyPy interprets and JIT compiles Python, and more exotic things like Cython and Grumpy statically compile Python (often through another, intermediate language like C or Go).

Node.js, using V8, interprets and JIT compiles JavaScript.

Although note that, while Node.js is fast relative to Python, it's still pretty slow. If you're writing web-stuff, I'd recommend Go instead for casually written, good performance.


The comparison of Django against no-ORM is a bit weird, given that rewriting your endpoint in Python without Django or an ORM would also have produced better results, I suppose.


Right but this test focused on concurrent IO. The bottleneck is not the interpreter but the concurrency model. It doesn't matter if you coded it in C++, the JIT shouldn't even be a factor here because the bottleneck is IO and therefore ONLY the concurrency model should be a factor here. You should only see differences in speed based off of which model is used. All else is negligible.

So you have two implementations of async that are both bottlenecked by IO. One is implemented in node. The other in python.

The node implementation behaves as expected in accordance with theory, meaning that for thousands of IO bound tasks it performs faster than a fixed number of sync worker threads (say 5 threads).

This makes sense right? Given thousands of IO bound tasks, eventually all 5 threads must be doing IO and therefore blocked on every task, while the single threaded async model is always context switching whenever it encounters an IO task so it is never blocked and it is always doing something...

Meanwhile the python async implementation doesn't perform in accordance with theory. 5 async workers are slower than 5 sync workers on IO bound tasks. 5 sync workers should eventually be entirely blocked by IO and the 5 async workers should never be blocked ever... Why is the python implementation slower? The answer is obvious:

It's python specific. It's python that is the problem.


JIT compiler.


Bottleneck is IO. Concurrency model should be the limiting factor here.

NodeJS is faster than flask because of the concurrency model and NOT because of the JIT.

The python async implementation being slower than the python sync implementation means one thing: Something is up with python.

The poster implies that with the concurrency model the outcome of these tests are expected.

The reality is, these results are NOT expected. Something is going on specifically with the python implementation.


You mean express.js ?


NodeJS primitives are enough to produce the same functionality as flask without the need for an extra framework.


Async IO was in large part a response to "how can my webserver handle xx thousand connections per second" (or in the case of Erlang "how do you handle millions of phone calls at once"). Starting 15 threads to do IO works great, but once you wait for hundreds of things at once the overhead from context switching becomes a problem, and at some point the OS scheduler itself becomes a problem


Not really. At least on Linux, the scheduler is O(1). There is no difference between one process waiting for 10k connections, or 10k processes waiting for 1 each. And there is hardly a context switch either, if all these 10k processes use the same memory map (as threads do).

I've tested this extensively on Linux. There is no more CPU used for threads vs epoll.

On the other hand, if you don't get the epoll implementation exactly right, you may end up with many spurious calls. E.g. simply reading slow data from a socket in golang on Linux incurs considerable overhead: a first read that is short, another read that returns EWOULDBLOCK, and then a syscall to re-arm the epoll. With OS threads, that is just a single call, where the next call blocks and eventually returns new data.

Edit: one thing I haven't considered when testing is garbage collection. I'm absolutely convinced that up to 10k connections, threads or async doesn't matter, in C or Rust. But it may be much harder to do GC over 10k stacks than over 8.


I recently read a blog post with benchmarks showing that, for well-written C in their use case, async IO only becomes faster than using threads from around 10k parallel connections. (Though the difference was negligible.)

This also seems to be a major motivation behind io_uring.


I don't think this is true? At least, I've never seen the issue of OS threads be that context switching is slow.

The issue is memory usage, which OS threads take a lot of.

Would userland scheduling be more CPU efficient? Sure, probably in many cases. But I don't think that's the problem with handling many thousands of concurrent requests today.


> is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things

Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.


> Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.

This hardly matters when spinning up a few thousand threads. Only memory that's actually used is committed, one 4k page at a time. What is 10MB these days? And that is main memory, while it's much more interesting what fits in cache. At that point it doesn't matter if your data is in heap objects or on a stack.

Add to that the fact that Python stacks are mostly on the heap, the real stack growing only due to nested calls in extensions. It's rare for a stack in Python to exceed 4k.


Languages that do green threads don't do them for memory savings, but to save on context switches when a thread is blocked and cannot run. System threads are scheduled by the OS, green threads by the language runtime, which saves a context switch.


Green threads are scheduled by the language runtime and by the OS. If the OS switches from one thread to another in the same process, there is no context switch, really, apart from the syscall itself which was happening anyway (the recv that blocks and causes the switch). At least not on Linux, where I've measured the difference.


This is not what is happening with flask/uwsgi. There is a fixed number of threads and processes with flask. The threads are only parallel for io and the processes are parallel always.


Which is fine until you run out of uwsgi workers because a downstream gets really slow sometime. The point of async python isn't to speed things up, it's so you don't have to try to guess the right number of uwsgi workers you'll need in your worst case scenario and run with those all the time.


Yep, and this test being shown is actually saying that about 5 sync workers acting on thousands of requests are faster than python async workers.

Theoretically it makes no sense. A task manager executing tasks in parallel to IO instead of blocking on IO should be faster... So the problem must be in the implementation.


> I think it is surprising to a lot of people who do take it as read that async will be faster.

Literally the first thing any concurrency course starts with in the very first lesson is that scheduling and context overhead are not negligible. Is it so hard to expect our professionals to know basic principles of what they are dealing with?


> think it is surprising to a lot of people who do take it as read that async will be faster.

This is because when they are first shown it, the examples are faster, effectively at least, because they get given jobs done in less wallclock time due to reduced blocking.

They learn that, but often don't get told (or work out themselves) that in many cases the difference is so small as to be unmeasurable, or that in other circumstances there can be negative effects (overheads others have already mentioned in the framework, more things waiting in RAM with a part-processed working set which could lead to thrashing in a low-memory situation, greater concurrent load on other services such as a database and the IO system it depends upon, etc.).

As a slightly off-the-topic-of-async example, back when multi-core processing was first becoming cheap enough that it was not just affordable but the default option, I had great trouble trying to explain to a colleague why two IO-intensive database processes he was running were so much slower than when I'd shown him the same process (I'd run them sequentially). He was absolutely fixated on the idea that his four cores should make concurrency the faster option; I couldn't get through to him that in this case the flapping heads on the drives of the time were the bottleneck, and the CPU would be practically idle no matter how many cores it had while the bottleneck was elsewhere.

Some people learn the simple message (async can handle some loads much more efficiently) as an absolute (async is more efficient) and don't consider at all that the situation may be far more nuanced.


> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process

You mean concurrent tasks in the same process?


> I don't think that people who think async is faster have unreasonable expectations

I do.

And I don't think I'm alone nor being unreasonable.


> The point of coroutines isn't to make your code execute faster, it's to prevent your process sitting idle while it waits for I/O.

This is a quintessential example of not seeing the forest for the trees.

The point of coroutines is absolutely to make my code execute faster. If a completely I/O-bound application sits idle while it waits for I/O, I don't care and I should not care because there's no business value in using those wasted cycles. The only case where coroutines are relevant is when the application isn't completely I/O bound; the only case where coroutines are relevant is when they make your code execute faster.

It's been well-known for a long time that the majority of processes in (for example) a webserver are I/O bound, but there are enough exceptions to that rule that we need a solution to situations where the process is bound by something else, i.e. CPU. The classic solution to this problem is to send off CPU-bound processes to a worker over a message queue, but that involves significant overhead. So if we assume that there's no downside to making everything asynchronous, then it makes sense to do that--it's not faster for the I/O bound cases, but it's not slower either, and in the minority-but-not-rare CPU-bound case, it gets us a big performance boost.

What this test is doing is challenging the assumption that there's no downside to making everything asynchronous.

In context, I tend to agree with the conclusion that there are downsides. However, those downsides certainly don't apply to every project, and when they do, there may be a way around them. The only lesson we can draw from this is that gaining benefit from coroutines isn't guaranteed or trivial, but there is much more compelling evidence for that out there.


> The point of coroutines is absolutely to make my code execute faster.

I think rather the point is to make your APPLICATION either finish in less time, or to not take MORE time when given more load.

The code runs as fast as it runs, coroutines notwithstanding.


> > The point of coroutines is absolutely to make my code execute faster.

> I think rather the point is to make your APPLICATION either finish in less time, or to not take MORE time when given more load.

Potato potato.


Well, sure, anything can mean anything if you're willing to redefine what words mean.


The meaning of words is determined by usage. Usage of words is determined by the meaning. This circular definition causes the inherent problem of language: words don't have inherent meaning. The best I can do is to attempt to use words in a way similar to the way that you use words, but I can only ever make an educated guess about how you use words, so it's never going to be perfect.

And from my perspective, I don't think it's unreasonable for me to expect you to try to understand what I'm trying to communicate, rather than attempting to force me to use different words. The burden of communication is shared by both speaker and listener.


“Faster” is not a well defined technical term. It is a piece of natural language that can easily refer to max time, mean time, P99, latency, throughput, price per watt, etc. depending on context.


This is not what this article is about.

The surprising conclusion of the article is that in a realistic scenario, the async web frameworks will output fewer requests/sec than the sync ones.

I'm very familiar with Python concurrency paradigms, and I wasn't expecting that at all.

Add to that zzzeek's article (the guy who wrote SQLA...) stating async is also slower for db access, and this makes async less and less appealing, given the additional complexity it adds.

Now, apart from doing a crawler, or needing to support websockets, I find it hard to justify asyncio. In fact, with David Beazley hinting that you probably can get away with spawning a 1000 threads, it raises more doubts.

The whole point of async was that, at least when dealing with a lot of concurrent I/O, it would be a win compared to threads+multiprocessing. If just by cranking the number of sync workers you get better results for less complexity, this is bad.


As far as I can tell, the main cost of threads is 2-4MB of memory usage for stack space, so async allows saving memory by allowing one thread to process more than one task. A big deal if you have a server with 1GB of memory and want to handle 100,000 simultaneous connections, like Erlang was designed for. But if the server has enough memory for as many threads that are needed to cover the number of simultaneous tasks, is there still a benefit?


Now the $1000 question would be: if you pay for the context switching of BOTH threads and asyncio, having 5 processes, each with 20 threads, and an event loop within each, what happens?

Is the price of the context switching too high, or are you compensating for the weakness of each system by handling I/O concurrently in async, but smoothing out the blocking code outside of the await thanks to threads?

Making a _clean_ benchmark for it would be really hard, though.


Answering my own comment because I can't edit it anymore, but this article has started a heated debate on Twitter.

The author of "black" suggested that the cause of the slowdown may be that asyncio actually starved postgres of resources:

https://twitter.com/llanga/status/1271719783080366086


> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting. Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.

but threads get you the same thing with much less overhead. this is what benchmarks like this one and my own continue to confirm.

People often are afraid of threads in Python because "the GIL!" But the GIL does not block on IO. I think programmers reflexively reaching for Tornado or whatever don't really understand the details of how this all works.


but threads get you the same thing with much less overhead.

That is not true, at least not in general; the whole point of using continuations for async I/O is to avoid the overhead of using threads: the scheduler overhead, the cost of saving and restoring the processor state when switching tasks, the per-thread stack space, and so on.


The scheduler overhead and the cost of context switches are vastly overstated compared to the alternatives. The per-thread stack space in effect has virtually no run-time cost, and starting off at a single 4k page per stack, thousands of threads still only waste a minuscule amount of memory.


async implementations build a scheduler into the runtime, and that's generally slower than the OS' scheduler. 10-100x slower if it's not in C (or whatever).


The GIL might not block on I/O, but the implementation that uses PyObject does need the GIL, no?


I get enraged when articles like this get upvotes. The evidence given doesn't at all negate the reasoning behind using async, which, as you said, is about not having to be blocked by IO, not a freaking throughput test for an unrealistic scenario. It just goes to show the complete lack of understanding of the topic. I wouldn't dare write something up if I didn't 100% grasp it, but the bar is way lower for some others, it seems.


I don't know the async Python specifics, but from what I understand, you don't necessarily need async to handle a large number of IO requests; you can simply use non-blocking IO and check back on it synchronously, either in some loop or at specific places in your program.

The use of async, whether as callbacks, user threads, or coroutines, is a convenience layer for structuring your code. As I understand it, that layer does add some overhead, because it captures an environment and has to later restore it.
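For example, a bare-bones version of that hand-rolled pattern using the stdlib selectors module (an echo-server sketch; the port is arbitrary and partial-write/error handling is omitted):

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    server = socket.socket()
    server.bind(("localhost", 8765))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    # a hand-rolled readiness loop: no async/await, no callbacks,
    # just checking which sockets are ready and handling them in turn
    while True:
        for key, _events in sel.select(timeout=1.0):
            if key.fileobj is server:
                conn, _addr = server.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = key.fileobj.recv(4096)
                if data:
                    key.fileobj.sendall(data)  # echo back
                else:
                    sel.unregister(key.fileobj)
                    key.fileobj.close()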


I'm starting to wonder what the origin story is for titles like this. Have CS programs dropped the ball? Did the author snooze through these fundamentals? Or are they a reaction to coworkers who have demonstrated such an educational gap?

Async and parallel always use more CPU cycles than sequential. There is no question. The real questions are: do you have cycles to burn, will doing so bring the wall-clock time down, and is it worth the complexity of doing so?


I think it's because "async" has been overloaded. The post isn't about what I thought it would be upon seeing the title.

I was thinking this would be about using multiprocessing to fire off two or more background tasks, then handle the results together once they all completed. If the background tasks had a large enough duration, then yeah, doing them in parallel would overcome the overhead of creating the processes and the overall time would be reduced (it would be "faster"). I thought this post would be a "measure everything!" one, after they realized for their workload they didn't overcome that overhead and async wasn't faster.

Upon what the post was about, my response was more like "...duh".


Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.

Waiting for I/O does not usually waste any CPU cycles; the thread is not spinning in a loop waiting for a response, the operating system will simply not schedule the thread until the I/O request has completed.


Sigh. Async is somewhat orthogonal to parallel.

You are making dinner. You start to boil water for the potatoes. While that happens, you prepare the beef. Async.

You and your girlfriend are making dinner. You do the potatoes, she does the beef. Parallel.

You can perhaps see how you could have asynchronous and parallel execution at the same time.

In the context of a Web server, a request is handled by a single Python process (so don’t give me that “OS scheduler can do other things”). Async matters here because your request turnover can be higher, even if the requests/sec remains the same.

In the cooking example, each request gets a single cook. If that cook is able to do things asynchronously, he will finish a single meal faster.

If it were only parallel, you could have more cooks - because they would be less demanding - but they would each be slower.


> In the cooking example, each request gets a single cook. If that cook is able to do things asynchronously, he will finish a single meal faster.

There is a bit of nuance here, in that the async-chef would make any individual meal slower than a sync-chef, once the number of outstanding requests is large. The sync-chef would indeed have overall higher wait times, but each meal would process just as fast as normal (eg. more like a checkout line at a grocery store).

I prefer the grocery store checkout line metaphor for this reason. If a single clerk was "async" and checking out multiple people at once, all the people in line would have roughly the same average wait time for a small line size. A "sync" clerk would have a longer line with people overall waiting longer, but each individual checkout would take the same amount of time once the customer managed to reach the clerk.

This is pertinent when considering the resources utilized during the job. If a sync clerk only ever holds a single database connection, while an async clerk holds one for every customer they try to check out at the same time, the sync clerk will be far more friendly to the database (but less friendly to the customers, when there aren't too many customers at once).


I think you managed to miss the point: the async chef is doing other stuff necessary to fulfill a single order when he can, i.e., while the potatoes are boiling. The sync chef has to wait for the potatoes to boil; only when those are done can he start to fry the beef.

The sync chef doesn't occupy the frying pan when he's boiling potatoes, so in some sense he only really does as much as he can. Having hundreds of sync chefs would likely be more efficient in terms of order volume, _but not order latency._


I do not disagree with that; my point was just that you are not wasting clock cycles. You may, however, as you pointed out, be wasting time while waiting for I/O to complete, time you could potentially make better use of by spending some more clock cycles, while the I/O operation is in progress, on work which is not dependent on the I/O result.


I didn’t mean to disagree with you, I just wanted to put my take on it out there



> How is this result surprising? The point of coroutines isn't to make your code execute faster, it's to prevent your process sitting idle while it waits for I/O.

It depends on what you mean by "faster". HTTP requests are IO bound, thus it is to be expected that the throughput of an IO-bound service benefits from a technology that prevents your process from sitting idle while waiting for IO.

Thus it's surprising that Python's async code performs worse, not better, in both throughput and latency.

> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster"

The findings reported in the blog post you're commenting on are the exact opposite of your claim: Python's async performs worse than its sync counterpart.


We need to stop saying “faster” with regards to async. The point of async was always about fitting more requests per compute resource and/or making systems more latency-consistent under load.

“Faster” is misleading because the speed improvements that you get with async are very dependent on load. At low levels there are typically going to be negligible or no speed gains, but at higher levels the benefit will be incredibly obvious.

The one caveat to this is cases where async allows you to run two requests in parallel, rather than sequentially. I would argue that this is less about async than it is about concurrency, and how async can make some concurrent workloads more ergonomic to program.


you just contradicted yourself:

> “Faster” is misleading

and

> "At low levels there is going to typically be negligible or no speed gains, but at higher levels the benefit will be incredibly obvious."

there are no "speed" gains, period. the same amount of work will be accomplished in the same amount of time with threads or async. async makes it more memory efficient to have a huge number of clients waiting concurrently for results on slow services, but the point at which all of those clients walk off with their data will not be reached "faster" than with threads.

the reason that asyncio advocates say that asyncio is "faster" is based on the notion that the OS thread scheduler is slow, and that async context switches are some combination of less frequent and more efficient such that async is faster. This may be the case for other languages but for Python's async implementations it is not the case, and benchmarks continue to show this.
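
To make that concrete, here is a rough sketch (a toy, not a rigorous benchmark): a thousand concurrent one-second waits finish in roughly one second whether you use a thread pool or asyncio; the difference is memory and scheduling overhead, not wall-clock time.

    import asyncio
    import time
    from concurrent.futures import ThreadPoolExecutor

    N = 1000

    def slow_call_sync(_):
        time.sleep(1)            # stands in for a blocking read from a slow service

    async def slow_call_async():
        await asyncio.sleep(1)   # stands in for a non-blocking read from a slow service

    def with_threads():
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=N) as pool:
            list(pool.map(slow_call_sync, range(N)))
        return time.perf_counter() - start

    async def with_asyncio():
        start = time.perf_counter()
        await asyncio.gather(*(slow_call_async() for _ in range(N)))
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("threads:", with_threads())              # roughly 1 second
        print("asyncio:", asyncio.run(with_asyncio())) # roughly 1 second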


I did not contradict myself; saying that async is “faster” implies speed gains in all circumstances. In reality the benefits of async io are extremely load dependent, which is why I don’t want to call it “faster”.


The other thing about async is that, in some scenarios, it can make shared resource use clearer - i.e. in a program I've written, the design is such that one type on one thread (a producer) owns the data and passes it to consumers directly, rather than trying to deal with lock-free algorithms and mutexes for sharing the data and suchlike. A multi-threaded ring buffer is much less clearly correct than a single-threaded one.
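
A rough sketch of that shape, assuming an asyncio.Queue-based design rather than a literal ring buffer: the producer coroutine owns the data and hands items over on a single thread, so no locks are needed around the shared buffer.

    import asyncio

    async def producer(queue):
        for item in range(10):
            await queue.put(item)   # single-threaded: no mutex needed around the buffer
        await queue.put(None)       # sentinel telling the consumer to stop

    async def consumer(queue):
        while (item := await queue.get()) is not None:
            print("consumed", item)

    async def main():
        queue = asyncio.Queue(maxsize=4)  # bounded, so the producer naturally backs off
        await asyncio.gather(producer(queue), consumer(queue))

    asyncio.run(main())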


> but that bit of overhead is probably a lot less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external service.

You are not waiting for that 1000ms, and you haven't been for 35 years, since the first OSes started featuring preemptive multitasking.

When you wait on a socket, the OS will remove you from the CPU and schedule something that is not waiting in your place. When data is ready, you are placed back. You aren't wasting the CPU cycles waiting, only the ones the OS needs to save your state.

Actually standing there and waiting on the socket is not a thing people have done for a long time.


> You are not waiting for that 1000ms, and you haven't been for 35 years, since the first OSes started featuring preemptive multitasking.

The point is that async IO allows your own process/thread to progress while waiting for IO. Preemptive multitasking just assigns the CPU to something else while waiting, which is good for the box as a whole, but not necessarily productive for that one process (unless it is multithreaded).


Sync I/O still lets your process (though not the blocking thread) do something else. In other languages, async I/O is faster because it avoids context switches and amortizes kernel crossings. Apparently this is not the case in practice for python.

This doesn’t surprise me at all, as I’ve had to deal with async python in production, and it was a performance and reliability nightmare compared to the async Java and C++ it interacted with.


>it's to prevent your process sitting idle while it waits for I/O.

...with the goal of making your application faster.


... no. With the goal of allowing concurrency without parallelism.

In doing that, you're removing natural parallelism, and end up competing with the kernel scheduler, both in performance and in scheduling decisions.


This is a lazy argument. We get it, you know what coroutines are and how the kernel scheduler works (so does everyone else in this thread).

That doesn't matter though. If you think the average python user is looking for "concurrency without parallelism" with no speed/performance goal in mind, you totally have the wrong demographic.

The fact that the language chose to implement asyncio on a single thread (again, the end user doesn't care that this is the case; it could have been a thread/core abstraction like goroutines), with little gain, which led to a huge fragmentation of its library ecosystem, is bad. Even worse that it was done in 2018. Doesn't matter how smart you are about the internals.


How in the world did you come to the conclusion that I thought Python users wanted that? I simply concluded that it's the only thing it provides. I wasn't saying it was a good thing, which I think was what you might have read it as.

Python implements things on a single thread due to language restrictions (or rather, reference implementation restrictions), as the GIL, as always, disallows parallel interpreter access, so multiple Python threads serve little purpose other than waiting for sync I/O. It's been many years since I followed Python development, but back then all GIL removal work had unfortunately come to a halt...


> ...with the goal

> ... no. With the goal

I assumed those meant the end user of the language (it is fair to assume the person you responded to meant that). The goal of the language itself was probably to stay trendy - e.g. JS/Golang/Nim/Rust/etc had decent async stories, whereas python didn't. Python needed async syntax support as the threading and multiprocessing interfaces were clunky compared to others in the space. What they ended up with arguably isn't good.

I'm pretty familiar with those restrictions, which is why I expected this thread to be more of "yeah it sucks that it's slower" instead of pulling the "coroutines don't technically make anything faster per se" argument, which is distracting.


I see this elitist attitude all over the internet. First it was people saying “Guys why are you over reacting to corona the flu is worse.”

Then it was people saying “Guys, stop buying surgical masks, The science says they don’t work it’s like putting a rag over your mouth.”

All of these so-called expert know-it-alls were wrong, and now we have another expert on asynchronous python telling us he knows better and he’s not surprised. No dude, you’re just another guy on the internet pretending he’s a know-it-all.

If you are any good, you’ll realize that nodejs will beat the flask implementation any day of the week and the nodejs model is exactly identical to the python async model. Nodejs blew everything out of the water, and it showed that asynchronous single threaded code was better for exactly the test this benchmark is running.

It’s not obvious at all. Why is the node framework faster than python async? Why can’t python async beat python sync when node can do it easily? What is the specific flaw within python itself that is causing this? Don’t answer that question because you don’t actually know, man. Just do what you always do and wait for a well-intentioned, humble person to run a benchmark, then comment on it with your elitist know-it-all attitude claiming you’re not surprised.

Is there a word for these types of people? They are all over the internet. If we invent a label maybe they’ll start becoming self aware and start acting more down to earth.


> Nodejs blew everything out of the water

Node's JIT comes from a web browser's javascript implementation used by billions of people. It's also had async baked in from day one.

Python started single process, added threading, and then bolted async on top of that. And CPython is a pretty straight interpreter.

A comparison between Node and PyPy would be more informative, but PyPy has a far less mature JIT and still has to deal with Python's dynamism.

> If we invent a label maybe they’ll start becoming self aware and start acting more down to earth.

You can't lecture people into self-awareness, any more than experts can lecture everyone into wearing masks.


Except IO is the bottleneck here. The concurrency model for IO should determine overall speed. If python async is slower for IO tasks than sync, then that IS an unexpected result and an indication of a python-specific problem.


> Except IO is the bottleneck here.

If you say IO is the bottleneck, then you're claiming there is no significant difference between python and node. That's what a bottleneck means.

> The concurrency model for IO should determine overall speed.

"Speed" is meaningless, it's either latency or throughput. Yeah, yeah, sob in your pillow about how mean elites are, clean up your mascara, and learn the correct terminology.

We've already claimed the concurrency model is asynchronous IO for both python and node. Since they are both doing the same basic thing, setting up an event loop and polling the OS for responses, it's not an issue of which has a superior model.

> If python async is slower for IO tasks than sync, then that IS an unexpected result and an indication of a python-specific problem.

Both sync and async IO have their own implementations. If you read from a file synchronously, you're calling out to the OS and getting a result back with no interpreter involvement. This[2] is a simple single-threaded server in C. All it does is tell the kernel, "here's my IO, wake me up when it's done."

When you do async work, you have to schedule IO and then poll for it. This[1] is an example of doing that with epoll in straight C. Polling involves more calls into the kernel to tell it what events to look for, and then the application has to branch through different possible events.
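
Roughly what that register-then-poll pattern looks like from Python, as a sketch built on the stdlib selectors module (which wraps epoll/kqueue and is what asyncio's selector event loop builds on):

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(server_sock):
        conn, _addr = server_sock.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, echo)   # tell the kernel what to watch

    def echo(conn):
        data = conn.recv(1024)
        if data:
            conn.send(data)
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("localhost", 8080))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:
        # extra kernel round trips: ask which registered fds are ready,
        # then branch on the event and dispatch back in the interpreter
        for key, _events in sel.select():
            key.data(key.fileobj)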

And you can't avoid this if you want to manage IO asynchronously. If you use synchronous IO in threading or processes, you're still constructing threads or processes. (Which makes sense if you needed them anyway.)

So unless an interpreter builds its synchronous calls on top of async, sync necessarily has less involvement with both the kernel and interpreter.

The reason the interpreter matters is because the latency picture of async is very linear:

* event loop wakes up task
* interpreter processes application code
* application wants to open / read / write / etc
* interpreter processes stdlib adding a new task
* event loop wakes up IO task
* interpreter processes stdlib checking on task
* kernel actually checks on task

Since an event loop is a single-threaded operation, each one of these operations is sequential. Your maximum throughput, then, is limited by the interpreter being able to complete IO operations as fast as it is asked to initiate them.

I'm not familiar enough with it to be certain, but Node may do much of that work in entirely native code. Python is likely slow because it implements the event loop in python[3].

So, not only is Python's interpreter slower than Node's, but it's having to shuffle tasks in the interpreter. If Node is managing a single event loop all in low level code, that's less work it's doing, and even if it's not, Node can JIT-compile some or all of that interpreter work.

[1]: https://github.com/o0myself0o/epoll/blob/master/epoll.c

[2]: https://www.programminglogic.com/example-of-client-server-pr...

[3]: https://github.com/python/cpython/blob/3.8/Lib/asyncio/unix_...


>If you say IO is the bottleneck, then you're claiming there is no significant difference between python and node. That's what a bottleneck means.

This is my claim: this SHOULD be what's happening, under the obvious logic that tasks handled in parallel with IO should be faster than tasks handled sequentially, and under the assumption that IO takes up way more time than local processing.

Like I said, given that this is NOT happening within the python ecosystem, and assuming the axioms above are true, this indicates a flaw that is python-specific.

>The reason the interpreter matters is because the latency picture of async is very linear:

I would say it shouldn't matter if done properly because the local latency picture should be a fraction of the time when compared to round trip travel time and database processing.

>Python is likely slow because it implements the event loop in python

Yeah, we're in agreement. I said it was a python specific problem.

Take a single task in this benchmark for python. If the interpreter spends more time processing the task locally than the total round-trip travel time and database processing time, then the database is faster than python. If database calls are faster than python, then this is a python-specific issue.


You're making the classic mistake of assuming a common thread connects the people who've annoyed you in various unrelated contexts.


I mean no one even mentioned node. Maybe it is faster idk. But we're talking about python?


His async code creates a pool with only 10 max connections[1] (the default). Whereas his sync pool[2], with a flask app that has 16 workers, has significantly more database connections.

I expect upping this number would have a positive effect on asyncio numbers because the only thing[3] this[4] is[5] measuring[6] is how many database connections you have, and is about as far from a realistic workload as you can get.

Change your app to make 3 parallel requests to httpbin, collect the responses and insert them into the database. That's an actually realistic asyncio workload rather than a single DB query on a very contested pool. I'd be very interested to see how sync frameworks fare with that.

1. https://github.com/calpaterson/python-web-perf/blob/master/a...

2. https://github.com/calpaterson/python-web-perf/blob/master/s...

3. https://github.com/calpaterson/python-web-perf/blob/master/a...

4. https://github.com/calpaterson/python-web-perf/blob/master/a...

5. https://github.com/calpaterson/python-web-perf/blob/master/a...

6. https://github.com/calpaterson/python-web-perf/blob/master/a...
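
For concreteness, the shape I mean is roughly this (aiohttp for the outbound requests; insert_responses is just a stand-in for whatever async DB layer you use):

    import asyncio
    import aiohttp

    async def fetch_json(session, url):
        async with session.get(url) as resp:
            return await resp.json()

    async def handle_request(db_pool):
        async with aiohttp.ClientSession() as session:
            bodies = await asyncio.gather(
                fetch_json(session, "https://httpbin.org/get"),
                fetch_json(session, "https://httpbin.org/uuid"),
                fetch_json(session, "https://httpbin.org/ip"),
            )
        # insert_responses is a placeholder for your actual DB write
        await insert_responses(db_pool, bodies)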


Hi - as mentioned in the article all connections went through pgbouncer (limited to 20) and I was careful to ensure that all configurations saturated the CPU so I'm pretty confident they were not waiting on connections to open. Opening a connection from pgbouncer over a unix socket is very fast indeed - my guess is perhaps a couple of orders of magnitude faster than without it. 20 connections divided by 4 CPUs is a lot, and pretty much all CPU time was still spent in Python.

Sidenote here: one thing I found but didn't mention (the reason I put in the pooling, both in Python and pgbouncer) is that otherwise, under load, the async implementations would flood postgres with open connections and everything would just break down.
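
For reference, capping the pool on the Python side with aiopg looks roughly like this (the DSN and sizes here are illustrative, not the exact benchmark settings):

    import aiopg

    async def make_pool():
        # without an explicit cap, every in-flight request can end up
        # holding its own postgres connection under load
        return await aiopg.create_pool(
            "host=127.0.0.1 port=6432 dbname=test",
            minsize=4,
            maxsize=10,
        )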

I think making a database query and responding with JSON is a very realistic workload. I've coded that up many times. Changing it to make requests to other things (mimicking a microservice architecture) is also interesting and if you did that I'd be interested to read your write up.


Aren't you still capping the throughput by the query rate of your connection pool though? By limiting that, you are limiting the application as a whole - i.e. your benchmark is bound by the speed of your database, and has (almost) nothing to do with the performance of a specific python implementation.


Only if there are spare resources left to saturate the connection pools, which didn't seem to be the case.

If the system as a whole is well saturated, and the python processes dominate the system load with a DB load proportional to the requests served, then I don't think we would hit any external bottlenecks.

The benchmarks performed are not that great (e.g., virtualized, same machine for all components, etc.), but I don't think the errors are enough to throw off the result. Note, of course, that such results are not universal, and some loads might perform better async.


If the benchmark is bound by the database speed, wouldn't the expected result be that all implementations returned roughly the same number of requests per second?


>Sidenote here: one thing I found but didn't mention (the reason I put in the pooling, both in Python and pgbouncer) is that otherwise, under load, the async implementations would flood postgres with open connections and everything would just break down.

Doesn't this prove that async is waiting for connections when you put a limit on it? The only way async wins is if it is free to hit the db whenever it needs to.


But why is async spending so much CPU if it just waits?


Who knows. The point is, if, when not restricted, you get a ton of db connections, then any restriction on that almost definitely means you are imposing a bottleneck. The only way this would not be the case is if it was trying to create db connections when it didn't need them, which is unlikely.


So the CPU and database are the bottlenecks not async Python.


The benchmark is certainly flawed, but I don't see how you can jump to that conclusion.


> His async code creates a pool with only 10 max connections[1] (the default). Whereas his sync pool[2], with a flask app that has 16 workers, has significantly more database connections.

And the reasoning is explained in the article:

"The rule I used for deciding on what the optimal number of worker processes was is simple: for each framework I started at a single worker and increased the worker count successively until performance got worse."


That is talking about WSGI worker processes. OP is talking about database pool connections. They are not the same thing.


Seems many commenters missed this statement. It's also troubling how common it is to hear assertions that async is king, especially on projects where your future scale is unknown. Based on the https://web.archive.org/web/20160203172420/https://www.maili... presentation, it looks like there is a stronger case for a sync model as the default.


> Change your app to make 3 parallel requests to httpbin, collect the responses and insert them into the database. That's an actually realistic asyncio workload

I don't see how that is a more "realistic" asyncio workload.

It might be a workload that async is better suited for, but the point of the article is to compare async web frameworks, which will often be used just to fetch and return some data from the db.

If you had an endpoint which needed to fetch 3 items from httpbin and insert them in the db it may make sense to use asyncio tools for that, even within the context of a web app running under a sync framework+server like Falcon+Gunicorn.

In my experience Python web apps (Django!) often spend surprisingly little time waiting on the db to return results, and a relatively large amount of time cpu-bound instantiating ORM model instances from the db data, then transforming those instances back into primitive types that can be serialized to JSON in an HTTP response. In that context I am not surprised if sync server with more processes is performing better. In this test it's not even that bad... the 'ORM' seems to be returning just a tuple which is transformed to a dict and then serialized.


This raises an interesting point - async is less well suited to languages that are, well, slow as molasses. If your language is so slow that basic operations dominate even network IO, you're not going to gain much.


I should add that when I said above "I am not surprised if sync server with more processes is performing better"... that's only after reading this article and thinking about it

until then I'd pretty much bought the hype that the new async frameworks running on Uvicorn were the way to go

I'm very glad to see this kind of comparative test being made, it's very useful, even if it later gets refined and added to and the results more nuanced


He only has 4 CPUs. I doubt raising the worker count is going to help the async situation. From my experience it’s really hard to make async outperform sync when databases are involved because the async layer adds so much overhead. Only when you are completely io bound with lots of connections does async outperform sync in python.


> From my experience it’s really hard to make async outperform sync when databases are involved because the async layer adds so much overhead

Highly disagree as the database is just another IO connection to a server, which is asyncio bread and butter. Being able to stream data from longer running queries without buffering and whilst serving other requests (and making other queries) is really quite powerful.

But yeah, if you're maxing out your database with sync code then async isn't going to make it magically go faster.


Hi, take a look at my benchmarks from five years ago at https://techspot.zzzeek.org/2015/02/15/asynchronous-python-a.... The extra variable with Python is that it's a very CPU-heavy interpreted language and it's really unlikely for an application to be significantly IO bound with a database server on the same network within the realm of CRUD-style queries. asyncio was significantly slower than threads (noting they've made performance improvements since then) and gevent was about the same (which I'm pretty sure is close to as fast as you can get for async in Python).


The database is mostly just idle IO. You send a query and then you wait for results. That’s something sync python is decent at because when you wait for that IO the GIL is released. The situation is different if there is a lot of activity on the epoll/kqueue etc. (connects, data ready etc.).


Apologies - I completely misread your initial comment. Yeah that's correct.

Despite this I think it's quite rare to hit this limit, at least in the orchestration-style use cases I use asyncio for. With those I value making independent progress on a number of async tasks rather than potentially being blocked waiting for a worker thread to become available.


Before you criticize the article, you should read it. He wrote a whole section about the specific worker numbers and why and how he chose them.


On top of that, the author uses aiopg rather than asyncpg[1] for the async database operations, even though asyncpg is (allegedly) a whole lot faster.

1. https://github.com/MagicStack/asyncpg


asyncpg is not scalable. It can only do "session pooling" because it needs advisory_locks, listen/notify, which will end up needing a lot of Postgresql connections.


Can you share more information on this (articles, etc)?


There is no 1 article to explain but you can research each part.

1. One Postgresql connection is a forked process and has memory overhead (4MB iirc) + context switching.

2. A connection can only execute 1 concurrent query (no multiplexing).

3. Asyncpg, to be fast, uses the features that I mentioned in my parent post. Those can only be used with session pooling: https://www.pgbouncer.org/features.html.

The whole point of async is to do some other work while waiting for a query (e.g. a different query).

If you have 10 servers with 16 cores, each vcore has 1 python process, each python process doing 10 simultaneous queries. 10 * 16 * 10 = 1600 opened connections.

The best way IMHO is to use autocommit connections. This way your transactions execute in 1 RPC. You can keep multiple connections open with very light CPU cost, and pooling works best this way.

I've done 20K short lived queries/second from 1 process with only ~20 connections opened in Postgresql (using Pgbouncer statement pooling).
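
A rough sketch of the autocommit style I mean, using psycopg2 (the pool size and the DSN pointing at pgbouncer are just examples):

    import psycopg2.pool

    # ~20 connections, pointed at pgbouncer rather than postgres directly
    pool = psycopg2.pool.ThreadedConnectionPool(
        minconn=1, maxconn=20,
        dsn="host=127.0.0.1 port=6432 dbname=app user=app",
    )

    def run_query(sql, params=()):
        conn = pool.getconn()
        try:
            conn.autocommit = True   # each statement is its own transaction: one round trip
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        finally:
            pool.putconn(conn)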


Absolutely agree; add to this the quality of the driver, aiopg vs asyncpg...


I am SUPER happy someone else is finally looking at this. It is long past time that the reflexive use of asyncio or systems like gevent/eventlet for no other reason than "hand-wavy SPEED" come to an end. That web applications that literally serve just one user at a time are built in Tornado for "speed". (my example for this is the otherwise excellent SnakeViz: https://jiffyclub.github.io/snakeviz/ which IMO should have just used wsgiref).

As the blog post apparently cites as well (woo!), I wrote about the myth of "async == speed" some years ago here and my conclusions were identical.

https://techspot.zzzeek.org/2015/02/15/asynchronous-python-a...


Hi - yes loved your blogpost! Also very tired of the "async magic performance fairy dust" :)

It's a difficult myth to dispel and I think the situation in terms of public mindshare is much worse now than it was in 2015. Some very silly claims from the async crowd now have basically widespread credence. I think one of the root causes is that people are sometimes very woolly about how multi-processing works. One of the others is that I think it's easy to make the conceptual mistake of 1 sync worker = 1 async worker and do a comparison that way.

One of my worries is that right now it feels like everything in Python is being rewritten in asyncio and the balkanisation of the community could well be more problematic than 2 vs 3.


> One of my worries is that right now it feels like everything in Python is being rewritten in asyncio and the balkanisation of the community could well be more problematic than 2 vs 3.

this is exactly why the issue is so concerning for me as well.


> I think the situation in terms of public mindshare is much worse now than it was in 2015

Ok, in 2015 it was a pain, but with Python 3.8 it's actually only joy & fun in my opinion.

> the balkanisation of the community could well be more problematic than 2 vs 3

If you could call Python2 code from Python3 or vice-versa as easily as you can do with async then it would be comparable.


For me it's worth the effort to deal with async if it means not having to deal with uwsgi or other frontends. But in general I think Python has too many problems (packaging, performance, distribution, etc) that it doesn't make sense IMO to invest in new Python projects.


uWSGI is a lot of joy for me, really. I've never been happier with my deployments since I discovered uWSGI back in 2008 or something, and nowadays it supports plenty of languages, so there's just nothing I don't deploy on uWSGI anymore.

Python packaging is something that I have fully automated (maintaining over 50 packages here) and that I'm pretty happy with.

I fail to see the problem with Python packaging, maybe because I have an aggressive continuous integration practice? (Always integrate upstream changes, contribute to dependencies that I need, and when I'm not doing TDD it's only because I don't yet have proof that the code I'm writing is actually going to be useful.) That's not something everybody wants to do (I don't understand their reasoning though).

People would rather freeze their dependencies and then cry because upgrading is a lot of work, instead of upgrading at the rhythm of upstream releases. If other package managers or other languages have packaging features that encourage what I consider to be non-continuous integration then good for them, but that's not how a hacker like me wants to work. Being able to "ignore upstream releases" is not a good feature, it made me a sad developer really; "ignoring non-latest releases" has made me a really happy developer.

Most performance issues are not imputable to the language. If they are, they're probably not affecting all your features; you can still rewrite the feature that Python doesn't perform well enough for in a compiled language. I need most of my code to be easy to manipulate, and very little of it to actually outperform Python.

I've recently re-assessed whether I should keep going with Python for another 10 years, tried a bunch of languages and frameworks, and at the end of the month I still wanted a language that's easy to manipulate with basic text tools, that's sufficiently easy so that I can onboard junior colleagues on my tools, and that provides sufficiently advanced OOP, because I find it efficient for structuring and reusing code.

Python does what it claims, it solves a basic human-computer problem, let's face it: it's here to stay and shine, and its wide ecosystem seems like solid proof. Whether it makes sense to invest in a project or not should not depend on the language anyway.


> uWSGI is a lot of joy for me, there's nothing I don't deploy on uWSGI, even PHP code.

Oh man, we moved away from uwsgi to async a couple of years ago and that's been one of the best decisions we've made. Async is no walk in the park, but not having to deal with uwsgi configuration, etc has been well worth it.

> Python packaging is something that I have fully automated (maintaining over 50 packages here) and that I'm pretty happy with.

Yeah, I don't doubt this. Many people have found a happy path that works for them, but I've found that those tend to be people who don't have significant constraints (e.g., they don't need fast builds, or they don't care about reproducibility, or they don't have to deal with a large number of regular contributors, or etc).

> Most performance issues are not imputable to the language.

This isn't true in a meaningful sense. For the most part, if you're doing anything more complicated than a CRUD app, you will run into performance problems with Python almost immediately upon leaving the prototype phase, and your main options for improving performance are horizontal scaling (multiprocess/multihost parallelism) or rewriting the hot path in a faster language. As previously discussed, these options only work for certain use cases where the ratio of de/serialization to real work is low, so you often find yourself without options. Further, horizontal scaling is expensive (compute is expensive) and rewriting in a different language is differently expensive (you now have to integrate a separate build system and employ developers who are not only well-versed in the new language, but also in implementing Python extensions specifically).

On the other hand, if you chose a language like Go, you would be in the same ballpark of maintainability, onboarding, etc (many would argue Go is easier to write and maintain due to simplicity and static typing) but you would be in a much better place with respect to packaging and performance. You likely wouldn't need to optimize anything since naive Go tends to be 10-100X faster than naive Python, and if you needed to optimize, you can do so in-language without paying any sort of de/serialization overhead (parallelism, memory management, etc), allowing you to eke out another magnitude of performance. There are other options besides Go that also give performance gains, but they often involve trading off simplicity/packaging/deployment/tooling/ecosystem/etc.

> If they are, they're probably not affecting all your features; you can still rewrite the feature that Python doesn't perform well enough for in a compiled language.

This is true, but "rewriting features" is usually prohibitively expensive, and it's often non-trivial to figure out up-front which features will have performance problems in the future such that you could otherwise avoid a rewrite.

> Python does what it claims that's a basic human problem, let's face it: it's here to stay and shine.

Yes, Python is here to stay, but that's more attributable to network effects and misinformation than merit in my experience.


Well, we can't use uWSGI for ASGI, but it's still good for us for anything else. I literally have 0 uWSGI configuration files, just a uWSGI command as the container command.

> Many people have found a happy path that works for them, but I've found that those tend to be people who don't have significant constraints (e.g., they don't need fast builds, or they don't care about reproducibility, or they don't have to deal with a large number of regular contributors, or etc).

I'm really curious about this statement, building a python codebase for me means building a container image, if the system packages or python dependencies don't change then it's really going to take less than a minute. What does your build look like ?

Can you define "a large number of regular contributors".

What do you mean "they don't need reproducibility"? I suppose they just build a container image in a minute and then go over and deploy on some host. If a dependency breaks the code, it's still reproducible, but broken; then it has to be fixed rather than ignored, though a temporary version pin is fine in the meantime.

> This is true, but "rewriting features" is usually prohibitively expensive, and it's often non-trivial to figure out up-front which features will have performance problems in the future such that you could otherwise avoid a rewrite.

If Go is so much easier to write then I fail to see how it can be a problem to use Go to rewrite a feature for which performance is mission critical, and for which you have final specifications in the python implementation you're replacing. But why write it in Go instead of Rust, Julia, Nim, or even something else ?

You're going to choose the most appropriate language for what exactly you have to code. If you're trying to outperform an interpreted language and/or don't care about being stuck with a rudimentary pseudo-object oriented feature set then choose such a compiled language. Otherwise, Python is a pretty decent choice.

> Yes, Python is here to stay, but that's more attributable to network effects and misinformation than merit in my experience.

If Go was easier to write and read, why would they implement a Python subset in Go for configuration files, instead of just having configuration files in Go? go.starlark.net Oh right, because it's not as easy to read and write as Python, and because you'd need to recompile. So apparently even Google, who basically invented Go, seem to need to support some Python dialect.

10-100X performance is most probably something you'll never need when starting a project, unless performance is mission critical from the start. Static types and compilation are an advantage for you, but for me dynamic typing and interpretation mean freedom (again, I'm going to TDD on one hand and fix runtime exceptions as soon as I see them in applicative monitoring anyway).

I don't believe comparing Python and Go is really relevant, comparing PHP and Ruby and Python for example would seem more appropriate, when you say "people shouldn't need Python because they have Go" I fail to see the difference with just saying "people shouldn't need interpreted languages because there are compiled languages".

Humans need a basic programming language that is easy to write and read, without caring about having to compile it for their target architecture, Python claims to do that, and does it decently. If you're looking for more, or something else, then nobody said that you should be using Python.

I might be wrong, but when I'm talking about Humans, I'm referring to, what I have seen during the last 20 years as 99% of the projects out there in the wild, not the 1% of projects that have extremely specific mission critical performance requirements, thousands of daily contributors, and the like. Those are also pretty cool, and they need pretty cool technology, but it's really not the same requirements. For me saying everybody needs Go would look a bit like saying everybody needs k8s or AWS. Languages are many and solve different purposes. The one that Python serves is staying, not by misinformation, but because of Human nature.


> What does your build look like ?

Running tests, building a PEX file, putting the PEX file into a container image. We have probably about a dozen container images and counting at this point. The tests take a long time (because Python is 2+ orders of magnitude slower than other languages), and our CI bill is killing us (we're looking into other CI providers as well).

> Can you define "a large number of regular contributors".

More than 20 (although our eng org is 30-50). Multiple teams. You don't want to hold everyone's hand and show them all the tips and tricks you've found for working around the quirks of Python packaging or give them an education on wheels, bdists, sdists, virtualenvs, pipenvs, pyenvs, poetries, eggs, etc. They were promised Python was going to be easy and they wouldn't have to learn a bunch of things, after all.

> What do you mean "they don't need reproducibility"? I suppose they just build a container image in a minute and then go over and deploy on some host.

Container images aren't reproducible in practice. Moreover, they have to also be reproducible for local development, and we use macs and Docker for mac is prohibitively slow. Need something else to make sure developers aren't dealing with dependency hell.

> If Go is so much easier to write then I fail to see how it can be a problem to use Go to rewrite a feature for which performance is mission critical, and for which you have final specifications in the python implementation you're replacing.

Both can be true: Go is easier to write than Python and it's still prohibitively expensive to rewrite a whole feature in Go. If the feature is small, well-designed, and easily isolated from the rest of the system, then rewriting is cheap enough, but these cases are rare and "opportunity cost" is a real thing--time spent rewriting is time not spent building new features.

> But why write it in Go instead of Rust, Julia, Nim, or even something else ?

Because Rust slows development velocity by an order of magnitude and Julia and Nim aren't mature general-purpose application development languages.

> You're going to choose the most appropriate language for what exactly you have to code. If you're trying to outperform an interpreted language and/or don't care about being stuck with a rudimentary pseudo-object oriented feature set then choose such a compiled language. Otherwise, Python is a pretty decent choice.

Yes, you have to choose the most appropriate language, but I contend that Python is a pretty rubbish choice for reasons that people often fail to consider up front. E.g., "My app will never need to be fast, and if it's slow I can just rewrite the slow parts in C!".

> If Go was easier to write and read, why would they implement a Python subset in Go for configuration files, instead of just having configuration files in Go? go.starlark.net Oh right, because it's not as easy to read and write as Python, and because you'd need to recompile. So apparently even Google, who basically invented Go, seem to need to support some Python dialect.

Starlark is pretty cool though, and I use it a lot; I just wish it were statically typed.

Apples and oranges. Starlark is an embedded scripting language, not an app dev language. Different design goals. It also probably pre-dates Go, or at least derives from something which pre-dates Go.

> 10-100X performance is most probably something you'll never need when starting a project, unless performance is mission critical from the start.

You would be surprised. As soon as you're doing something moderately complex with a small-but-not-tiny data set you can easily find yourself in the tens of seconds. And 100X is the difference between a subsecond request and an HTTP timeout. It matters a lot.

> Static types and compilation are an advantage for you, but for me dynamic typing and interpretation mean freedom (again, I'm going to TDD on one hand and fix runtime exceptions as soon as I see them in applicative monitoring anyway).

We do TDD for our application development too and we still see hundreds of typing errors in production every week. I think your idea of "static typing" is colored by Java or C++ or something; you can have fast, flexible iteration cycles with Go or many of the newer classes of statically typed languages, as previously mentioned. "Type inference" (in moderation) is your friend. Anyway, Go programs can often compile in the time it takes a Python program to finish importing its dependencies. A Go test can complete in a fraction of the time it takes for pytest to start testing (no idea why it takes so long for it to find all of the tests).

> I don't believe comparing Python and Go is really relevant, comparing PHP and Ruby and Python for example would seem more appropriate, when you say "people shouldn't need Python because they have Go" I fail to see the difference with just saying "people shouldn't need interpreted languages because there are compiled languages".

"compiled" and "interpreted" aren't use cases. "General app dev" is a use case. Python and Go compete in the same classes of tools: web apps, CLI applications, devops automation, lambda functions, etc. PHP and Ruby are also in many of these spaces as well. I don't especially care if Python is the fastest interpreted language (it's not by a long shot), I care if it's fast enough for my application (it's not by a long shot).

> Humans need a basic programming language that is easy to write and read, without caring about having to compile it for their target architecture, Python claims to do that, and does it decently. If you're looking for more, or something else, then nobody said that you should be using Python.

Lots of people recommend Python for use cases for which it's not well suited, and since so many Python dependencies are C, you absolutely have to worry about recompiling for your target architecture, and it's much, much harder than with Go (to recompile a Go project for another architecture, just set the OS and the architecture via the `GOOS` and `GOARCH` env vars and rerun `go build`--you'll have a deployable binary before your Python Docker image finishes building).

> I might be wrong, but when I'm talking about Humans, I'm referring to, what I have seen during the last 20 years as 99% of the projects out there in the wild, not the 1% of projects that have extremely specific mission critical performance requirements

Right, Python is alright for CRUD apps or any other kind of app where the heavy lifting can easily be off-loaded to another language. There's still the build issues and everything else to worry about, but at least performance isn't the problem. But I think you'll be surprised to find out that lots of apps don't fit that bill.

> For me saying everybody needs Go would look a bit like saying everybody needs k8s or AWS.

I'm not saying everyone needs Go, I'm saying that Go is a better Python than Python. There are a handful of exceptions--there's not currently a solid Go-alternative for django, and I wouldn't be surprised if the data science ecosystem was less mature. But for general purpose development, I think Go beats Python at its own game. And I've been playing that game for a decade now. This conversation has been pretty competitive, but I really encourage you to give Go a try--I think you'll come around eventually, and you can learn it so fast that you can be writing interesting programs with it in just a few hours. Check out the tour: https://tour.golang.org.


I understand that if you're building a PEX file then all dependencies must be reinstalled into it every time, however you might still be able to leverage container layer caching to save the download time.

CI bills are awful, I always deploy my own CI server, a gitlab-runner where I also spawn a Traefik instance to practice eXtreme DevOps.

More than 20 daily contributors, that's nice, but I must admit that I have contributed to some major python projects that don't have a packaging problem, such as Ansible or Django. So, I'm not sure if the number of contributors is really a factor in packaging success. That said, sdists and wheels are things that happen in CI for me, it's just adding this to my .gitlab-ci.yml:

    pypi:
        stage: deploy
        script: pypi-release
And adding TWINE_{USERNAME,PASSWORD} to CI. The other trick is to use the excellent setupmeta or something like that (OpenStack also has a solution) so that setup.py discovers the version based on the git tag or publishes a dev version.

That's how I automate the packaging of all my Python packages (I have something similar for my NPM packages). As for virtualenvs, it's true that they are great but I don't use them, I use pip install --user, which has the drawback that you need all your software to run with the latest releases of dependencies, otherwise you have to contribute the fixes, but I'm a happier developer this way, and my colleagues aren't blocked by a breaking upstream release very often, they will just pin a version if they need to keep working while somebody takes care of changing our code and contributing to dependencies to make everything work with the latest versions.

I don't think that other languages are immune to version compatibility issues, I don't think that problem is language dependent: either you pin your versions and forget about upstream releases, or you aggressively integrate upstream releases continuously into your code and your dependencies.

> My app will never need to be fast

I maintain a governmental service that was in production in less than 3 months, then 21 months of continuous development, serving 60m citizens with a few thousand administrators, as sole techie, on a single server, for the third year. Needless to say, my country has never seen such a fast and useful project. I have not optimized anything. Of course you can imagine it's not my first project in this case. For me, "Python's speed is most often not a problem" is not a lie, I proved it.

The project does have a slightly complex database, the administration interface does implement really tight permission granularity (each department has its own admin team with users of different roles), it did have to iterate quickly, but you know the story with Django : changing the DB schema is easy, migrations are generated by Django, you can write data migrations easily, tests will tell you what you broke, you write new tests (I also use snapshot testing so a lot of my tests actually write themselves), and upgrading a package is just as easy as fixing anything that broke when running the tests.

You seem to think that Python is outdated because it's old, and that's also what I thought when I went over all the alternatives for my next 10 years of app dev. I was ready to trash all my Python, really. But that's how I figured out that the human-computer problem Python solves will just always be relevant. I'll assume that you understand the point I made on that and that we simply disagree here.

Or maybe we don't really disagree, I'll agree with you that a compiled language is better for mission-critical components, but any of these will almost always need a CRUD and that's where Python shines.

But I've not always been making CRUDs with Python, I have 2 years of experience as an OpenStack developer, and I must admit that Python fit the bill pretty well here too. Maybe my cloud company was not big enough to have problems, or we just avoided the common mistakes. I know people like Rackspace had a hard time maintaining forks of the services; I was the sole maintainer of 4 network service rewrites which were basically 1 package using OpenStack as a framework (like I would use Django), to simply listen on RabbitMQ and do stuff on SDN and SSH. Then again, I think not that many people actually practice CI/CD correctly, so that's definitely going to be a problem for them at some point.

> there's not currently a solid Go-alternative for django

That's one of the things that put me off, I tried all the Go web frameworks, and they are pretty cool, but will they ever reach the productivity levels of Django, Rails or Symfony?

Meanwhile, I'm just waiting for the day someone puts me in charge of something where performance is sufficiently critical that I need to rewrite it in a compiled language; if I could have the chance to do some ASM optimizations that would also be a lot of fun. Another option is that I have something to contribute to a Go project, but so far, Go developers seem to be doing really fine without me for sure :)

Why do I choose it for general purpose development? I guess I'm stuck with "I love OOP" just like "the little functional programming Python offers".

I really enjoyed this conversation too, would like to share it on my blog if you don't mind, thank you for your time, have a great weekend.


This is kind of weird to me though. All this effort being spent to argue what is effectively a strawman belief only held by people who don't fully understand what they believe.


> hand-wavy SPEED

In general, you can get higher throughput with asyncio because you don't have context switches, but it comes at the cost of latency. So hand-wavy, indeed. It really depends what sort of speed you're after.


This is true as far as it goes, but is not testing the (very common) areas where async shines.

Imagine you're loading a profile page on some social networking site. You fetch the user's basic info, and then the information for N photos, and then from each photo the top 2 comments, and for each comment the profile pic of the commentor. You can't just fetch all this in one shot because there's data dependencies. So you start fetching with blocking IO, but that makes your wait time for this request proportional to the number of fetches, which might be large.

So instead, you ideally want your wait to be proportional to the depth of your dependency tree. But composing all these fetches that way is hard without the right abstraction. You can cobble it together with callbacks but it gets hairy fast.
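
Something like the following shape in asyncio terms, where fetch_user, fetch_photos, fetch_comments and fetch_profile_pic are hypothetical async data-fetching functions; the total wait grows with the depth of the dependency tree, not the number of fetches:

    import asyncio

    async def load_photo(photo_id):
        comments = await fetch_comments(photo_id, limit=2)           # depth 3
        pics = await asyncio.gather(                                 # depth 4, all commenters at once
            *(fetch_profile_pic(c["author_id"]) for c in comments))
        return {"photo": photo_id, "comments": comments, "pics": pics}

    async def load_profile(user_id):
        user = await fetch_user(user_id)                             # depth 1
        photo_ids = await fetch_photos(user_id)                      # depth 2
        photos = await asyncio.gather(                               # all photos in parallel
            *(load_photo(pid) for pid in photo_ids))
        return {"user": user, "photos": photos}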

So (outside of extreme scenarios) it's not really about whether async is abstractly faster than sync. It's about how real developers would solve the same problem with/without async.

(Source: I worked on product infrastructure in this area for many years at FB)


I felt baffled by this thread until I read this response. async/await for me has always been about managing this kind of dependency nightmare. I guess if all you have to do is spawn 100 jobs that run individually and report back to some kind of task manager then the performance gains of threads probably beat async/coroutine-based approaches on a pure speed benchmark. But when I have significant chains of dependent work then the very idea of using bare threads and callbacks to manage that is annoying.

At least in Typescript nowadays, the ability to just mark a function `async` and throw an `await` in front of its invocation drastically lowers the barrier to moving something from blocking to non-blocking. In the same cases if I had to recommend the same change with thread pools and callbacks (and the manual book-keeping around all that) most developers just wouldn't bother.


>... the very idea of using bare threads and callbacks to manage that is annoying.

Yeah, that's an extremely painful way to write threaded code. Much more normal is to simply block your thread while waiting for others to .Join() and return their results, likely behind an abstraction layer like a Future.

The only time you really need to use callbacks is when you need to blend async and threaded code, and you aren't able to block your current thread (e.g. Android main thread + any thread use is an example of this). But there are much much easier ways to deal with that if you need to do it a lot - put your primary logic in a different, blockable thread.


> just mark a function `async` and throw an `await` ... to [move] something from blocking to non-blocking.

That's not how it works. `async` and `await` are merely syntactic sugar around callbacks. Everything in javascript is already nonblocking[1], whether or not you use async/await.

[1] There are a few rare exceptions in node js (functions suffixed with "Sync"), but in the same vein, they are blocking whether or not you use async/await.


The argument was about the developer experience, not how things work behind the scenes. It's super simple for a developer to write this, for example:

    const a = someAsyncOperation()      // any function returning a Promise
    const b = anotherAsyncOperation()   // another one
    // Resolve a and b concurrently
    const [x, y] = await Promise.all([a, b])
    // Do something with x and y
You can naturally achieve that with callbacks but there's more boilerplate involved. I'm not familiar with Python so I don't know what it would look like without async.

Edit: I just re-read your comment and the one you were responding to, and do agree that async/await don't "move" things from blocking to non-blocking. It just helps using already non-blocking resources more easily. It will not help you if you're trying to make a large numerical computation asynchronous, for example. In this regard it's very different from Golang's `go`, which will run the computation in a separate goroutine, which itself will run concurrently (with Go's scheduler deciding when to yield), and in parallel if the environment allows it.


As someone who works in both Python and JavaScript regularly, JS’s async is just leagues easier and better. It’s night and day. Even something as simple as new Promise or Promise.all is way more confusing in Python. It’s very different.


A lot of the debate and discussion here seems to come from the fact that the example program demonstrates concurrency across requests (each concurrent request is being handled by a different worker), but no concurrency within each request: The code to serve each request is essentially one straight line of execution, which pauses while it waits for a DB query to return.

A more interesting example would be a request that requires multiple blocking operations (database queries, syscalls, etc.). You could do something like:

    # Non-concurrent approach
    def handle_request(request):
      a = get_row_1()
      b = get_row_2()
      c = get_row_3()
      return render_json(a, b, c)
   

    # asyncio approach
    async def handle_request(request):
      a, b, c = await asyncio.gather(
        get_row_1(),
        get_row_2(),
        get_row_3())
      return render_json(a, b, c)

    # Naive threading approach
    def handle_request(request):
       a_q = queue.SimpleQueue()
       t1 = threading.Thread(target=get_row_1, args=(a_q,))  # pass the callable and its args separately
       t1.start()
       b_q = queue.SimpleQueue()
       t2 = threading.Thread(target=get_row_2, args=(b_q,))
       t2.start()
       c_q = queue.SimpleQueue()
       t3 = threading.Thread(target=get_row_3, args=(c_q,))
       t3.start()

       t1.join()
       t2.join()
       t3.join()

       return render_json(a_q.get(), b_q.get(), c_q.get())


    # concurrent.futures with a ThreadPoolExecutor 
    def handle_request(request, thread_pool):
      a = thread_pool.submit(get_row_1)  # submit the callable, not its result
      b = thread_pool.submit(get_row_2)
      c = thread_pool.submit(get_row_3)
      return render_json(a.result(), b.result(), c.result())
These examples demonstrate what people find appealing about asyncio, and would also tell you more about how choice of concurrency strategy affects response time for each request.


This is a great point, surprised you received no follow-up comments!


Is speed really a good reason for using async? If I remember correctly, asynchronous I/O was introduced to deal with many concurrent clients.

Therefore, I would have liked to see how much memory all those workers use, and how many concurrent connections they can handle.


I think speed is the wrong word here. A better word is throughput.

The underlying issue with python is that it does not support threading well (due to the global interpreter lock) and mostly handles concurrency by forking processes instead. The traditional way of improving throughput is having more processes, which is expensive (e.g. you need more memory). This is a common pattern with other languages like ruby, php, etc.

Other languages use green threads / co-routines to implement async behavior and enable a single thread to handle multiple connections. On paper this should work in python as well, except it has a few bottlenecks, which the article outlines, that result in throughput being somewhat worse than the multi-process, synchronous versions.


I think 'scalability' is the best word here.

Taken from Stephen Cleary's SO answer on this topic: https://stackoverflow.com/a/31192718


> which is expensive (e.g. you need more memory)

Memory is cheap; the cost is in constant de/serialization. Same with "just rewrite the hotspots in C!"-style advice; de/serialization can easily eat anything you saved by multiprocessing/rewriting. Python is a deceivingly hard language, and a lot of this is a direct result of the "all of CPython is the public C-extension interface!" design decision (significant limitations on optimizations => heavy dependency on C-extensions for anything remotely performance sensitive => package management has to deal extensively with the nightmare that is C packaging => no meaningful cross-platform artifacts or cross compilation => etc).


Memory is not cheap when dealing with the real-world cost of deploying a production system. The pre-fork worker model used in many sync cases is very resource intensive, and depending on the number of workers you're probably paying a lot more for the box it's running on. Of course this is different if you're running on your own metal, but I have other issues with that.


> Memory is not cheap when dealing the real world cost of deploying a production system.

What? What makes you say that? What did you think I was talking about if not a production system? To be clear, we're talking about the overhead of single-digit additional python interpreters unless I'm misunderstanding something...


Observed costs from companies running the pre-fork worker model vs alternative deployment methods. And just in this benchmark they're running double-digit interpreter counts, which in my experience is the more common (and more expensive) case.


Double-digit interpreters per host? Where is the expense? Interpreters have a relatively small memory overhead (<10mb). If you're running 100 interpreters per host (you shouldn't be), that's an extra $50/host/year. But you should be running <10/host, so an extra $5/host/year. Not ideal, but not "expensive", and if you care about costs your biggest mistake was using Python in the first place.


I don't know where you're seeing the <10mb; in the situation I saw, they were easily consuming 30mb per interpreter. Even a cursory search now shows them at roughly 15-20mb, so even assuming the 30mb Gunicorn was just misconfigured, that's still an extra $100 per host using your estimate and what I'm seeing from Googling around. Across a situation where there are multiple public APIs, that adds up pretty quickly.

Another Google search shows that Gunicorn, for instance, using high memory on fork isn't exactly uncommon either.

Edit: I reworded some stuff up there and tried to make my point more clear.


The interpreter overhead on macos is 7.7mb. I can't speak to gunicorn configuration but it's far from the only game in town.


Totally fair point, my experience with fork type deploys has only been Gunicorn so I'll take this as a challenge to try some others out.


Yes, C dependency management is awful, and because Python is only practical with C extensions for performance critical code, it ends up being a nightmare as well.


In our use case, switching to asyncio is like moving from 12 cores to 3... (and I'm pretty sure we are handling more concurrency: from 24-30 req/s to 150 req/s). But our workload is mostly network related (db, external services...).


same.

maybe author is concerned that many people are jumping the gun on async-await before we all fully understand why we need it at all. and that's true. but that paradigm was introduced (borrowed) to solve a completely different issue.

i would love to see how many concurrent connections those sync processes handle.


Hi - not sure what you mean by this. The sync workers handle one request (to completion) per worker. So 16 workers means 16 concurrent requests. For the async workers it's different - they do more concurrently - but as discussed their throughput is not better (and latency much worse).

Maybe what you're getting at is cases where there are a large number of (fairly sleepy) open connections? Eg for push updates and other websockety things. I didn't test that I'm afraid. The state of the art there seems to be using async and I think that's a broadly appropriate usage though that is generally not very performance sensitive code except that you try to do as little as possible in your connection manager code.


In the case of everything working smoothly that model may play out. But if you get a client that times out, or worse, a slow connection then they used one of your workers for a long time in a synchronous model. In the async model this has less of a footprint as you are still accepting other connections despite the slow progress of one of the workers.


yes many open connections is what i meant (suggested by other people as well). by the way, i really liked the writing, it's refreshing. and i agree with you that people aren't using async for the right reasons.


Thanks :), really appreciate that. I think all technology goes through a period of wild over-application early on. My country is full of (hand-dug) canals, for example.


I find it interesting that all the talk here is about performance, and nobody has mentioned any benefits of Async Python when performance isn't an issue.

I use trio/asyncio to more easily write correct complex concurrent code when performance doesn't matter. See "The Problem with Threads"[1].

For this use case, Async Python probably still isn't faster, but that doesn't matter. Let's not throw out the baby with the bathwater :)

[1] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-...


I love asyncio for writing mixed initiative "servers". For instance, I have an asyncio "server" that accepts websocket connections on one side, waits on an AQMP queue, proxies requests and mediates for the HEOS smart speaker API, Phillips Hue, U.S. Weather Service, etc.

This is great for React or Vue front-end applications which get their state updated when things happen in the outside world (e.g. somebody else starts the music player, and that gets relayed).

When CPU performance is an issue (say generate a weather video from frames) you want to offload that into another process or thread, but it is an easy programming style if correctness matters.


That sounds a lot like home-assistant :)


It is a little bit, except this one is customizable, maintainable, and not phonish in any way. In particular, there is no "one ring to rule them all" App but rather there are very simple one-task applications (put one button to pair the left/right computer to the soundbar via Optical or Coax) and also some applications that are highly complex (e.g. multiple windows)


What's the point of writing concurrent code if it's not faster?


Contrasting with jdlshore, concurrency can make programs much easier to reason about, when done well. This is a benefit of both Go and Erlang, though they use different approaches.

Concurrency can help you separate out logic that is often commingled in non-concurrent code, but doesn't need to be. As a real-world example, I used to do safety critical systems for aircraft. The linear, non-concurrent version included a main loop that basically executed a couple dozen functions. Each function may or may not have dependencies on the other functions, so information was passed between them over multiple passes through this main loop (as their order was fixed) using shared memory.

A similar project had about a dozen processes, each running concurrently. There was no speed improvement, but the connection between each activity was handled via channels (equivalent in theory to Go's channels, less like Erlang's mailboxes as the channels could be shared). We knew it was correct because each process was a simple state machine, separated cleanly from all other state machines.

The second system's code was much simpler, there was no juggling (in our code) of the state of the system, compared to managing the non-concurrent logic. If a channel had data to be acted on, the process continued, otherwise it waited. Very simple. And it turns out that many systems can be modeled in a similar fashion (IME). Of course, we had a very straightforward communication mechanism (again, essentially the same as Go channels except it was a library written in, as I recall, Ada by whoever made the host OS).


Signals are not dependent on concurrency. And you don't need multiple processes to implement a state machine.

I mean, think about it. What's the difference between sending message A and then message B versus sending messages A and B into a queue and letting some async process pop from it? Less complexity and guaranteed message delivery come for free in single-threaded code.

Am I wrong? What am I missing?


I don't think you're wrong, but in Jtsummers' specific case, I think multi-processing probably would be simpler. You don't have to implement the event loop, there's no risk of tromping on other processes' data, and if a process gets into an invalid state, you can just die without impacting others.

You'd need a good watchdog and error handling, but presumably some of that came for "free" in their environment.

Although if you take out the "free" OS support, watchdog, etc., I agree that there's likely a place between "shared memory spaghetti" and "multi-processing" that's simpler than both.


Exactly this. I had started my own reply and refreshed and saw yours, thanks.

The other benefit of the concurrent design (versus the single-threaded version) was that it was actually much simpler. This was critical for our field because that system is still flying, now 12 years later, and will probably be flying for another 30-50 years. The single-threaded system was unnecessarily complex. Much of the complexity came from having to include code to handle all the state juggling between the separate tasks, since each had some dependency on each other (not a fully connected graph, but not entirely disconnected either). The concurrent design made it trivial to write something very close to the most naive version possible, where waiting was something that only happened when external input was needed. So the coordination between each task just fell out naturally.

You still have to care about locking the system up, but in our case, because each process was sufficiently reduced to its essentials, this was easy to evaluate and reason about.


"some async process" is a concurrency mechanism, is it not?


It is. The single-threaded example comes before the "versus". The async example comes after. I should have been more clear.


Ah, indeed misread that. Then my answer is: Singlethreaded code sometimes has to implement things an async environment would handle for you.

I.e. when handling many in- and outputs I can write my own loop around epoll etc., write logic to keep track of queues of data to send per-target, etc. Or I can use a runtime that provides that for me and lets me mostly pretend things are running on their own.
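As a toy illustration of that trade-off (hypothetical code, nothing from this thread): a minimal asyncio echo server where the runtime does the readiness/epoll/buffer bookkeeping a hand-rolled loop would have to manage itself.

    import asyncio

    async def handle(reader, writer):
        # echo each chunk back; readiness and buffering are the runtime's job
        while True:
            data = await reader.read(1024)
            if not data:
                break
            writer.write(data)
            await writer.drain()  # cooperative flow control
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())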


Concurrency is notoriously difficult to reason about. Concurrency bugs are also a f__king nightmare to debug.

Given how slow I/O operations are, and how much modern code depends on the network, we typically need some concurrency in our code. So for me, almost always, the question isn't, "which concurrency choice is fastest?" but rather, "which concurrency choice is fast enough while leading to code with the least bugs?"


If you are I/O bound, concurrency has a use case. I don't argue against it. I'm pointing out that it's pointless to write concurrent code if you don't expect a performance benefit from it.

It's like multi-threading 2+2.



I use async for UI work, but don't have much of an opinion for servers.

I suspect that the best async is that supported by the server OS, and the more efficiently a language/compiler/linker integrates with that, the better. JIT/interpreted languages introduce new dimensions that I have not experienced.

I do have some prior art in optimizing libraries, though. In particular, image processing libraries in C++. My opinion is that optimization is sort of a "black art," and async is anything but a "silver bullet." In my experience, "common sense" is often trumped by facts on the ground, and profilers are more important than careful design.

I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion, as you have the same timeline as sync, but with thread management overhead.

There are also hardware issues that come into play, like L1/2/3 caches, resource contention, look-ahead/execution pipelines and VM paging. These can have massive impact on performance, and are often only exposed by running the app in-context with a profiler. Sometimes, threading can exacerbate these issues, and wipe out any efficiency gains.

In my experience, well-behaved threaded software needs to be written, profiled and tuned, in that order. An experienced engineer can usually take care of the "low-hanging fruit," in design, but I have found that profiling tends to consistently yield surprises.

T.A.N.S.T.A.A.F.L.


Probably the most interesting new concept that I've come across is Linux's io_uring, which uses ring buffers to asynchronously submit and receive kernel I/O calls.

While Windows has had asynchronous I/O for ages, it's still one kernel transition per operation, whereas Linux can batch these now.

I suspect that all the CPU-level security issues will eventually be resolved, but at a permanently increased overhead for all user-mode to kernel transitions. Clever new API schemes like io_uring will likely have to be the way forward.

I can imagine a future where all kernel API calls go through a ring buffer, everything is asynchronous, and most hardware devices dump their data directly into user-mode ring buffers by default without direct kernel involvement.

It's going to be an interesting new landscape of performance optimisation and language design!


> profilers are more important than careful design.

> I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion

But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?

I would expect profiling to mostly lead to micro-optimisations, e.g. combining or splitting the time a lock is taken, but when you're still designing you can look at avoiding as much need for synchronization as possible, e.g. sharing data copy-on-write (not requiring locks as long as you have a reference) instead of having to lock the data when accessing it.

As another commenter says

> with asyncio we deploy a thread per worker (loop), and a worker per core. We also move cpu bound functions to a thread pool

You can't easily go from e.g. thread-per-connection to a worker pool; that should have been caught during design.


> But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?

Yes and no. Again, I have not profiled or optimized servers or interpreted/JIT languages, so I bet there's a new ruleset.

Blocking can come from unexpected places. For example, if we use dependencies, then we don't have much control over the resources accessed by the dependency.

Sometimes, these dependencies are the OS or standard library. We would sometimes have to choose alternate system calls, as the ones we initially chose caused issues which were not exposed until the profile was run.

In my experience, the killer for us was often cache-breaking. Things like the length of the data in a variable could determine whether or not it was bounced from a register or low-level cache, and the impact could be astounding. This could lead to remedies like applying a visitor to break up a [supposedly] inconsequential temp buffer into cache-friendly bites.

Also, we sometimes had to recombine work that we had sent to threads, because splitting it caused cache misses.

Unit testing could be useless. For example, the test images that we often used were the classic "Photo Test Diorama" variety, with a bunch of stuff crammed onto a well-lit table, with a few targets.

Then, we would run an image from a pro shooter, with a Western prairie skyline, and the lengths of some of the convolution target blocks would be different. This could sometimes cause a cache miss, with a demotion of a buffer. This taught us to use a large pool of test images, which was sometimes quite difficult. In some cases, we actually had to use synthesized images.

Since we were working on image processing software, we were already doing this in other work, but we learned to do it in the optimization work, too.

When my team was working on C++ optimization, we had a team from Intel come in and profile our apps.

It was pretty humbling.


Cooperative multitasking came out slower than preemptive in the nineties, so this is unsurprising in the generic case.

I think my question is whether async Python is slower in the case it was designed for -- many, long-running open sockets.

Async was traditionally used server-side for things like chat servers, where I might have millions of sockets simultaneously open.


> Cooperative multitasking came out slower than preemptive in the nineties

This wasn't really the reason for the shift away from cooperative multitasking, it was really because cooperative multitasking isn't as robust or well behaved unless you have a lot of control over what tasks you have trying to run together.

In theory cooperative multitasking should have better throughput (latency is another story) because each task can yield at a point where its state is much simpler to snapshot rather than having to do things like record exact register values and handle various situations.


... I never meant to imply that performance was the reason for the switch.

We've had a track record of technologies which:

1) Automated things (relieving programmers from thinking about stuff)

2) Were expected to make stuff slower

3) In reality, sped stuff up, at least in the typical case, once algorithms got smart

That's true for interpreted/dynamic languages, automated memory management/garbage collection, managed runtimes of different sorts, high-level descriptive languages like SQL, etc.

Sometimes, it took a lot of time to figure out how to do this. Interpreters started out an order-of-magnitude or more slower than compilers. It took until we had bytecode+JIT for performance to roughly line up. Then, once we started doing profiling / optimization based on data about what the program was actually doing, and potentially aligning compilation to the individual user's hardware, things suddenly got a smidgeon faster than static compilers.

There is something really odd to me about the whole async thing with Python. Writing async code in Python is super-manual, and I'm constantly making decisions which ought to be abstracted away for me, and where changing the decisions later is super-expensive. I'd like to write.


> In reality, sped stuff up ... That's true for interpreted/dynamic languages, automated memory management/garbage collection, managed runtimes of different sorts, high-level descriptive languages like SQL, etc.

None of that is true.

Even SQL modeling declarative work in the form of queries requires significant tuning all the time.

The rest of the list is egregious.

> things suddenly got a smidgeon faster than static compilers.

No, they did not.


> It took until we had bytecode+JIT that performance roughly lined up.

It really didn't. Yes, in highly specialized benchmark situations, JITs sometimes manage to outperform AOT compilers, but not in the general case, where they usually lag significantly. I wrote a somewhat lengthy piece about this, Jitterdämmerung:

https://blog.metaobject.com/2015/10/jitterdammerung.html

Discussed at the time:

https://news.ycombinator.com/item?id=10344601


Well, if you wanna go that route, in the general case, code will be structured differently. On one side, you have duck typing, closures, automated memory management, and the ability to dynamically modify code.

On the other side, you don't.

That linguistic flexibility often leads to big-O level improvements in performance which aren't well-captured in microscopic benchmarks.

If the question is whether GC will beat malloc/free when translating C code into a JIT language, then yes, it will. If the question is whether malloc/free will beat code written assuming memory will get garbage collected, it becomes more complex.


Objective-C has duck typing (if you want), closures, automated memory management and the ability to dynamically modify code.

And is AOT compiled.

GC can only "beat" malloc/free if it has several times the memory available, and usually also only if the malloc/free code is hopelessly naive.

And you've got the micro-benchmark / real-world thing backward: it is JITs that sometimes do really well on microbenchmarks but invariably perform markedly worse in the real world. I talk about this at length in my article (see above).


> That's true for interpreted/dynamic languages, automated memory management/garbage collection, managed runtimes of different sorts, high-level descriptive languages like SQL, etc.

Of the things you mention, I agree on SQL, and "managed runtimes" is generic enough that I cannot really judge.

I'm thoroughly unconvinced about the rest being faster than the alternatives (and that's why you don't see many SQL servers written in interpreted languages with garbage collection).


Well, I think you missed part of what I said: "at least in the typical case" (which is fair -- it was a bit hidden in there)

There's a big difference between normal code and hand-tweaked optimized code. SQL servers are extremely tuned, performant code. Short of hand-written assembly tuned to the metal, little beats hand-optimized C.

I was talking about normal apps. If I'm writing a generic database-backed web app, a machine learning system, or a video game. Most of those, when written in C, are finished once they work, or at the very most have some very basic, minimal profiling / optimization.

For most code:

1) Expressing that in a high-level system will typically give better performance than if I write it in a low-level system for V0, the stage I first get to working code (before I've profiled or optimized much). At this stage, the automated systems do better than most programmers do, at least without incredible time investments.

2) I'll be able to do algorithmic optimizations much more quickly in a high-level programming language than in C. With a reasonably bounded investment of time, my high-level code tends to be faster than my low-level code -- I'll have the big-O level optimizations finished in a fraction of the time, so I can do more of them.

3) My low-level code gets to be faster once I get into a very high level of hand-optimization and analysis.

Or in other words, I can design memory management better than the automated stuff, but my get-the-stuff-working level of memory management is no longer better than the automated stuff. I can design data structures and algorithms better than PostgreSQL specific to my use case, but those won't be the first ones I write (and in most cases, they'll be good enough, so I won't bother improving them). Etc.


I am sorry to be blunt, but that sounds like a PR statement filled with nonsense.

> If I'm writing a generic database-backed web app

If you are writing a system where performance does not matter, then performance does not matter.

> a machine learning system or a video game. Most of those, when written in C, are finished once they work, or at the very most have some very basic, minimal profiling / optimization.

Wait, what? ML engine backends and high-level descriptions, and video games are some of the most heavily tuned and optimized systems in existence.

> At this stage, the automated systems do better than most programmers do, at least without incredible time investments.

General-purpose JIT languages are so far from being an actual high-level declarative model of computation that the JIT compiler cannot perform any kind of magic of the kind you are describing.

Even actual declarative, optimizable models such as SQL or Prolog require careful thinking and tuning all the time to make the optimizer do what you want.

> 2) I'll be able to do algorithmic optimizations much more quickly in a high-level programming language than in C.

C is not the only low-level AOT language. C is intentionally a tiny language with a tiny standard library.

Take a look at C++, D, Rust, Zig and others. In those, changing a data structure or algorithm is as easy as in your usual JIT one like C#, Java, Python, etc.

> 3) My low-level code gets to be faster once I get into a very high level of hand-optimization and analysis.

You seem to be implying that a low-level language disallows you from properly designing your application. Nonsense.

> I can design memory management better than the automated stuff, but my get-the-stuff-working level of memory management is no longer better than the automated stuff

You seem to believe low-level programming looks like C kernel code of the kind of a college assignment.


> If you are writing a system where performance does not matter, then performance does not matter.

It's not binary. Performance always matters, but there are different levels of value to that performance. Writing hand-tweaked assembly code is rarely a good point on the ROI curve.

> Wait, what? ML engine backends and high-level descriptions, and video games are some of the most heavily tuned and optimized systems in existence.

Indeed they are. And the major language most machine learning researchers use is Python. There is highly-optimized vector code behind the scenes, which is then orchestrated and controlled by tool chains like PyTorch and Python.

> Take a look at C++, D, Rust, Zig and others. In those, changing a data structure or algorithm is as easy as in your usual JIT one like C#, Java, Python, etc.

I used to think that too before I spent years doing functional programming. I was an incredible C++ hacker, and really prided myself on being able to implement things like highly-optimized numerical code with templates. I understood every nook and cranny of the massive language. It actually took a few years before my code in Lisp, Scheme, JavaScript, and Python stopped being structured like C++.

You putting "Python" and "Java" in the same sentence shows this isn't a process you've gone through yet. Java has roughly the same limitations as C and C++. Python and JavaScript, in contrast, can be used as a Lisp.

I'd recommend working through SICP.

> You seem to be implying that a low-level language disallows you from properly designing your application. Nonsense.

Okay: Here's a challenge for you. In Scheme, I can write a program where I:

1) Write the Lagrangian, as a normal Scheme function. (one line of code)

2) Take a derivative of that, symbolically. (it passes in symbols like 'x and 'y for the parameters). I get back a Scheme function. If I pretty-print that function, I get an equation rendered in LaTeX

3) Compile the resulting function into optimized native code

4) Run it through an optimized numeric integrator.

This is all around 40 lines of code in MIT-Scheme. Oh, and on step 1, I can reuse functions you wrote in Scheme, without you being aware they would ever be symbolically manipulated or compiled.

If you'd like to see how this works in Scheme, you can look here:

https://mitpress.mit.edu/sites/default/files/titles/content/...

That requires being able to duck type, introspect code, have closures, GC, and all sorts of other things which are simply not reasonably expressible in C++ (at least without first building a Lisp in C++, and having everything written in that DSL).

The MIT-Scheme compiler isn't as efficient as a good C++ compiler, so you lose maybe 10-30% performance there. And all you get back is a couple of orders of magnitude for (1) being able to symbolically convert a high-level expression of a dynamic system to the equations of motion suitable for numerical integration (2) compile that into native code.

(and yes, I understand C++11 kinda-added closures)


> And the major language most machine learning researchers use is Python.

Read again what I wrote. Even the model itself is optimized. The fact that it is written in Python or in any DSL is irrelevant.

> I used to think that too before I spent years doing functional programming.

I have done functional programming in many languages, ranging from lambda calculus itself to OCaml to Haskell, including inside and outside academia. It does not change anything I have said.

Perhaps you spent way too many years in high-level languages that you have started believing magical properties about their compilers.

> prided myself on being able to implement things like highly-optimized numerical code with templates.

Optimizing numerical code has little to do with code monomorphization.

It does sound like you were abusing C++ thinking you were "optimizing" code without actually having a clue.

Like in the previous point, it seemed you attributed magical properties to C++ compilers back then, and now you do the same with high-level ones.

> It actually took a few years before my code in Lisp, Scheme, JavaScript, and Python stopped being structured like C++.

How do you even manage write code in Lisp etc. "like C++"? What does that even mean?

> You putting "Python" and "Java" in the same sentence shows this isn't a process you've gone through yet. Java has roughly the same limitations as C and C++.

Pure nonsense. Java is nowhere close to C or C++.

> Here's a challenge for you.

I would use Mathematica or Julia for that. Not Scheme, not C++. Particularly since you already declared the last 30% of performance is irrelevant.

You are again mixing up domains. You are picking a high-level domain and then complaining that a low-level tool does not fit nicely. That has nothing to do with the discussion, and we could apply that flawed logic to back any statement we want.


> Perhaps you spent way too many years in high-level languages that you have started believing magical properties about their compilers.

> It does sound like you were abusing C++ thinking you were "optimizing" code without actually having a clue.

> Like in the previous point, it seemed you attributed magical properties to C++ compilers back then, and now you do the same with high-level ones.

I think at this point, I'm checking out. You're making a lot of statements and assumptions about who I am, what my background is, what I know, and so on. I neither have the time nor the inclination to debunk them. You don't know me.

When you make it personal and start insulting people, that's a good sign you've lost the technical argument. Technical errors in your posts highlight that too.

If you do want to have a little bit of fun, though, you should look up the template-based linear algebra libraries of the late nineties and early 00's. They were pretty clever, and for a while, were leading in the benchmarks. They would generate code, at compile time, optimized to the size of your vectors and matrixes, unroll loops, and similar. They seem pretty well-aligned to your background. I think you'll appreciate them.


Yes, the whole hoopla about async and particularly async/await has been a bit puzzling, to say the least.

Except for a few very special cases, it is perfectly fine to block on I/O. Operating systems have been heavily optimized to make synchronous I/O fast, and can also spare the threads to do this.

Certainly in client applications, where the amount of separate I/O that can be usefully accomplished is limited, far below any limits imposed by kernel threads.

Where it might make sense is servers with an insane number of connections, each with fairly low load, i.e. mostly idle, and even in server tasks quality of implementation appears to far outweigh whether the server is synchronous or asynchronous (see attempts to build web servers with Apple's GCD).

For lots of connections actually under load, you are going to run out of actual CPU and I/O capacity to serve those threads long before you run out of threads.

Which leaves the case of JavaScript being single threaded, which admittedly is a large special case, but no reason for other systems that are not so constrained to follow suit.


> Function colouring is a big problem in Python

Not when you know how to call sync functions from async functions and vice versa.

A sync function can call an async function via:

  loop = asyncio.new_event_loop()
  result = loop.run_until_complete(asyncio.ensure_future(red(x)))
An async function can call a sync function via:

  loop = asyncio.get_event_loop()
  result = await loop.run_in_executor(None, blue, x)
Where red and blue are defined as:

  async def red(x):
      pass

  def blue(x):
      pass
Note that the documentation is wrong about recommending create_task over ensure_future. That recommendation results in more restrictive code as create_task only accepts a coroutine and not a task.

This works for regular functions; I don't know how it works for generators.


You perfectly illustrated why this is a problem. Calling functions from one side to the other involves ceremony. Ceremony adds cognitive overhead and decreases readability.


Would that work for you?

    result = asyncio.run(red(x))
That's calling async function red from non-async code.

Not seeing any particular readability issue with that usage either. If you don't call asyncio.run or await on the result of an async function call, then you get a coroutine for result.


> result = asyncio.run(red(x))

Still much harder to think about/read than

   result = red(x)
We should be finding ways to get to the latter with concurrency. async/await is at best a patchwork compromise until we can do better.


I much prefer the performance implications of my code to be explicit rather than implicit.

Async in this case has less ceremony than threads. But still enough to make things explicit.


I've written Python functions that "call" another function either async or not, depending on what inspecting the function reveals.

For instance, imagine a "maybe_await" helper that just calls the function if it's synchronous, or otherwise awaits it.
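A rough sketch of what such a helper could look like on the async side (hypothetical code, not anyone's production implementation):

    import inspect

    async def maybe_await(func, *args, **kwargs):
        result = func(*args, **kwargs)
        if inspect.isawaitable(result):
            # async def functions hand back a coroutine, which we await;
            # plain functions already returned their value
            result = await result
        return result

    # usage (inside async code): value = await maybe_await(callback, 42)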


That's trivial:

    import asyncio
    import inspect

    result = yourfunc()
    if inspect.iscoroutine(result):
        result = asyncio.run(result)
If I have code like that, it's in one place per library; I didn't need to encapsulate it in a maybe_await function.


It's very heavy to do this, is it not? Like, you inspect the function on each call to figure out if it's async or not?


For things that happen at UI speed it isn't bad. If I wanted to do something a million times a second I'd worry about it.

Some Javascript frameworks, such as Vue, often do something similar in that you can pass either a sync or async callback and it does the right thing for either. In that case you could potentially inspect the function once and call it many times.


You don't inspect the function, but the result. Is it a coroutine? If so, it needs to be awaited (see my example above).

I believe this would also allow a non-async function to return a coroutine, I suppose.

Anyway, in this case chances are there will be no real performance overhead: if there's any I/O-bound operation running in the coroutine, the iscoroutine check on the result and the await call cost next to nothing by comparison.


Writing such code to call between sync/async makes me cry man - this is so ugly. I'd still consider it a problem.


Actually Python now offers an asyncio.run function. Your example may now be:

result = asyncio.run(red(x))


Alternatively, one can use gevent and get transparent async I/O from a modified runtime - something that a high-level language should've provided out of the box.
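To make the gevent approach concrete, a minimal sketch (assumes gevent is installed; the URL is only a placeholder). Monkey-patching the standard library makes ordinary blocking calls yield to gevent's event loop, so sync-looking code runs cooperatively:

    from gevent import monkey
    monkey.patch_all()  # must run before importing modules that use sockets

    import gevent
    import urllib.request

    def fetch(url):
        # looks like plain blocking code, but the patched socket yields
        with urllib.request.urlopen(url) as resp:
            return resp.status

    jobs = [gevent.spawn(fetch, "https://example.com") for _ in range(10)]
    gevent.joinall(jobs)
    print([job.value for job in jobs])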


Hiding awaitables from the language sounds like it goes against the zen (explicit is better than implicit).

For example, when someone accesses a descriptor in Django, this could end up being a query to the db - transparent, but dangerous. With asyncio you explicitly await something to return execution to the event loop.

At least to me that sounds like safer behaviour.


Hiding awaits would be against the zen. As you explained, it's a behaviour difference that might matter. The await forces you to think about it, and that's safer.

But the difference between asyncio.run(red(x)) and blue(x)... isn't. There's no difference which matters. They are just different implementations of the same behaviour.

If red and blue are both the same DB query, with the only difference being red is async-style and blue sync-style, these two lines have exactly the same program behaviour:

    result = asyncio.run(red(x))
and

    result = blue(x)
So the asyncio.run is just cognitive fog. It forces you to think about the type difference, but doesn't add any safety.

It's almost the opposite of Python's usual duck-typing parsimony, which normally allows equivalent things to be used in place of each other without ceremony.


> Hiding awaitables from the language, sounds like against the zen (explicit better than implicit)

Zen is not respected by explicit asyncio; just try to compose asyncio with iterators [1]

[1] https://stackoverflow.com/questions/42448664/async-generator...

This problem doesn't exist with gevent, and composability is a desired thing in any programming language. Python's asyncio fractured the community that was previously doing implicit async I/O with sync interfaces, and the current state of the API is not an example of composable primitives that follow the Zen of Python:

> Beautiful is better than ugly.

> Simple is better than complex.

> Readability counts.

> Special cases aren't special enough to break the rules.


Read that thread in full; it's also a case of explicit is better than implicit (async for vs for loops, essentially).


Should I start changing my code from "foo.bar" to "foo.__getattribute__('bar')"? Probably a bad comparison, but I'm looking for someone to tell me why.

Meanwhile, my pretty python foo = bar().something has gone all foo = (await bar()).something


What does gevent do - give Python something similar to Goroutines?


yes, pretty much, with a few specifics - https://sdiehl.github.io/gevent-tutorial/#greenlets


Async Python is faster when you use it for running parallel tasks. In this benchmark, you are running a single database query per request, so there is no advantage to being asynchronous: a pool of processes will scale just as well (but it will use more memory). The point of async is that it lets you easily make a Postgres query, AND an HTTP query, AND a Redis query in parallel.


Couldn’t threads handle that use case?


Yes they can. But threads are a pain to work with in python, as compared to async.


One big difference between one-thread-per-request and single-threaded async code is that synchronization and accessing shared resources is trivial when all of your code is running on a single thread.

An entire category of data races like `x += 1` becomes impossible without you even thinking about it. And that's often worth it for something like a game server where everything is beating on the same data structures.

I don't use Python, so I guess it's less of an issue in Python, since you're spawning multiple processes rather than multiple threads, so you're already having to share data via something out-of-process like Redis and rely on its own synchronization guarantees.

But for example the naive Go code I tend to read in the wild always has data races here and there since people tend to never go 100% into a channel / mutex abstraction (and mutexes are hard). And that's not a snipe at Go but just a reminder of how easy it is to take things for granted when you've been writing single-threaded async code for a while.
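To illustrate the `x += 1` point, a tiny hypothetical sketch: with threads, the read-modify-write can interleave and lose updates, whereas on a single event-loop thread the same statement cannot be interrupted by another task between awaits.

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1  # not atomic: load, add, store can interleave

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # may be less than 400000 because updates get lost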


FWIW, Rust gives you the same simplicity (no data races at runtime) with threads as well.

(Not necessarily on topic, but if you’re really excited about dodging data races, I figured it would give you something fun to look at!)


Not in the same way, though: it catches the possibility of data races and forces you to rewrite until all the memory accesses are safe. That's more complex to program; you might need to redesign some of your data structures, for example.


This reminds me of Rob Pike’s Go talk about how concurrency is not parallelism. I think the Python community may be hitting this issue, where async is meant to model concurrent behavior, not always or necessarily to facilitate parallel activity.


I think a good chunk of Python developers expected (expect?) async to be a "get out of GIL free card". It's not.


Techempower [1] has a really great collection of benchmarks using highly controlled test setups that I like to look at to compare web frameworks. Not affiliated with them, but it's relevant to the post.

[1] https://www.techempower.com/benchmarks/#section=data-r19&hw=...


Async is useful for high I/O where you may have a lot of downtime between the requests. Are you pulling many requests from different servers with different response times, communicating with a db, or pulling out large response bodies? Async is probably going to do better, since each of those, done synchronously, represents potentially large idle periods where other requests could have gotten work done.

As to the article, the comparisons are good but fail to mention resource constraints. With Gunicorn, forking 16 instances is going to be a lot heavier on memory, so for a little more RPS you're probably spending a decent chunk of change more to run your workload. I don't think that's worth it, considering the async model in Python is pretty easy to grok these days and under this benchmark shares a similar performance profile.

Now, that said, if I had to guess these numbers are fine for the average API, but if you're doing something like high-throughput web crawling, or need to serve something on the order of tens of thousands to hundreds of thousands of RPS, async will win out on speed, resource use and ultimately cost.

Plus at one point they were like "we could only get an 18% speed up with Vibora" - I haven't used it myself, but an 18% performance increase at really any level of load is fantastic. Hand-waving that off tells me the workloads considered "realistic" don't take into account real high-RPS workloads like you might see at major tech companies.


> forking 16 instances is going to be a lot heavier on memory

It really depends on how the application is designed. Fork operates through mmap and copy-on-write. It's extremely lightweight by default.

A well-designed fork-based application will already have loaded everything necessary to run a given worker into memory; it won't munge any of the existing shared memory, and will only allocate and free memory associated with new events/connections/etc.

When programmed that way, individual forks are incredibly light on resources. All the workers are sharing the exact same core application code and logic in memory.


"All the workers are sharing the exact same core application code and logic in memory."

Oh interesting, are you saying an intelligent forking implementation is able to share static portions of memory with multiple children?

I was perhaps under the naive assumption forking was pretty much just a full memory copy of the parent.


Yep, but that's simply how the linux kernel works [1]. As a programmer you need to essentially load up all your modules/libraries/data/etc that you will need into memory before the fork, and treat the forked() processes as read-only as much as possible from a resource perspective. If you modify anything from the parent, you get your own page as soon as that happens. [2][3]

[1] https://www.informit.com/articles/article.aspx?p=368650

[2] https://en.wikipedia.org/wiki/Copy-on-write

[3] https://en.wikipedia.org/wiki/Fork_(system_call)
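A tiny POSIX-only sketch of that preload-then-fork pattern (hypothetical; gunicorn's --preload option does essentially this for application code). One CPython caveat: reference-count updates write into object headers, so even "read-only" access dirties some shared pages over time.

    import os

    BIG = list(range(5_000_000))  # loaded once in the parent, shared via COW

    children = []
    for _ in range(4):
        pid = os.fork()
        if pid == 0:
            # child: reading BIG mostly touches shared pages;
            # writing to it would trigger per-page copies
            total = sum(BIG)
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)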

