Is parallel programming hard, and, if so, what can you do about it? (kernel.org)
184 points by nequo 11 months ago | 191 comments



I see a lot of confusion between parallel programming [1] and concurrent programming [2] in the comments here.

The former, which is what this book is about, deals with the problem of parallelizing a single sequential program. There are usually strong interactions or dependencies between elements, and progress needs synchronization. E.g. timestep iterations in real-time simulations that need synchronization with data communication after each timestep. These simulations also tend to get way too big to be run on a single machine, let alone a single thread, and get scaled up to millions of cores/threads in supercomputers.

Concurrent programming is what most developers working with the internet are more familiar with. You have mostly independent tasks that you want to run concurrently. "A concurrent system is one where a computation can advance without waiting for all other computations to complete." [2] E.g. nginx serving thousands of user requests at the same time.

The problem domains have a lot of overlap on the basics (e.g. threading), however the focus is very different. Things like synchronization (mutexes, barriers), cache locality, and memory bandwidth & latency play a central role in parallel programming, while concurrent programming focuses more on the engineering challenge of distributing independent tasks across multiple threads or machines.

[1] https://en.wikipedia.org/wiki/Parallel_computing

[2] https://en.wikipedia.org/wiki/Concurrent_computing
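To make the distinction concrete, here is a minimal sketch (my illustration, not from the book) of the "parallel programming" side: a toy 1-D heat-diffusion simulation where every timestep depends on the previous one, so all threads have to synchronize at the end of each step. Grid size, step count, and the use of OpenMP are arbitrary choices.

    // Toy data-parallel timestep loop: work inside a step is split across
    // threads, but the steps themselves are strictly sequential.
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        const int n = 1000000;      // grid cells (arbitrary)
        const int steps = 100;      // timesteps (arbitrary)
        std::vector<double> cur(n, 0.0), next(n, 0.0);
        cur[n / 2] = 1.0;           // initial heat spike

        for (int t = 0; t < steps; ++t) {
            #pragma omp parallel for
            for (int i = 1; i < n - 1; ++i)
                next[i] = cur[i] + 0.25 * (cur[i - 1] - 2.0 * cur[i] + cur[i + 1]);
            // The implicit barrier at the end of the parallel for means no thread
            // starts timestep t+1 before every cell of timestep t is done.
            std::swap(cur, next);
        }
        std::printf("centre value after %d steps: %f\n", steps, cur[n / 2]);
    }

Compile with -fopenmp to split the inner loop across cores; without the flag the pragma is ignored and the same code runs sequentially.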


The book covers it in Appendix A.6 (p 424) in the v2023.06.11a PDF file.

> A.6 What is the Difference Between “Concurrent” and “Parallel”?

> From a classic computing perspective, “concurrent” and “parallel” are clearly synonyms. However, this has not stopped many people from drawing distinctions between the two, and it turns out that these distinctions can be understood from a couple of different perspectives.

> The first perspective treats “parallel” as an abbreviation for “data parallel”, and treats “concurrent” as pretty much everything else. From this perspective, in parallel computing, each partition of the overall problem can proceed completely independently, with no communication with other partitions. In this case, little or no coordination among partitions is required. In contrast, concurrent computing might well have tight interdependencies, in the form of contended locks, transactions, or other synchronization mechanisms.

> This of course begs the question of why such a distinction matters, which brings us to the second perspective, that of the underlying scheduler. Schedulers come in a wide range of complexities and capabilities, and as a rough rule of thumb, the more tightly and irregularly a set of parallel processes communicate, the higher the level of sophistication required from the scheduler. As such, parallel computing’s avoidance of interdependencies means that parallel-computing programs run well on the least-capable schedulers. In fact, a pure parallel-computing program can run successfully after being arbitrarily subdivided and interleaved onto a uniprocessor. In contrast, concurrent computing programs might well require extreme subtlety on the part of the scheduler.


Well, funnily enough this does read in contrast to the definitions used in Wikipedia, which are the ones I am also familiar with (I also do teach a class called "Parallel Programming" to graduates).

I do think the differentiation makes sense from the perspective of problem classes, as is also evident from the comments here. Running independent problems in parallel to better utilize hardware resources is very different from running problems in parallel in timesteps that have strong dependencies with regard to the progress of the overall computation. And that's not a problem of the scheduler, but a much more general concept.

It doesn't sound to me like the author has the whole web service parallelism/concurrency in mind that is very apparent in the comments here.


The definition found in Wikipedia, like many contentious subjects in programming, was written by people with a strong political agenda and very little respect for the matter being described.

This applies to all sorts of ambiguous terms used very generously in the witchcraft of "applied computer science". Other examples include "object-oriented programming", "statically- or dynamically-typed language", "interpreted language", "dependency inversion", a bunch of "software patterns" and more. All this terminology is meaningless because there's never a way to tell if a language is object-oriented or not, if it's statically-typed or not and so on. Parallel vs concurrent is just one of those things where emotional attachment won over any attempt at rational thinking.


Uhh, I think it is pretty commonly accepted that a statically-typed language has typechecking facilities before runtime and a dynamically-typed language doesn't. Maybe there is a spectrum, with things like gradual typing somewhere in the middle, but the general idea is quite clear.


1. That's not part of the language, it's part of an implementation. I.e. it means that the same language can be both statically-typed and not (according to your "definition"). Which is literally nonsense (in the sense true = false). 2. That something is commonly accepted doesn't make it any more true. In the context of programming, a lot of commonly accepted beliefs are nonsense, and this one isn't an exception.

It's not about a spectrum. It's about the inability of a lot of people to critically assess information coming from otherwise reputable sources.


What are you talking about? Languages are pretty explicitly designed to be statically- or dynamically-typed: e.g. take Python which relies a lot on having a dynamic typing discipline. Yes, you have Mypy and pytype, but those are pretty much different dialects.

Also, what reputable sources are you even talking about? And that's quite an unwarranted personal attack you've attached as well.


I think there is a point here if you ignore the grumpy old man delivery.

For example, even Python, which is dynamically typed, is also strongly typed, so you now have to know not just the distinction between statically and dynamically typed but also between strong and weak. Then stray slightly from the vanilla Python interpreter/runtime to any of the other flavors and you now have to reason about compile time, runtime, interpretation time, bytecode, interop with the JVM, etc. So there is a point that the language is a way of expressing something and the implementation is where the devilish details lie. You can argue Jython isn't Python or whatever, and that's all well and good, but you can't really discuss any of this with other reasonable humans without getting into the details of implementation. Sure, for a leetcode level of understanding it doesn't matter much, but try to do something sufficiently complicated like building an OS extension in Python that interops with your C++-based API and you'll have to think about the implementation of the projections, and then port it to ARM and you'll have to think about it all over again.


that description is... not accurate

concurrent is about logical independence, parallel is about physical independence


I don't know who invented this nonsense distinction. The first time I was introduced to this idea of "concurrent programming" being a separate thing was when Go was released. So I associate this nonsense with Go, but it could have happened earlier; I simply never heard about it before then.

Anyways. The way I see it used today, it's applied to language runtimes incapable or severely crippled when it comes to parallel / concurrent execution. Eg. Python, JavaScript etc. In such environments programmers are offered a mechanism that has many downsides of parallel / concurrent programming (eg. unpredictable order of execution) without the benefits of parallel / concurrent programming (ie. nothing actually happens at the same time, or only sleep is possible at the same time etc.)

I feel like this distinction, while a nonsense idea at its core, became so popular due to the popularity of language runtimes with disabilities and their users needing to validate their worth by adding features to their languages their runtimes are inherently incapable of implementing.

Similar situation happened with ML-style types. Python, for example, works very poorly with this add-on, but the desire to match features of other languages led Python developers to add those types anyways. Similarly, TypeScript and a bunch of similar languages, especially in Web.


How is this nonsense or anything to do with "language runtimes with disabilities"? An OS running on a single core processor cannot be parallel but it may be concurrent: it can never physically do two things at the same time, but it might be able to logically interleave different tasks.

Parallelism is a physical thing, concurrency is a logical thing.


> Parallelism is a physical thing, concurrency is a logical thing.

Fundamentally the difficulty is all about synchronization. People can try to split hairs and say there are two terms for two different things but ultimately it doesn't matter because the underlying problem is the same.


Single core concurrency doesn't have to deal with hardware memory synchronization.


> Single core concurrency doesn't have to deal with hardware memory synchronization.

Yes it does. If you have two threads which (for whatever reason) are sharing memory and they can be run in an arbitrary order, you have to deal with memory synchronization. If it's a single core machine, you still need to synchronize your memory access. Mutexes and semaphores predate systems with multiple processors.
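A minimal sketch of that lost-update problem (my own toy example; thread and iteration counts are arbitrary). Even with only one core, a thread can be preempted between the read and the write of the counter, so the unprotected version can lose increments while the mutex-protected version cannot:

    #include <cstdio>
    #include <mutex>
    #include <thread>

    long counter = 0;
    std::mutex m;

    void add(long iters, bool locked) {
        for (long i = 0; i < iters; ++i) {
            if (locked) {
                std::lock_guard<std::mutex> g(m);
                ++counter;          // read-modify-write, protected
            } else {
                ++counter;          // unprotected read-modify-write: a race
            }
        }
    }

    int main() {
        const bool modes[] = {false, true};
        for (bool locked : modes) {
            counter = 0;
            std::thread t1(add, 1000000L, locked), t2(add, 1000000L, locked);
            t1.join(); t2.join();
            std::printf("%s: counter = %ld (expected 2000000)\n",
                        locked ? "with mutex" : "without mutex", counter);
        }
    }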


Yes, but those are not hardware memory synchronization mechanisms. You do not have to deal with memory fencing to coordinate reads/writes at the CPU level, because there is only one core that can read memory at a time.


What point are you trying to make now? You just said:

> Parallelism is a physical thing, concurrency is a logical thing.

So by your own definition, wouldn't multi-core concurrency be parallelism?


Yes? Parallelism is inherently concurrent but concurrency is not inherently parallel. But single core concurrency still exists as a form of non-parallel concurrency.

The point I'm trying to make is that, down to brass tacks, single-core concurrency is a lot simpler than multi-core concurrency, because you don't need any hardware cooperation.


Before you were saying they are two different things, now you're saying they are the same when you have multiple cores and not the same when you have a single core. If they're the same when you have multiple cores, isn't there just single and multi-core concurrency by your own definitions? If so, why are you making a distinction of parallelism and concurrency?


Parallelism is concurrent.

Concurrency isn’t necessarily parallel.

There are still plenty of single core systems out there running concurrent software, even today, for example embedded systems.


You have just repeated the nonsense I was talking about.

The claim you repeat is meaningless. A program is either parallel / concurrent or not. The situation you describe (when there's a single processor core) isn't parallel or concurrent. In some sense, it emulates concurrent / parallel execution because it imitates the unpredictable ordering of code execution, which sure has its uses... but the whole point of dealing with this unpredictable ordering is that we actually want parallelism / concurrency. The emulation on its own is worthless.


No it's not? How do you characterize running two programs on a single core without calling it concurrent but not parallel? There is a clear distinction between logical multitasking and physical multitasking that I think is useful to taxonomize.


bruh your OS scheduler wants a word with you


a program (as written) is either concurrent or not, a program (as executed) is either parallel or not

a program which is not written as concurrent can never be executed as parallel

a program which is written as concurrent can be executed as parallel, or not

> The situation you describe (when there's a single processor core) isn't parallel or concurrent.

a program running on a single core can never be parallel, but it can be concurrent

concurrent is a logical property, parallel is a physical property


"a program which is not written as concurrent can never be executed as parallel"

Unless of course you're running multiple independent instances of it, each with different parameters, which I gather is a pretty common way of running long-running CPU-intensive operations on large data sets. Presumably there's often some final separate step that may be needed to combine the results once they're all finished, which if run manually obviates any concurrency concerns at the software level.


in which case that program is being executed N times, so none of this applies

("a program [as executed]" is a single instance (process) of a binary)


Yes, concurrency on a single core CPU is simply not possible. That's why multitasking OS's didn't exist until multicore CPUs became a thing.

If only these "Gophers" knew anything about computing history!


So before 2001 your OS didn't have a scheduler and was incapable of handling more than one process at the same time? The user interface was simply hanging when you told your CPU to do something?


That was the point I was attempting to get across. Single core computers multitask fine, which proves that concurrency is not the same thing as parallelism.

Even in modern times, we have something like the first generation raspberry pi zero, it has a single ARM core, and it multitasks fine.


Your "single core" computer contains a lot of other concurrently operating hardware like storage and network cards.


>> don't know who invented this nonsense distinction ...

It's not nonsense. In C or C++ lots of code can be made parallel using OpenMP and inserting some #pragma statements above for loops. This does not work for things like running a UI in one thread and some other work in another thread, perhaps displaying results as they are found. These are quite different types of parallelism.


OpenMP versus std::thread/pthread/etc. isn't this distinction. Both are thread-based parallelism; that OpenMP implementations do so cleverly via a threadpool approach is secondary. Any loop you can write using OpenMP you could dispatch manually via an explicit thread-based approach, and any long-running parallel task you'd normally implement via std::async/std::future/std::thread could be dispatched via OpenMP. There are good conventions about what kinds of operations are better expressed via one or the other mechanism, but they are effectively equivalent in capability.
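To illustrate the equivalence, a sketch under my own assumptions about chunking and thread count (not how any particular OpenMP runtime actually partitions work):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // The same element-wise loop, dispatched via OpenMP...
    void scale_openmp(std::vector<double>& v, double k) {
        #pragma omp parallel for
        for (long i = 0; i < (long)v.size(); ++i)
            v[i] *= k;
    }

    // ...and dispatched manually via std::thread.
    void scale_threads(std::vector<double>& v, double k, unsigned nthreads = 4) {
        std::vector<std::thread> pool;
        std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(v.size(), lo + chunk);
            pool.emplace_back([&v, k, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i)
                    v[i] *= k;
            });
        }
        for (auto& th : pool) th.join();   // the barrier OpenMP gives you implicitly
    }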


What does this have to do with anything?

I mean, great... you discovered a somewhat useful library: OpenMP... so what? How does this factoid affect the validity of the definition of code parallelism?


I think it's nonsense to invent new terminology for this tiny minor difference in how you use threads and whether you wait on results or you don't wait.


I'm sorry, but you are betraying a myopic view here. Concurrency as a computer science concept has existed at least since the invention of co-routines, which have been around for 60+ years. Parallelism as a concept has also existed for about as long. I find it fascinating that you take issue with "concurrency" when that one is the older concept in the engineering zeitgeist. If anything, parallelism is the odd one out, because it only became something average engineers started thinking about when consumer CPUs started becoming multithreaded.


but it's just not a tiny minor difference. The problems you end up dealing with are entirely different.

The terminology used might be terrible since they're originally synonyms. Differentiating between the two is very useful though.


how are the problems different? In all cases you want the maximum performance. The only decision here is how you schedule your tasks. Ideally entirely in parallel and you only schedule them to wait if you can't find a way to make them 100% parallel


Parallelism and concurrency are different models. It is not a needless distinction.

With just concurrency, you can have a system, e.g. an event processing system, where each event happens, is processed to completion, and then the next event is processed. Events can be of different types and can have different handlers. Events can arrive (or be delivered) in different orders. The context-switching points are known (in this case, the beginning and end of event processing). As such, any computation between context-switching points appears to happen atomically (i.e. no partial update, no internal reordering).

Parallelism is a different model.

There is no well-defined, limited set of interaction points between computations. A computation could literally be interrupted at any machine-level instruction and another computation start running. Suddenly, the number of possible interactions is huge, almost beyond comprehension. Locks, transactions, or other synchronization mechanisms are necessary in order to create larger atomic regions (and defend against race conditions). Worse, the weak memory models that most modern multi-core hardware has mean that interleaving alone is not enough to explain the possible interactions between threads. With a weak memory model, writes from one computation can appear in different orders to different threads. Weak memory models make avoiding race conditions absolutely paramount.
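A small illustration of the weak-memory point, using the classic store-buffering litmus test (whether you actually observe the outcome depends on hardware and compiler, so treat it as a sketch):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    int main() {
        std::thread a([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join(); b.join();
        // Any interleaving of the four operations yields (1,1), (1,0) or (0,1).
        // On weakly ordered hardware (or with these relaxed atomics) r1 == r2 == 0
        // is also allowed, which no simple interleaving can explain.
        std::printf("r1 = %d, r2 = %d\n", r1, r2);
    }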


This is word salad. Sorry.

What are you modeling?

The way you use "concurrency" it's indistinguishable from code without any signs of parallelism. From your "definition", concurrency is just any code. Such definitions are called "trivial" if you don't want to offend the author, and "worthless" if you are honest.

Your attempt at defining "parallelism" is even worse... You start by calling it undefined, and then proceed throwing poorly connected verbs and nouns...

Let me make it simple:

Parallel or concurrent code is code that doesn't require a time interval between instructions (this is in contrast to the von Neumann model, where it's necessary to have a non-zero time span between instructions).

Here. That's it. No "but hardware", no "there are no well-defined" etc.


Your comment is subtractive from the discussion, and its tone and attitude are not OK. If you came here to make no attempt to understand what people have written, even when they are explaining words and defining terms, and instead attack them and shout past them, we're better off if you didn't.

I wouldn't normally put such a fine point on it, but in this thread you've managed to make a lot of noise and call lots of things "nonsense". If you're after emotional responses and heated discussion, that's tantamount to trolling, and you should stop.


a program which is concurrent expresses logically independent paths of execution, which can be executed one after the other, or at the same time, or any mix thereof -- it is a logical property of the program as expressed, and not necessarily exploited, or observable, when the program is run

a program which is parallel actively demonstrates multiple paths of execution during runtime -- it is a physical property of the program as executed

tl;dr: every parallel program is concurrent, not every concurrent program is parallel


> In all cases you want the maximum performance.

Not necessarily true. In many cases of concurrent programming, you don’t actually care about performance, you care about multiple different things happening seemingly at the same time (sure faster is nice, but not stuttering or freezing is more important). For example on a single core multitasking system (either in the past or on an embedded system today), you want multiple different tasks all running together, but this is achieved by giving them all a little chunk of execution one after the other — time slicing. The tasks are running concurrently, they all make progress together, but they share the same single thread of execution, they don’t happen in parallel. Yet one doesn’t block another, which is the important thing here.

I mean, maybe you “want” maximum performance, but that’s not what’s most important. Not having one task block another is what’s important.

Parallel programming is pretty much always about making things faster, by running them in parallel at the same time. The main benefit of doing that is that they both complete faster.


If there is no dependency between your results, you do not need synchronization. Parallelizing just means putting the tasks into separate streams of execution. Those can be threads or entirely different computers, and the different execution streams need no synchronization or communication between each other.

That is an entirely different set of problems from having to deal with a computation that has close dependencies, where you now need synchronization and communication to progress the computation. You don't only have to think about how to synchronize your computation, you also have to think about how to distribute your data to begin with to minimize the need for synchronization. An entire set of problems that just doesn't exist in the former case.

To pick up a word you used: In one case you have many independent tasks and you want to run them as quickly as possible. In the other case you have a single task and you think about how to split that up to make it faster.


Concurrency doesn’t necessarily require threads. Concurrency is not a difference in how you use threads, it’s a logical computer science concept. Parallelism is a physical property, and parallelism is almost always concerned with doing things faster by doing them in parallel; concurrency can be done for other purposes (e.g. to not have long-running tasks block other work). All parallel programming is concurrent but not all concurrent programming is parallel.

The distinction matters because the focus is different: they’re different topics. With concurrent programming you deal with synchronisation and topics like lock-freedom and wait-freedom. In parallel programming the main focus is about work distribution and scheduling to process it faster.


I'm not sure I can identify when it started, but these were already the concepts commonly in use when I did my CS undergraduate work in the early 90s. I.e. it was in textbooks and course titles as established jargon.

Concurrency was the kind of thing worried about in OS design or Unix programming styles whether on a time-sharing system or some small scale multi-processing system. Coordination of heterogeneous sequential programs on some shared resources.

Parallelism was the topic of high-performance computing with combined use of many hardware resources to accelerate a single algorithm.

Of course these are simplifying abstractions, and real systems can get into the murky gray area that is both concurrent and parallel.


Rob Pike has a discussion on the distinctions between parallelism and concurrency. Concurrency is closely related to co-routines which are a distinct invention from threads/processes which are more related to parallelism.


Just because you didn't know about the concept doesn't mean the distinction is nonsense. I think they are similar but not the same, exactly for the reasons laid out in the comment you are replying to. Just because you hate the languages that support concurrent programming doesn't mean concurrent programming is meaningless. Any language using async/await (basically all of them these days) supports concurrent programming, including languages such as Swift and C# which are nothing like Python or JavaScript.


> In such environments programmers are offered a mechanism that has many downsides of parallel / concurrent programming (eg. unpredictable order of execution) without the benefits of parallel / concurrent programming (ie. nothing actually happens at the same time, or only sleep is possible at the same time etc.)

From the developer's perspective it's a massive upside to not have to manage low-level details and just define how the event loop will call their code.

Any modern web browser has plenty of parallel execution behind the scenes, but the developer (and user) will just see concurrency which is much simpler to reason about. The order of execution doesn't matter if the things being executed aren't dependent. What matters more is that there's only one thread to think about. If they are dependent they shouldn't have been parallelized in the first place, so they're not.


What does this have to do with low or whatever other level?


Both I and (apparently) the author of TFA disagree with your definition of parallel programming. TFA gives an example of "embarrassingly parallel" programs as one way to make parallel programming simple.

The distinction I learned was: any time you have multiple logical threads of execution you have concurrency, any time you have multiple computations happening simultaneously, you have parallelism.

Multithreaded programming on a single core computer is concurrent, but not parallel. Vector processing is parallel, but not concurrent.


> The distinction I learned was: any time you have multiple logical threads of execution you have concurrency, any time you have multiple computations happening simultaneously, you have parallelism.

I like this distinction as it also splits the different problem domains quite well. And I don't think it contradicts my definition as much as you might think.

When you have an embarrassingly parallel program you do not have to deal with the problems that come from data dependencies and synchronization of your simultaneously running computations on different threads/machines. You do not really have to think about your computation running in parallel, but just about how to put the pieces into different execution environments to run them concurrently. So you end up doing "concurrent programming".

When you do not have an embarrassingly parallel program, you still use the base concepts of running something concurrently (e.g. threads), but now your main focus shifts to how the multiple computations can happen simultaneously. Now you end up doing "parallel programming" or parallel computation.

In the end, the terminology here is less than ideal. My main point was that some kind of distinction matters, as TFA clearly discusses different topics from what many people think about from a web dev perspective (e.g. async, futures, etc.)


Vector processing is not parallelism, but rather non-scalar. Specifically, it's a single operation that is able to do work on multiple data items, rather than parallel processors doing work at the same time.


It is data parallel. In the same way, I would label super-scalar CPUs as machines that automatically perform parallel processing on a linear stream of instructions (taking advantage of so-called "instruction-level parallelism").


I know we’re kind of stuck with the term of art “concurrent”, and will forever have to explain the difference between the synonymous words “concurrent” and “parallel” (the Merriam-Webster definition of “concurrent” uses the word “parallel”, for example).

Wouldn’t it be nice if we could come up with a better word for this that doesn’t literally overlap with ‘parallel’ and doesn’t need to deviate so far from its dictionary definition?

Personally I think of JavaScript as ‘asynchronous’, and I know this as a term of art means a programming model, but it’s a lot easier to see that async can be done with a single thread and isn’t necessarily parallel, right?


Parallel programming is simply a type of concurrent programming. Concurrent means that two tasks can both progress in a given duration. On a single core computer every thread runs concurrently. Parallel expands on concurrency to mean that two concurrent tasks can also run at the exact same time. On a multi-core computer threads can run in parallel. In many cases concurrent programming and parallel programming have little to no difference, and you program with the assumption that every task can run in parallel (for example, whenever you use async/await or the threadpool).


Watching geohot code a general matrix multiply algorithm and optimise it from 0.9 GFLOPS to 100 GFLOPS by only tinkering with cache locality makes me wonder how much effort should be put into single-threaded performance before ever thinking about multithreading.
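For anyone curious what that kind of single-threaded win looks like, here is a bare-bones sketch (my example, not geohot's code): swapping the two inner loops of a naive matrix multiply so the innermost accesses walk B and C row by row instead of striding down a column. The actual speedup depends on matrix size and cache.

    #include <vector>

    using Mat = std::vector<double>;   // row-major n*n matrix

    // Naive i-j-k order: B is read column-wise, which is cache-hostile.
    void matmul_ijk(const Mat& A, const Mat& B, Mat& C, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // i-k-j order: B and C are both walked sequentially, so the prefetcher helps.
    void matmul_ikj(const Mat& A, const Mat& B, Mat& C, int n) {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                const double a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }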


I've seen stuff like that before with a game called Factorio, the only game I've ever seen that is optimized so hard that your RAM speed can affect large bases rather quickly, same with faster L2 cache. Their entire blog series[1] covers a large part of how they did this, but for a game written mostly in Lua they sure did a good job on it.

1: https://www.factorio.com/blog/post/fff-204


Yes, the Factorio devs had an approach where they optimised everything happening in the game in the original singlethreaded environment, before moving onto multithreaded support. That's where the game is now, and as far as I understand it the multithreading occurs on each independent set of conveyor belts or belt lanes, and there's some info on that in this blog post[0] for anyone interested.

[0] - https://www.factorio.com/blog/post/fff-364


It makes sense that a simulation game like Factorio would be memory bandwidth limited: each tick it needs to update the state of a large number of entities using relatively simple operations. The main trick to making it fast is reducing the amount of data you need to update each tick and arranging the data in memory so you can load it fast (i.e. predictably and sequentially so the prefetcher can do its job) and only need to load and modify it once each tick (at least in terms of loading and eviction from cache). The complexity is in how best to do that, especially both at once. For example, in the blog post they have linked lists for active entities. This makes sense from the point of view of not loading data you don't need to process, but it limits how fast you can load data because you can only prefetch one item ahead (compared to an array where you can prefetch as far ahead as your buffers will allow).


Note: Since this post, they've multithreaded a fair bit of the simulation as well. It runs insanely well, the entire game is a marvel of software engineering in my book.


Fintech has mostly determined that 1 thread can get the job done. See LMAX disruptor and related ideas.

What problems exist that generate events or commands faster than 500 million per second? This is potentially the upper bar for 1 thread if you are clever enough.

Latency is the real thing you want to get away from. Adding more than one CPU into the mix screws up the hottest possible path by ~2 orders of magnitude. God forbid you have to wait on the GPU or network. If you have to talk to those targets, it had better be worth the trip.


> What problems exist that generate events or commands faster than 500 million per second?

AAA games, Google search, Weather simulation, etc? I mean it depends on what level of granularity you’re talking about, but many problems have a great deal going on under the hood and need to be multi threaded.


I would add a qualifier of "serializable" to those events or commands. This is the crux of why fintech goes this path. Every order affects subsequent orders and you must deal with everything in the exact sequence received.

The cases you noted are great examples of things that do justify going across the PCIe bus or to another datacenter.


That’s half of it, the other half is each of those events is extremely simple so the amount of computation is viable with a single thread.

If individual threads were dramatically slower the architecture would get unpleasant by necessity. Consider the abomination that is out of order execution on a modern CPU.


Don't forget about audio!


I don't think audio applies here at all.

Each channel can be mixed in parallel, 44,100 samples per second is not much per channel and mixing isn't difficult.

Also most can be cached and don't need to be mixed or filtered in real time because they haven't been changed from the last play.


Sorry, I should have been more specific. I meant DAWs and computer music environments, not simple audio players. Modern DAWs are heavily multi-threaded.


I am curious on that. 500 million events per second sounds high. Even for games. That many calculations? Sure. I take "events" to mean user generated, though. And that sounds high.

Same for searches. Difficulty there is size of search space, not searches coming in. Right?


Google search isn't a good example. AAA games are a great example when you think about graphics. However, most of that is trivially parallelizable, thus "all you need to do" is assign vertices/pixels to different threads (in quotation marks as that's of course not trivial by itself, but a different kind of engineering problem).

However, once you get into simulations you have billions (or multiple orders of magnitude more) of elements interacting with each other. When you simulate a wave, every element depends on its neighbors, and the finer the granularity the more accurate your simulation (in theory at least).


Thinking of graphics, though, I would assume most of that is on the GPU side. Simulations do make sense, but I see games like Factorio still focused on single thread first, and then looking for natural parallel segments.

That is all to say that millions of events still feels like a lot. I am not shocked to know it can and does happen.


There are no good solutions for something like factorio. There are solutions that work but they aren't worth the trouble. My personal recommendation is that you split the world into independent chunks. A big interconnected factorio map is a nightmare scenario because there is hardly anywhere where you can neatly split things up. Just one conveyor belt and you lose. Aka parallelize disconnected subgraphs.

So the game would have to be programmed so that conveyor belts and train tracks can be placed at region boundaries and that there is a hidden buffer to teleport things between regions. Now you need an algorithm to divide your graph to both minimize the imbalance between the number of nodes in the subgraph but also to minimize the edges between subgraphs.


Just dreaming over here, but if someone had the opportunity to rebuild a Factorio from the ground up, I bet they could design something massively scalable. Something based in cell automata, like how water flow works in Minecraft. Current cell state = f(prev cell state, neighboring cell states).

It would take some careful work to ensure that items didn't get duplicated or lost at junctions, and a back-pressure system for conveyor belt queues. Electrical signals would be transmitted at some speed limit.


This is a wrong view of the problem. Often times your application has to be distributed for reasons other than speed: there are only so many PCIe devices you can connect to a single CPU, there are only so many CPU sockets you can put on a single PCB and so on.

In large systems, parallel / concurrent applications are the baseline. If you have to replicate your data as it's being generated into a geographically separate location, there's no way you can do it in a single thread...


> God forbid you have to wait on the GPU or network. If you have to talk to those targets, it had better be worth the trip.

few programs are CPU-bound, most programs are bottlenecked on I/O waits like these


As far as I know the LMAX disruptor is a kind of queue/buffer to send data from one thread/task to another.

Typically, some of the tasks run on different cores. The LMAX disruptor is designed such that there is no huge delay due to cache coherency. It is slow to sync the cache of one core to the cache of another core when both cores write to the same address in RAM. The LMAX disruptor is designed so that each memory location is (mostly) written to by at most one thread/core.
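A very reduced sketch of that single-writer idea (not the Disruptor itself, just the principle, with capacity and alignment chosen arbitrarily): a single-producer/single-consumer ring where each index variable is written by exactly one thread, so the cores mostly avoid bouncing cache lines back and forth.

    #include <array>
    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>   // N should be a power of two
    class SpscRing {
        std::array<T, N> buf_{};
        alignas(64) std::atomic<std::size_t> head_{0};  // written only by the consumer
        alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the producer
    public:
        bool push(const T& v) {            // call from the producer thread only
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
            buf_[t % N] = v;
            tail_.store(t + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {                 // call from the consumer thread only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return false;      // empty
            out = buf_[h % N];
            head_.store(h + 1, std::memory_order_release);
            return true;
        }
    };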

How is the LMAX disruptor relevant for programs with 1 core?


> How is the LMAX disruptor relevant for programs with 1 core?

It is not relevant outside the problem area of needing to communicate between threads. The #1 case I use it for is MPSC where I have something like an AspNetCore/TCP frontend and a custom database / event processor / rules engine that it needs to talk to.


It's quite rare for problems to be dominated by hot loops in the same way that matrix multiplication is.

Think about something like speeding up a compiler or a web server or a spreadsheet. There's no 50-line function that you can spend a few hours optimising and speed up the whole thing.

That's part of the reason why Python programs (except maths heavy stuff like ML) tend to be so slow despite everyone saying "just write your hot code in C". You can't because there is no hot code.


This advice dates back to when Python was primarily used by scientists to drive simulations. I remember hearing this advice in the early 2000s as a physics student using vpython to drive N-body simulations. They told us to do the physics in C, but everything else in Python due to the simulation math taking too long in raw Python. We couldn’t make the visualization run at a reasonable frame rate without those optimizations.

These days Python is being used for everything, even things without hot loops as you note. Yet the advice persists.


Fun video[0]. The optimization bit starts at 0:35. [0]: https://www.youtube.com/watch?v=VgSQ1GOC86s


That appears to be one of the major design goals for the Mojo programming language. It allows the developer to code at a high level of abstraction and indicate where parallel execution should be used. Then the execution environment automatically optimizes that at runtime based on the actual hardware available. That hardware may change drastically over the full lifecycle of the application code so in the future it will automatically take advantage of hardware advances such as the introduction of new types of specialized co-processors.

https://www.modular.com/mojo


Cache locality is certainly an important aspect, and especially when it comes to matrix multiplication even just changing the loop order (and thus access pattern) has a huge performance impact. However, an O(n^3) algorithm will always lose out to an O(n^2*log(n)) algorithm when the input gets big enough. And the difference between these two might be as simple as sorting your input data first.

I teach parallel programming to graduates and the first exercise we give them is a sequential optimization for exactly that reason. Think about whether your algorithm is efficient before thinking about all the challenges that come with parallelization.


> [re: matrix multiplication] However, an O(n^3) algorithm will always lose out to an O(n^2*log(n)) algorithm, when the input gets big enough.

You have to be very careful about what 'big enough' means. In practice, Strassen multiplication is not faster than the naive algorithm until you get to the point where you're multiplying matrices with hundreds of rows/columns. Additionally, naive matrix multiplication is well suited to GPUs, while Strassen multiplication on the GPU requires temporary buffers and multiple jobs and sequencing and whatnot.

As a general rule, matrix multiplication with complexity better than the naive algorithm should probably not be used. Do naive matrix multiplication on the CPU. If you need it to be faster, do naive matrix multiplication on the GPU. If you need it to be faster, the numerical stability of your problem has probably already come a gutser and will get worse if you switch to Strassen or any of the other asymptotically faster algorithms.

And the algorithms faster than Strassen? Forget about it. After Strassen multiplication was invented, about a dozen or so other algorithms came along, slowly reducing that O(n^2.8) to about O(n^2.37188) or so (most recently in 2022; this is still an area of active research). The problem is that for any of these algorithms to be faster than Strassen, you need matrices that are larger than what you can keep in memory. There is no big enough input that will fit in the RAM of a modern computer. One estimate I've heard is that if you convert every atom in the observable universe into one bit of RAM, and you use that RAM to multiply two 10^38 by 10^38 matrices to get a third 10^38 by 10^38 matrix, you're still better off using the O(n^2.8) Strassen multiplication instead of the state-of-the-art O(n^2.37188) algorithm. The constant slowdowns in the other algorithms really are that bad.
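(An aside, for anyone wondering where the 2.8 comes from: the naive blocked algorithm does 8 half-size multiplications, Strassen gets away with 7, so the recurrence is T(n) = 7 T(n/2) + O(n^2), which solves to O(n^log2(7)) ≈ O(n^2.807). The extra additions hidden in that O(n^2) term are the constant factor that keeps the plain O(n^3) method ahead for small and medium matrices.)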


> However, an O(n^3) algorithm will always lose out to an O(n^2*log(n)) algorithm, when the input gets big enough.

Sure, this might be the case from a theoretical point of view (as per definition) but this completely disregards the hidden constants that come to light when actually implementing an algorithm. There's a reason why for instance a state-of-the-art matrix multiplication algorithm [0] can be completely useless in practice: The input data will never become large enough in order to amortize the introduced overhead.

[0] https://arxiv.org/abs/2210.10173


Link for that ?



As a developer, I often choose higher-level APIs not listed in that article.

On Windows, OSX and iOS the OS userland already implements general, and relatively easy to use, thread pools. On Windows, see CreateThreadpoolWork, WaitForThreadpoolWorkCallbacks, etc. It’s easier to use threads with locks while someone else is managing these threads. On Apple, the pool is called “grand central dispatch” and does pretty much the same thing.

The modern Windows kernel supports interesting synchronization APIs like WaitOnAddress and WakeByAddressSingle, which allow you to implement locks without the complexity or performance overhead of maintaining special synchronization objects.

The Linux kernel implements performant and scalable message queues, see mq_overview(7). And it has synchronization objects like eventfd() and pidfd_open() which allow you to integrate locks or other things into poll/epoll-based event loops.
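As a sketch of the eventfd-in-an-event-loop pattern (error handling omitted; the "work" being signalled is left abstract):

    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    int main() {
        int efd = eventfd(0, 0);                 // kernel-backed 64-bit counter
        int ep  = epoll_create1(0);
        epoll_event ev{};
        ev.events  = EPOLLIN;
        ev.data.fd = efd;
        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        std::thread waker([efd] {
            uint64_t one = 1;
            write(efd, &one, sizeof one);        // wake the event loop from another thread
        });

        epoll_event out{};
        int n = epoll_wait(ep, &out, 1, 5000);   // a real loop would spin here
        if (n == 1 && out.data.fd == efd) {
            uint64_t count = 0;
            read(efd, &count, sizeof count);     // consume the notification
            std::printf("woken, counter = %llu\n", (unsigned long long)count);
        }
        waker.join();
        close(efd);
        close(ep);
    }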


GCD/libdispatch is a fantastic approach to concurrency and you can build and install support for non-Apple operating systems:

https://github.com/apple/swift-corelibs-libdispatch

Here’s a simple echo server:

https://github.com/williamcotton/c_playground/blob/master/sr...

Here’s a simple multithreaded database pool:

https://github.com/williamcotton/express-c/blob/master/src/d...


libdispatch's idea of using specific queues for serialized concurrency is nice, but its abstractions on top are unfortunately not as efficiently designed as they could be; `dispatch_source` doesn't allow for direct completion-based IO schemes (io_uring, IOCP), pushing to a `dispatch_queue` always requires a heap allocation, and `dispatch_semaphore`/`dispatch_sync` block the thread instead of yielding asynchronously (which can cause "thread explosion"). I don't think systems like Go have these constraints.


> `dispatch_semaphore/dispatch_sync` blocks the thread instead of yielding asychronously (can cause "thread explosion")

dispatch_group can do waits without blocking threads, but asynchronicity is overrated, and dispatch's design makes it easy to overdo. Having default global concurrent queues was probably a mistake.

Swift concurrency is a more modern design here.


If you’re in C++ land, you might take a look at Boost ASIO. It’s not just for IPC and would give you portable code.


And in C++ can also use this dead-simple header file for a nice high-level, modern threadpool using function objects (lambdas) for very easy parallelization of arbitrary tasks: https://github.com/progschj/ThreadPool


Simple bit of code for a simple threadpool.

I spotted a bug in it though which may manifest itself depending on platform.


What did you find? The code is quite widely used, and I’ve also used it extensively for years without any problems.


Please don't use posix message queues (`mq_*`, `mq_overview`, etc) when you're writing a program with a single address space (like something that uses threads in a single process).

Posix message queues (the `mq_*` functions) are much slower than optimized shared-memory queues using typical atomics, and they have semantics that are unexpected to most: they persist like files after process termination (because of what they are designed to be used for), they have system-level limits on queue size and item size, etc.

A simple benchmark vs rust's `std::sync::mpsc` queue shows `std::sync::mpsc` is 28.6 times faster when using 1 producer and 1 consumer, and is 37.32 times faster with 2 producers and 1 consumer.


what about "green threads" that is not managed by the OS like https://tokio.rs ?


Tokio is using Rust's async features, which are not green threads. In the former, code has to explicitly mark potential yield points; in the latter, green threads can be scheduled by the runtime without any help from the code itself.

As a historical note, Rust used to have green threads but they were abandoned a long time ago. This is a good talk about both the differences between different forms of concurrency/async and Rust's history with them: https://www.infoq.com/presentations/rust-2019/ (includes a transcript)


I believe these green threads work well when there’s good support in the language, runtime, and standard library. I have built complicated concurrent software in C# with async-await. I don’t program golang but I heard the concurrency model works rather well in golang too, probably for the same reason as C#: good support in the language and the runtime.

I have no idea about Tokio. I don’t program Rust, and the feedback I read about async/await was mixed.


You only ever need this if you're trying to have hundreds of thousands to millions of threads. It's a very niche problem to have


It's a niche problem to have because our current programming paradigm treats concurrency as a second-class citizen. The invariants of most languages do not include those required for mass concurrency.

Functional programming better aligns with the requirements, which is how you arrive at Erlang and Elixir. Every map() function can be trivially replaced with the concurrent cmap(), because the side effects that would make it non-trivial are impossible to express.

Given that Moore's law as it applied to single threaded performance is dead and in the ground, it makes a lot of sense to start paying more attention to systems that treat concurrency as more than an afterthought.


Except in that scenario you absolutely don't want green threads but real threads. M:N threading (especially if N=1) doesn't help you get parallelism, and parallelism is what you need for modern CPUs.

So that again keeps green threads (and thus mass concurrency without parallelism) in the niche category.


Green threads do not make use of multiple cores of a modern processor.


It's entirely possible for a green threads implementation to schedule the green threads over multiple OS threads, thus making use of multiple cores. The programming language "Go" being a popular implementation of this.


M:N green threads run M green threads on N OS threads and thus use up to N processor cores


what makes you say this?

go programs definitely saturate modern server-class CPUs


In scientific computing (computational fluid dynamics, computational electromagnetics, etc.), parallel programming is a must. Most of the algorithms are not embarrassingly parallel, and need a significant amount of communication between threads during runtime. We mostly use one of the many MPI [0] libraries available for desktop and high-performance computing machines. Using these programming paradigms is difficult but tends to result in fantastic scalability for all types of engineering problems.

[0] https://en.m.wikipedia.org/wiki/Message_Passing_Interface


So this might be a naive question, but is it literally that you're simulating a fluid in parallel by having each thread simulate a portion of the space? And then the message passing between threads is the data about the fluid on the boundary?


Depending on the method, but generally yes, the physical domain is decomposed into the number of cores. Data is synchronized at processor boundaries at a very high rate to maintain consistency.
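Roughly, the skeleton looks like this (a sketch with a 1-D grid, one ghost cell per side, and placeholder sizes; real solvers decompose in 2-D/3-D and often overlap communication with computation):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int local_n = 1000;                    // cells owned by this rank (placeholder)
        std::vector<double> u(local_n + 2, 0.0);     // +2 ghost cells
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int step = 0; step < 100; ++step) {
            // Halo exchange: send my boundary cells, receive the neighbours' into my ghosts.
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // ...update interior cells u[1..local_n] using the freshly exchanged ghosts...
        }
        MPI_Finalize();
        return 0;
    }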


Computational fluid dynamics is actually a problem where parallelization doesn't get you much. It is primarily limited by memory bandwidth.


I am not sure where you get this information. Parallelization is everything in this space, hence why we have highly interconnected supercomputers to model the most difficult engineering problems. Typical runs use 30k+ cores for a single problem for weeks on end [0]. There are some special cases, such as Boltzmann/dvm [1], where individual partitions of cells have millions of degrees of freedom and memory bandwidth is the primary concern. Even then, doing domain decomposition to a larger number of cores takes care of the issue.

[0] https://www.nas.nasa.gov/SC22/research/project12.html

[1] https://www.sciencedirect.com/science/article/abs/pii/S00219...


Those large number of cores come with a diminishing return (see this benchmark, for example: https://nusit.nus.edu.sg/services/hpc-newsletter/cfd-simulat...)


A poorly-optimized code that does not scale is not evidence that all simulation tools behave the same. Most of the tools in use by NASA, DOD, DOE, research institutions and commercial codes scale very well, weak and strong scaling both. In fact, scalability is one of the primary requirements for simulation tools that are used in mission-critical environments. I've been the technical lead for many of these types of projects, and I have experience with most of the largest commercial, research, and open source simulation codes. The vast majority of parallel tools scale linearly well beyond the tens of thousands of cores, and I do agree that at some point adding cores can cause a bottleneck, but that point is usually far past the few hundred cores from your link for a significant portion of engineering applications.

It is common knowledge in the field to optimize a simulation for the largest number of cores it can efficiently use, so simulation cases are not just blindly thrown more cores without a justification given by the scalability. Your initial claim that parallelization doesn't get you much is flawed, parallelization is the only thing that enables scalable engineering analysis.


It's very, very hard to be memory bandwidth bottlenecked if your program is not parallelized, even on consumer hardware. A 5Ghz cpu core with dual-channel DDR5 might get 75 GB/s of memory bandwidth. That's 15 bytes per clock cycle. Prosumer hardware? Maybe 60 bytes per clock cycle. No way one cpu core can keep up with that.


Memory bandwidth can be increased by parallelization though. E.g. MPI (Message Passing Interface) is one of the major libraries in parallel programming and supercomputing and deals with parallelization across multiple machines.


I’m way-above-average interested in concurrent programming but this 600+ page brick will probably remain on my reading list until I am stranded on a deserted island. Did anyone here read the whole thing? Can one make a reasonable summary or is this more of a lexicon of different techniques?


Stranded on a deserted island is a bit radical, but I recommend going to a place with no internet connection for a week or so. I thought I had a problem reading long things. Turns out, I have an internet problem.


The circuit in my apartment that my wifi is on keeps flipping so my connection has been frustratingly intermittent.

I've gotten more reading and cleaning done in the last week than I have in maybe years.


I didn't realize I also was a tambourine_man -- you literally just described my recent realization.

I'm moving this week and won't have internet for a few days. I look forward to the relative break.


If you still have access to cellular internet, it's no good, in my experience. There needs to be no internet.


Some 40 years ago in the Netherlands we had no TV broadcasts at night, because people were supposed to be in bed.

I wonder if an ISP could offer a product like internet from noon till 7 (or even 4am till 9am), when few people use bandwidth. Perhaps combined with a very slow connection the rest of the day.


> concurrent programming

If you're interested in concurrent programming this book won't give you much. The topics focus on the parallelization of a sequential algorithm and go into detail about things like synchronization (locking, barriers), important HW details (like caches and CPU pipelining) and algorithmic approaches. Imagine a weather simulation and not concurrent requests to a web server.


Good to know! Yes I’m less interested in “the art of programming but parallel” (partly because I don’t have PhD level ambitions) and more interested in how to effectively program day-to-day boring stuff that also needs concurrency.

My assessment is that the latter is nowhere near “solved” (hesitant to use that word because craft unlike eg proofs is about trade offs), in the sense that concurrency differs a lot across our tools (languages, runtimes etc) and it even differs quite substantially within the main paradigms, like coroutines, async, green threads, native threading etc.

If we compare with other advanced language features like say memory management, we’ve come much further, imo (GC, RAII, ref counting, manual are pretty much all well understood as well as the stack-heap duality, adopted basically universally). With concurrency we have, if we’re being generous, merely mutices as the common ground. Even “simple” notification mechanisms and ownership transfer across tasks would vary greatly across languages and often be quite contrived.


Turing equivalence says that all logic can be translated to other paradigms. But thread interactions are not logic they are based in real time. So the differences in how they treat time will show up as differences in the implementation. I guess you could create a worst of all of the above polymorphic abstraction that looks the same no matter which way you’re using it, but I don’t know if we would benefit much from that.


I've read an earlier edition. I would say it is more of an overview of the different concerns and design patterns that someone doing parallel programming would find useful. For instance, it gives you an overview of how CPU caches work, and uses that to motivate why doing synchronization is expensive. It uses this both to motivate doing as little synchronization as possible and to give you ways to design systems which don't need (as much) synchronization, as well as to explain why RCU works so well as long as you don't need to update the synchronized structure often.

If you want a good overview, I'd recommend reading the roadmap in section 1.1. Also, I know a 600 page book (400 pages if you exclude appendixes) is a bit long, but I really enjoyed both the material presented and the style in which it was written. Hopefully that makes the length feel a bit less intimidating.


Check out the table of contents.

I did a quick flip through and realized that I'm never going to be doing low level multithreading, so I don't need to deal with OS layer stuff. There were some other ideas too, might be worth flipping to relevant areas.

Heck, if I do end up using it, I'll likely read a summary when I'm dealing with it.

There are some ideas that I haven't heard of, which was alright, but again, since my current language (Python) handles it and I am used to doing multithreading using those libraries, I don't get a ton of value out of reading the fundamentals. (opportunity cost)


I do a lot of it, but never on the low-end.

99% of mine, is responding to closures for network events and device responses.

When you do that, synchronizing is one way to deal with things (wait for the other thing to finish), or completion testing (is the thing ready for the next step?). Basically, they are the same thing.

You are also not always guaranteed a standard context, but modern languages make it easy to hook to one. In the "old days," we used to use RefCons (Reference Context hooks).


There's no need to read the whole thing. Read a chapter you're interested in and go back to the book once you're interested in another chapter. It's not a novel.


the PDF I opened was almost 1000 pages. And yes, from what I'm viewing in the Table of Contents, this is more of a comprehensive textbook covering almost every corner of the topic. I don't think it's meant to be read linearly (ha), nor is it expected to be binged in a few large sittings. Definitely meant for professionals or very motivated students more than a general hobbyist.


If one thousand of us read one page each we can be done in 15 minutes!

Then we just have to sync our knowledge.


I was going to read it, but then i played a round of Qwitzatteracht, the golf game, instead.


Functional programming can be a great way to handle parallel programming in a sane way. See the Futhark language [1], for example, that accepts high-level constructs like map and reduce, then converts them to the appropriate machine code, either on the CPU or the GPU.

[1] https://futhark-lang.org/


This looks brutal; for a lot of people, John Reppy's book Concurrent Programming in ML (as in SML, not Machine Learning) is going to be much more accessible. Pick the CSP-style library in the programming language of your choice.

Go with goroutines and channels

Clojure with core.async

F# with Hopac

It would be a very interesting project to roll your own in C# using Microsoft Robotics Studio's CCR (Concurrency and Coordination Runtime) (though I speculate those are buffered channels by default).
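
For readers who don't have any of those to hand, here is a very rough Java analogue of the channel style, with java.util.concurrent.SynchronousQueue standing in for an unbuffered channel. This is only a sketch of the flavour, not CSP proper and not what Hopac or core.async actually do under the hood:

```java
import java.util.concurrent.SynchronousQueue;

public class ChannelFlavour {
    public static void main(String[] args) throws InterruptedException {
        // An unbuffered "channel": each put() blocks until someone take()s.
        SynchronousQueue<String> channel = new SynchronousQueue<>();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    channel.put("message " + i);   // rendezvous with the consumer
                }
                channel.put("done");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        String msg;
        while (!(msg = channel.take()).equals("done")) {
            System.out.println("received: " + msg);
        }
        producer.join();
    }
}
```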


There are multiple replies like this one, but it's a bit shocking to see that on Hacker News people don't know the difference between concurrent programming and parallel programming.

Concurrency means you can have multiple tasks in progress during the same time period; parallelism means you have multiple tasks running at the same instant.

The most obvious demonstration of this is that you can (and many languages do) have single threaded concurrency.

I did some grad coursework in parallel programming, and there's really no way not to make it "brutal", because to do it in a way that actually increases performance you need to understand some low-level performance issues.


Correct me if I’m wrong, but isn’t concurrency enough for “most” use-cases? When does one really need true parallelism?


Concurrency is typically a great solution for IO-bound tasks, which is why it figures so prominently in modern-day webdev. It's also essential for UIs, where you don't want a single-threaded process blocking user interaction just because it's running a task that takes time to complete.

Parallelism is used to solve CPU-bound problems. Obvious examples are things like efficient matrix multiplication. However, this is what makes parallel programming so hard. There's a lot of nuance to working effectively with multiple cores/threads, so that even "embarrassingly parallel" problems sometimes fail to see benefits from naive parallel programming (for example, if your parallel solution requires something as small as more frequent trips to the L2 cache instead of L1, you can see performance degradation).

So parallelism and concurrency ultimately solve two very different domains of problems.


I don't think IO-bound and CPU-bound problems are the correct way of differentiating between the two. Parallel programming deals with real-time problems with data dependencies and synchronization requirements, while concurrent programming deals with independent computations, where progress or completion of one does not depend on the progress of another.

I agree, however, that they ultimately solve two very different domains of problems, and there is a lot of confusion going on here.

On a side note: matrix multiplication is generally not a CPU-bound problem, but a memory-bound one.


Parallelism is only about performance; that's it. If you need something to go faster, parallelism is an option.

Looking into the future, parallelism is one of the only remaining techniques for scaling classical computing. Processors have stopped getting faster. Instead, they're getting fatter (more cores).


Concurrency deals with what can be parallel, because the pieces don't share data dependencies. For example, a map() function is trivially a statement of concurrency: each item needs to have a function applied to it, and, assuming pure functions, that can all be done concurrently.

Parallel computing cares about what should be parallel, i.e., actually implementing parallelism. For most programmers, this job can (and probably should) be left to the scheduler, whose job is to translate concurrency into reasonably sane parallelism.

The scheduler comes with overhead, though, that can be avoided with hand-spun parallelism. I like to think of it like manual memory management: for most programs and programmers, using a garbage collector that a memory-management expert wrote is easier/better than manually allocating and freeing memory, but there are performance gains to be had if you do it by hand.
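
As a hedged sketch of that division of labour in Java (class name and numbers are purely illustrative): map() on a parallel stream declares the concurrency, and the built-in scheduler (the common ForkJoinPool) decides how, and whether, to actually run it in parallel.

```java
import java.util.List;

public class DeclaredConcurrency {
    // A pure function: applying it to every element is trivially concurrent.
    static long square(long x) {
        return x * x;
    }

    public static void main(String[] args) {
        List<Long> numbers = List.of(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L);

        // map() states *what* can happen concurrently; parallelStream()
        // hands the *should* question to the runtime's scheduler
        // (the common ForkJoinPool), which may or may not use extra threads.
        List<Long> squares = numbers.parallelStream()
                                    .map(DeclaredConcurrency::square)
                                    .toList();

        System.out.println(squares);
    }
}
```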


https://github.com/Hopac/Hopac is such an impressive piece of software. Too bad it never really took off like it deserved, but with more popular competition like Rx, or just tasks/async (which is enough for most stuff), that was pretty unavoidable.


Etymology of ML: "Meta Language"


Concurrent state management is hard, and I remain disappointed that software transactional memory -- and pushing a DB-style approach to non-DB workloads generally -- hasn't caught on.

I think most "application" type programs would benefit from being able to manage state in terms of higher-level transactional operations and let their runtime take care of serializing and avoiding deadlocks and race conditions. Developers in those spaces have come to rely on garbage collection in their runtimes; I see no reason why they couldn't also come to rely on a fully isolated MVCC transaction model (and only get access to non-transactional memory in exceptional circumstances). Bonus points if said transactions tie back to the persistent RDBMS transaction as well.

There are too many minefields in manual lock management and the tools remain fairly low level. Ownership management in languages like Rust helps, following a discipline like an actor/CSP approach helps, etc. but in the end if there's shared state and there's parallel work, there's potential for trouble.


You might be interested in Joe Duffy’s retrospective on Microsoft’s failed experiment to integrate STM into .NET:

https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on...


You really only need one paragraph from that:

> What do we do with atomic blocks that do not simply consist of pure memory reads and writes? (In other words, the majority of blocks of code written today.)

If you could just get programmers to stop mutating, you could get STM (which incidentally would give you back the ability to mutate)


Recently learned about Verse: https://dev.epicgames.com/documentation/en-us/uefn/verse-lan...

It features "speculative execution", which in practice is STM. Known names like Tim Sweeney and Simon Peyton Jones are behind it.


Thanks, skimmed a bit of the first half but will get into it more after my work day.


So, going through this a bit more on my lunch break: I think the root of some of the disillusionment here -- and probably the lack of success of STM at this level generally -- is that they tried to do too much.

I don't necessarily want the whole memory model of the runtime or VM to offer transactions. What I think is a good idea is to offer an overall framework -- within which higher-level applications can be written -- that brings transactional semantics with it. Basically a set of collections and data transformation and communications and process coordination libraries that work in harmony with an underlying MVCC storage layer -- rather than baking a transactional atomic keyword down into the syntactic level of the language or the memory model of the VM/runtime.

Put another way: I wouldn't necessarily give users the STM facilities. I would use lower-level STM facilities to construct a higher-level toolkit, and only expose that. Basically an RDBMS in-process -- without the SQL language boundary -- to be frank.

Yes, users could escape it easily, and start doing inconsistent things. But that's on them, same as any other framework.

Provide the pattern and tools in a nice coherent box and don't try to take over the whole world.


> Yes, users could escape it easily, and start doing inconsistent things. But that's on them, same as any other framework.

We're already there! You don't need an STM framework in order to leave consistency up to the user.


It's a good day when I get to use STM.

While the primary benefit is in thinking "these things should happen together or not at all" rather than thinking about locks, there's another feature I always forget about, which is retrying.

Retrying lets you 'nope' out of a transaction and try again as soon as something changes, without busy-waiting.


This reminds me when I was going through the YC accelerator in 2012.

We were building a web-based email client, and PG didn’t like the idea. He pulled our team aside during one of the batch-wide Tues night dinners and suggested we pivot to building something that could take single threaded programs and quickly/easily make them multi-threaded.

No one on our team knew anything about threading (none of us had even graduated college). So we kept working on the email client, now defunct.

PG brought us in a back room where the Stripe founders were waiting (I think they were speaking that night). We were brainstorming ideas with them, and one of the brothers recommended we build something having to do with phones and calling people (don’t remember the idea anymore).

PG was not very happy when he heard we weren’t going to pivot and kept on working on a new email client :)


OTOH, PG suggesting that some young people with zero experience in multithreading pivot to searching for the grail is just daft: it's a topic that computer science PhDs have failed to find a solution to despite decades of research. Not sure the world needed another web-based email client, but at least that project had some non-zero chance of success.


That's such a random suggestion. To switch from building a user application to building a tool for devs...that is in no way related to what you were working on?


It sounds to me like pg had realised there wasn't a future in their web-based email, but thought they had good programming chops and would be able to produce something useful.


If you had something in 2012 that could magically convert a single threaded program to a parallel one, it would be pretty useful. Seems incredibly difficult to build, though.


Do you think a Philosopher's Stone might have some value, too? I think Y Combinator should look into it. Maybe an Elixir of Immortality?

An automatically parallelizing compiler that works in general is one of the holy grails of computer science. They have been extensively researched for decades and we have little to show for it. If anyone managed to make one it would instantly render the entire computer industry obsolete.

We have systems that work in specialized cases with a whole lot of manual intervention. Even then, the executable is often less efficient than the serial version.


I also think it is an unrealistic suggestion, in case that didn’t come through in my first comment! Haha.

But I’m sure the hoped-for result was just another one of those specialized tools, specialized for some Y Combinator niche. Which is also probably an unrealistic idea for a team that isn’t interested in that sort of thing, but at least isn’t totally ridiculously stupid.


The biggest problem with parallel programming is the fixed communication costs. Threads are expensive to launch and stop. This then leads to thread pooling, which is not a bad idea, but then you have to make sure that nothing blocks the thread pool. Java is going to fix this problem with virtual threads.
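
For reference, the virtual-thread API being alluded to (Java 21+) looks roughly like the sketch below; the workload and numbers are made up, and whether this removes the pooling headaches in practice is a separate question.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreads {
    public static void main(String[] args) {
        // One cheap virtual thread per task; blocking in a task parks the
        // virtual thread rather than tying up a pooled OS thread.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                int id = i;
                exec.submit(() -> {
                    try {
                        Thread.sleep(Duration.ofMillis(10)); // simulated blocking IO
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    return id;
                });
            }
        } // close() waits for the submitted tasks to finish
    }
}
```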

Even if threads are cheap, you still have to decide when to parallelize something. If the section is short enough, splitting up the work will make your program slower because you have to wait for the new thread to be scheduled and then for the launching thread to be scheduled.

These are just the problems to get you started. They are not big barriers but they are big enough to make it not be the default.

The next problem is dynamic runtime behaviour. A parallel program can exhibit far more weird behaviours due to the nature of interleaved execution. This means that you will want a strong ownership model for your data and so far only Rust does it competently. Instanced locks are difficult to get right. Static locks are almost trivial but only if you can guarantee that your critical section never calls code that invokes the same lock. Recursive locks are a bad idea but not using them means you need to have two sets of methods. One is the public synchronized method that library users call and the other is the private unsynchronized method that actually does most of the work. It's very ugly to work with locks.
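
To illustrate the two-sets-of-methods pattern described above, here is a rough Java sketch (class and method names invented; note that Java's intrinsic locks happen to be reentrant, so there the split is only a convention, but the shape is the same with a non-reentrant lock):

```java
public class Account {
    private long balance;

    // Public, synchronized entry points that library users call.
    public synchronized void deposit(long amount) {
        depositUnlocked(amount);
    }

    public synchronized void depositTwice(long amount) {
        // With a non-reentrant lock, calling deposit() here would deadlock,
        // so we call the unlocked worker instead.
        depositUnlocked(amount);
        depositUnlocked(amount);
    }

    public synchronized long balance() {
        return balance;
    }

    // Private worker: assumes the caller already holds the lock.
    private void depositUnlocked(long amount) {
        balance += amount;
    }
}
```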

The other problem is that a lot of problems are genuinely difficult to parallelize. It is better to parallelize hierarchically where each hierarchy is still single threaded. This way you can maintain the illusion of mostly single threaded code. The alternative often requires a bespoke architecture. There are hardly any generalized solutions. You need to be an expert at parallel programming.


Here's course materials about the topic from my university, I assume they're available globally even if you can't sign up for the course: https://ppc.cs.aalto.fi/


Parallel programming is not hard; state management is hard. And for problems that are not embarrassingly parallel, unfortunately there is no silver bullet.


I have almost finished the book (to be honest, I skipped the "Validation" chapter) and I found it quite interesting. In particular I liked that it also covered more "exotic" topics such as sequence locks, hazard pointers and RCU. However, in general, I found it a bit too focused on Linux kernel code. The code examples are full of kernel macros, such as READ_ONCE and WRITE_ONCE, instead of C11/C++11 atomics. In the end, I appreciated it because it gave me a new perspective, but I'm not sure if it is helpful for people who are not already familiar with parallel programming...


The symptoms that led up to me learning about the need for locks in multi-threaded servers were painful. The state could change from one line of code to the next; how could it change when I had just set it a microsecond ago? Because the user clicked the button twice and the call was executed simultaneously, in different threads, because my new server had many cores. Then I learned about async programming, but I no longer remember what was so difficult about it. Just that it took about a year and I basically had to rewire how my brain visualizes program execution.
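
For what it's worth, the double-click bug described here is a classic check-then-act race. A contrived Java sketch (names invented):

```java
public class SubmitButton {
    private boolean submitted = false;

    // Two clicks can land on two request threads. Both may observe
    // submitted == false before either sets it to true, so the order
    // is placed twice. Marking this method synchronized (or guarding
    // the check-and-set with a lock) closes that window.
    public void onClick() {
        if (!submitted) {        // check
            placeOrder();        // another thread can run in between
            submitted = true;    // act
        }
    }

    private void placeOrder() {
        System.out.println("order placed by " + Thread.currentThread().getName());
    }
}
```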


> the state could change from one line of code to the next line

> I no longer remember what was so difficult about it

You (or your libraries) probably stopped mutating things. No mutation means you can rely on values you used several lines ago.

This is the perennial HN discussion whenever the word 'functional' comes up. "But x=x+1 is simple to teach to beginners" is the rallying cry. Well, this is why not.

Also, > the need for locks

This goes away if you don't mutate.


I like to philosophize about the universe, how at every state change a new universe is created. Like the version of you I see is someone else, almost identical, and the version I interacted with has branched off into a new universe.


Yet physical hardware has a finite amount of memory, so at some point you either stop your program or you have to modify a cell you already touched.


Mutation isn't confusing. On its own. Multiple references isn't confusing. On its own.

The observation, which I don't think I saw being made twenty years ago (I could be wrong), is that you shouldn't mix these two things. No multiple references with mutation. In theory you could safely do so in serial programs if you were careful enough; in practice you won't be careful enough, and we should write fewer serial programs than we do anyway, so why not reject this outright.

Rust is one particular (and notably successful) attempt to make a programming language about this big idea, but it's the idea in Val and several other newer languages. There are a lot of unknowns about the best way to approach this, but "Just pretend it's not important" is not a correct answer.


My point was simply that at some point the buck stops and someone has to write the code to mutate a memory cell.

Regarding your point, shared xor mutable is nice, but databases are an example of a shared mutable (and even concurrent!) yet safe resource, so other models are possible.


> databases are an example

Yes, it is possible because programmers have accepted the bargain that they must write purely-functional SQL.

Then the db is able to decide how best to optimise/apply/retry/copy-on-write/abort as necessary.

If SQL allowed you to directly mutate things, it would break everything. You couldn't 'LIMIT 10' if your loop might want to increment some data in row 40082.


I took a parallel programming class back in college. Even though we implemented several good examples of parallelism, like matrix multiplication, and different methods of achieving it using the GPU or distributed computers, the main goal of the class was to analyze which problems could be processed in parallel and how to divide a problem into small chunks in order to do so. It was a lot of fun.


Has dang got a day off?

Previous discussions:-

https://news.ycombinator.com/item?id=22030928 (2020)

https://news.ycombinator.com/item?id=34859102 (2023)


Ah, the old debate and approach. Centralize the control, decentralize problem handling, centralize the knowledge needed for untangling parallelism, realize you just created a centralized authority, undoing the parallelism.

Then there is the physics-engine view of all things. Data is just sand washing against the shoreline of parallel computation instances, and interaction is just sticking these data particles together, taking them out of the general flow to be operated on in unison, carrying the resolution authority within them as long as they are "clumped" together. Data is just particles, the programs acting on it are just a limited number of program cells traversing it, and interaction clumps it together, making the processing of it slower but also centralized to a computation node. A centralized node processing a huge clump of interacting data even throws a sort of "relativity" shadow, as other, smaller, non-interacting data is processed faster and leaps ahead in the interactions.


It is not hard if you use proper libraries. Take a look at the Scala ZIO library; you can solve any concurrency problem with ease. With the toolset ZIO provides, you can tackle any concurrency problem in a typesafe way, and it's mostly correct if it compiles :)


Thanks for another thing I need to add to my reading list in addition to The Art of Multiprocessor Programming.

I am really interested in parallel, asynchronous, multithreading, coroutine, and futures programming, so it's what I spend my days thinking and blogging about. I hope you sense my excitement in this comment about this topic. I'm looking for a programming model that parallelises easily and doesn't require much effort, so this PDF seems relevant to me. I really should try to be a user of languages like Erlang, Inko, Pony and Go, but I am too interested in the mechanisms of these languages!

I am also learning from Erlang and Go, nginx and LMAX disruptor.

I don't focus on number-crunching parallelisation; I let libraries and frameworks such as BLAS parallelise matrix multiplication. I'm interested in rote system parallelisation architecture. For example, PHP and Node.js are not parallel languages, but because PHP is hosted in FastCGI processes it can be executed multiple times by nginx, so it is in effect parallel across requests. Unfortunately, PHP and Node.js cannot create threads or use a thread pool within a request.

I want heavy CPU tasks of a request not to block other requests or the event loop, and heavy IO requests not to block the event loop. I am a pre-beginner in Rust, but I think you can use Rayon for CPU-heavy tasks and Tokio for async IO parallelisation.

Here's a system diagram that I'm thinking about lately: https://github.com/samsquire/ideas5/blob/main/NonblockingRun...

The design is that we have three groupings of thread types. The application starts up some application threads which are not associated with a request; these service multi-producer multi-consumer thread-safe ring buffers in lightweight threads with a Go/Erlang-like lightweight process runtime. (My simple lightweight thread runtime is https://github.com/samsquire/preemptible-thread) We also multiplex multiple network client sockets across a set number of kernel threads which I call control threads. Their responsibility is to dispatch work ASAP to a work-stealing thread pool, which has its own group of threads. So we pay a thread synchronization cost ONCE per IO, which is the dispatch from the control thread to a thread pool thread. (Presumably this is fast, because the thread pool threads are all looping on a submission queue.)

We split all IO and CPU tasks into two halves: submit and handle reply. I assume you can use liburing or epoll in the control threads; the same goes for CPU tasks, using ring buffers to communicate between threads. We can always serve clients' requests because we're never blocked on handling someone else's request. The control thread is always unblocked.

I think this article is good regarding Python's asyncio story: https://charlesleifer.com/blog/asyncio/

I think the best multithreaded architecture is to never rely on synchronization, because it doesn't scale. Try to separate your task so the work is more like a tree than a graph, so that you don't need to communicate between branches. You can shard your data, work independently, and merge at the end, similar to MapReduce.
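
A minimal sketch of that shard-and-merge shape in Java (the summing workload and shard count are placeholders, not taken from the designs linked above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedSum {
    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int shards = Runtime.getRuntime().availableProcessors();
        int chunk = (data.length + shards - 1) / shards;
        ExecutorService pool = Executors.newFixedThreadPool(shards);

        // Each shard works on its own disjoint slice: no shared writes,
        // no locks, no communication between branches of the "tree".
        List<Future<Long>> partials = new ArrayList<>();
        for (int s = 0; s < shards; s++) {
            int from = s * chunk;
            int to = Math.min(from + chunk, data.length);
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // The only synchronization is the final merge of partial results.
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();

        System.out.println(total);
    }
}
```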


> I am really interested in parallel, asynchronous, multithreading, coroutine, futures programming so it's what I spend my days thinking about and blogging about it.

That's not parallel programming, though. Parallel programming deals with the parallelization of a single sequential algorithm or program (e.g. a weather simulation) across multiple threads, CPUs or machines, usually with the requirement of real-time synchronization. When you parallelize independent tasks (async, coroutines, futures), a whole lot of problems just don't exist and others come more into focus.


What would you call what I'm working on? It is "parallel" but probably doesn't come under "parallel programming" in the literature. I'm still interested in that too, just haven't got around to it yet; it's a large space.

I'm not working on novel parallel algorithms that solve computer science problems, I am interested in parallelism as a general principle for coordinating systems activities.

When I get time to delve into parallel algorithms I'll work on that; at the moment there's a lot of work in just coordinating and scheduling parallel tasks.

I am also interested in the area of database internals, such as parallel query execution, concurrency control, and wait-free/lock-free algorithms.

I feel it is a complicated and wide space, and we deceive ourselves if we think we know everything; that's hubris, which I expressly try to avoid.


What you're looking at is pretty much concurrent programming.

https://en.wikipedia.org/wiki/Concurrent_computing


“Interaction nets” might be what you are looking for. I believe there were a couple of Hacker News threads about them.


My answers to the two questions in the post title: "Yes" and "Read Concurrent Data Processing in Elixir [1]"

I have dealt with concurrency in PHP (good times with Laravel Horizon) in a couple of different ways and in NodeJS (never feels good, to me). The BEAM has some great primitives for concurrency and the book mentioned above walks the reader through them at a good pace. I read it cover to cover and it really enriched my mental model of how to use concurrency and async.

[1] https://pragprog.com/titles/sgdpelixir/concurrent-data-proce...


Concurrency is not the same as parallelism. You can have single threaded concurrent programs.


Oh yes, good call! Still, was a good book...


An impressive work; it has gained yet another 40 pages since the Dec. 2021 edition.


Having seen this years ago, my favorite part is still the comic that says "do I out things of Look! can order." in section 15.1.1 "Why Hardware Misordering?".


I wish it included a big section on GPU programming.

Still a great read for people primarily interested in GPGPU, though, since a lot of the knowledge transfers well.


It's not, if you are ready to copy your data around a lot, aka functional programming.


Wow


Hey guys look, it's Owen Wilson.


Only 2 ways into parallelism:

- Arrays of 64 byte Structures with C. (client)

- Java. (server)

Everything else is not atomic.


This makes no sense.

- Arrays of 64 byte Structures with C. (client)

All you need for parallelism is to spawn threads and give each thread a range of an array to work on. You don't need each element to be atomic because you aren't synchronizing on each element.
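
For example, a plain-threads sketch along those lines (written in Java here, but the point is language-agnostic) needs no atomics at all, because each worker only touches its own slice:

```java
public class RangePerThread {
    public static void main(String[] args) throws InterruptedException {
        double[] data = new double[1 << 20];
        int nThreads = 4;
        int chunk = data.length / nThreads;
        Thread[] workers = new Thread[nThreads];

        for (int t = 0; t < nThreads; t++) {
            int from = t * chunk;
            int to = (t == nThreads - 1) ? data.length : from + chunk;
            workers[t] = new Thread(() -> {
                // Writes stay inside this thread's disjoint range, so
                // there is nothing to synchronize per element.
                for (int i = from; i < to; i++) {
                    data[i] = Math.sqrt(i);
                }
            });
            workers[t].start();
        }

        for (Thread w : workers) w.join();  // the only synchronization point
        System.out.println(data[data.length - 1]);
    }
}
```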

What makes you think you need specific data and a specific language?

- Java. (server)

This makes even less sense. A language doesn't matter and how it's used doesn't matter.

Where did you get this idea? Can you link something?

Everything else is not atomic.

I think you don't understand what parallelism means. Read about OpenMP; it is very simple.


> Is Parallel Programming Hard, and, If So, What Can You Do About It?

I confess I didn't read TFA, but the title made me think of something that hopefully some people here are well placed to take further. In a programming language I use there is an "each" function. For example, in pseudo-Java, given a function f(int i) and a List<Integer>, result = f each list runs f over each integer. To make this multi-threaded you can simply say f peach listOfNumbers. Today in Java you can do something similar using streams: listOfNumbers.parallelStream().forEach(). But for anyone here creating a new language, I would suggest parallelising the language's primitive forEach loop by default. Given the number of CPUs, and that the machine is better at selecting the optimal threading distribution, I think that would be the better choice. This seems a small, simple change, but the hard part about parallel programming is often getting people to use it. By making it totally invisible, it's "easy".
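
In current Java the difference really is just one word, as this minimal sketch shows (class and function names invented):

```java
import java.util.List;

public class Peach {
    static void f(int i) {
        System.out.println(Thread.currentThread().getName() + " -> " + (i * i));
    }

    public static void main(String[] args) {
        List<Integer> listOfNumbers = List.of(1, 2, 3, 4, 5, 6, 7, 8);

        // "each": plain sequential iteration
        listOfNumbers.forEach(Peach::f);

        // "peach": the same loop, fanned out over the common ForkJoinPool
        listOfNumbers.parallelStream().forEach(Peach::f);
    }
}
```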


I don't know of any programming language that does that as a language-first primitive. But this is basically slapping a `#pragma omp parallel for` before your C for loop. In Rust there is the crate `rayon`, which takes advantage of the language-level trait system to "augment" any (existing) iterable with a `par_iter()` method that effectively does what you describe: with the rayon prelude in scope, you can replace `bar.iter().for_each(...)` with `bar.par_iter().for_each(...)` to parallelize your iterations.

I suspect any language wanting to offer this as the default behavior for `for` loops will end up being a language like Java, Erlang, or Smalltalk - the language spec includes a "virtual" runtime that gives it such flexibility in behavior, and the capacity to "create/run threads/processes" that otherwise requires an OS (as in, you start leaving the perimeter of a programming language).


It's only easy if the items that are iterated over are entirely self-contained, and the code in the loop body also does not need (write) access to any other state that might be shared between threads. Rust has such limitations built into the language, but this is also one of the points that make Rust "difficult" (because you actually need to think first about how you organize your data to be "multi-threading-compatible", and that's the hard part, not adding a parallel for-each to a language).


That’s only safety; it still does not guarantee that performance will improve.


A built-in parallel foreach can't guarantee better performance because the hardware isn't there. The parallel part is scheduled on normal threads by the kernel and this is too unpredictable.

But if each OS-scheduled CPU were really a cluster of, for instance, 1 fast + 64 slow cores, then it could have special instructions to split a low-level loop across these slow cores and to facilitate blocking and rejoining back into a single stream. A 'parallel branch' instruction could take an index, a limit, and an estimated instruction count, and the CPU itself could decide whether to run it in a single fast stream or in parallel on slow cores.

There are some technical challenges, like the size of the register file (so probably uninterruptible), but this may be where we're going, since there are diminishing returns from making a single CPU faster and there's a lot of opportunity for parallel processing that GPUs are too clumsy for.


That's pretty much the usual GPU+CPU workflow... only you would have fewer bandwidth and latency issues.

Still, deciding where to run such things is not trivial; the time it takes to decide which would be faster could actually be more than the time it takes to just run it.


So instead we would have a ridiculous number of problems with caches and scheduling. I bet most of the time the sequential code would be “faster”, due to the parallel overheads, the cache misses, and scheduling that obviously can’t fit every loop scenario.

And this is assuming that the code in the loop is thread-safe; otherwise you will summon every concurrency bug on the planet.



