The Trouble with Multicore

David Patterson has written a nice article about the advent of multicore processors.

Patterson is right that multicore systems are hard to program, but that isn’t the biggest problem with multicore processor chips.  The real problem is architectural: their memory bandwidth doesn’t scale with the number of cores.
I’ve been programming multiprocessor systems since 1985 or so.  At the Digital Systems Research Center we built a series of multiprocessor workstations with up to 11 VAX cores. Later I worked at SiCortex where we built machines with up to 5832 cores.
By the way.  I know that people say “core” when they mean “processor” and they say “processor” when they mean “chip”.  I find this confusing.  My answer is to use “chip” and “core” and avoid the overloaded “processor”.
At Digital, we thought multiple threads in a shared memory were the right way to code parallel programs.  We were wrong.  Threads and shared memory are a seductive and evil idea.  The problem is that it is nearly impossible to write correct programs.  There are people who can, like Leslie Lamport, Butler Lampson, and Maurice Herlihy, but they can only explain what they do to people who are almost as smart as they.  This leaves the rest of us on the other side of a large chasm.  Worse, some of us think we understand it, and the result is a large pool of programs that work by luck, typically only on the platform that the programmer happens to have on their desk.
Threads and locks are a failed model.  Their proponents have had 25 years to explain it, and it is too hard.  Let us try something else.
The something else is distributed memory – a cluster.  Lots of cores, each with its own memory, connected by a fast network.  This model, in the last 15 years, has been spectacularly successful in the High Performance Computing community.  Ordinary scientists and engineers manage to write useful parallel programs using programming models like MPI and OpenMP without necessarily being wizard programmers, although that does help. Distributed memory parallel programs tend to be easier to write and easier to get right than the complicated mess of threads and locks from the SMP (symmetric multiprocessing) community.
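To make the style concrete, here is a minimal sketch in Python of a scatter-compute-gather pattern of the kind MPI programs use.  It is an illustration only, under loud assumptions: the names worker and parallel_sum_of_squares are hypothetical, and ordinary threads with queue.Queue stand in for cluster nodes and the network so that the example runs on one machine.  The point is the discipline, not the transport: each worker touches only the data it was sent, and nothing is locked by the programmer.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """One stand-in 'node': it owns its data privately and talks only via messages."""
    chunk = inbox.get()                      # receive this node's partition
    partial = sum(x * x for x in chunk)      # compute on private data, no locks
    outbox.put((rank, partial))              # send the partial result back


def parallel_sum_of_squares(data, num_workers=4):
    """Scatter slices of data to workers, then gather and reduce their results."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Scatter: each worker gets its own copy of a slice, never a shared view.
    for rank in range(num_workers):
        inboxes[rank].put(list(data[rank::num_workers]))

    # Gather and reduce: results may arrive in any order, which is fine for a sum.
    total = sum(results.get()[1] for _ in range(num_workers))

    for t in threads:
        t.join()
    return total


total = parallel_sum_of_squares(list(range(1000)))
```

On a real cluster the queues would be replaced by message-passing calls between separate processes on separate nodes, but the shape of the program stays the same.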
The other huge advantage of clusters over shared memory machines is that the model scales without heroics.  The memory bandwidth and memory capacity scale with the number of cores. It is possible to build fast low-latency interconnect fabrics that scale fairly well. Clusters are also a good match for programs in another way.  Patterson cites the sorry history of failed parallel processor companies, but he didn’t mention that every one of them had a unique and idiosyncratic architecture, for which programs had to be rewritten.  The lifetime of a successful application is 10 or 15 years. It cannot be rewritten for every new computer architecture, or every new company that comes along.  The processing model for clusters has not required so much rewriting.  A cluster that runs Unix or Linux, and supports C, Fortran, and MPI, can run the existing applications.
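The bandwidth argument is simple arithmetic, and a back-of-the-envelope sketch may help.  Everything numeric here is a made-up assumption for illustration: a memory interface of 100 GB/s per chip or per node, 4 cores per cluster node, and an idealized model in which a chip's memory bandwidth stays fixed as cores are added.  The function names are hypothetical too.

```python
def shared_memory_per_core(total_cores, chip_bw=100.0):
    """One chip, one memory interface: every added core dilutes the share."""
    return chip_bw / total_cores


def cluster_per_core(total_cores, cores_per_node=4, node_bw=100.0):
    """Each node brings its own memory, so aggregate bandwidth grows with node count.

    Assumes total_cores is a multiple of cores_per_node.
    """
    nodes = total_cores // cores_per_node
    return nodes * node_bw / total_cores


# Per-core bandwidth in GB/s under the assumed numbers above.
comparison = {
    cores: (shared_memory_per_core(cores), cluster_per_core(cores))
    for cores in (4, 16, 64)
}
```

Under these assumptions the shared-memory share falls from 25 GB/s per core at 4 cores to about 1.6 GB/s at 64 cores, while the cluster stays at 25 GB/s per core however many nodes are added.  Real machines are messier (caches, interconnect traffic), so treat this as the shape of the argument rather than a measurement.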
So my modest suggestion to Intel is to not bother with larger SMP multicore chips. No one knows how to program them.  Instead, give us larger and larger clusters on a chip, with good communications off-chip so we can build super-clusters.  Don’t bother with shared memory; it is hard anyway. Give us distributed memory that scales with the number of cores. I too am waiting for breakthroughs in programming models that will let the rest of us more easily write parallel programs, but we already have a model that works just fine for systems up to around 1000 cores.  No need to rewrite the world.
Side notes:
Someone is going to point out how wonderful shared memory is, because you can communicate by simply passing a reference.  Um.  The underlying hardware is going to copy the data anyway to move it to the cache of a different core.  If you just do the copy, you are not really doing much extra work, and you get the performance benefits of not having to lock everything.
Yes, I dislike MPI.  Its chief benefit is that it actually works really well, at least as long as you stick to a reasonable subset.  I really like SHMEM, and would prefer even a subset of that.
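The first side note, about copying instead of passing a reference, can be sketched in a few lines of Python.  This is an illustration under assumptions, not a benchmark: the send helper and the mailbox queue are hypothetical names, and copy.deepcopy stands in for whatever copy the transport would make.  Once the receiver holds its own copy, sender and receiver can both keep modifying their data with no locks, because nothing is shared.

```python
import copy
import queue

mailbox = queue.Queue()  # stand-in for the communication channel


def send(message):
    """Send by value: the receiver gets its own private copy of the message."""
    mailbox.put(copy.deepcopy(message))


sender_state = {"step": 1, "values": [1, 2, 3]}
send(sender_state)

# The sender keeps working on its own data without coordinating with anyone.
sender_state["step"] = 2
sender_state["values"].append(4)

# The receiver sees the data exactly as it was sent and may change it freely.
received = mailbox.get()
received["values"].append(99)
```

Had the sender passed a reference instead, both sides would be appending to the same list, and correctness would depend on locking discipline.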