Another thing not to do

At the day job, I’ve been writing a new version of nbd-client.  Instead of handing an open TCP socket to the kernel, it hands the kernel one end of a Unix domain socket and keeps the other end for itself.  This creates a block device whose data is managed by a user-mode program on the same system.
In regular nbd-client, the last thing the program does is call ioctl(fd, NBD_DO_IT), which doesn’t return.  The thread is used by the device driver to read and write the socket without blocking other activity in the block layer.
Because I need the program around to do other work, I called pthread_create to make a thread to call the ioctl.
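In outline, the setup looks something like this (a sketch with error handling omitted; start_nbd and do_it_thread are my names, though NBD_SET_SOCK and NBD_DO_IT are the real ioctls):

#include <fcntl.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/nbd.h>

static int nbd_fd;

static void *do_it_thread(void *arg)
{
    (void)arg;
    /* NBD_DO_IT blocks for the life of the device; the kernel
       borrows this thread to move data through the socket. */
    ioctl(nbd_fd, NBD_DO_IT);
    return NULL;
}

static void start_nbd(void)
{
    int sv[2];
    pthread_t tid;

    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    nbd_fd = open("/dev/nbd0", O_RDWR);
    ioctl(nbd_fd, NBD_SET_SOCK, sv[0]);    /* the kernel gets one end */
    pthread_create(&tid, NULL, do_it_thread, NULL);
    /* sv[1] stays with this program, which serves the block data */
}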
Then I ran my program under gdb (as root!).
In another window, I typed dd if=/dev/nbd0 bs=4096 count=1
In the gdb window I saw
nbd-userland.c:525: server_read_fn: Assertion `0' failed.
and my dd hung, and the gdb hung, and neither could be killed by ^C
I was able to get control back by using the usual big hammer, kill -9 <gdb>
So what happened?  My user-mode thread hit an assertion and gave control to gdb, which tried to halt the other threads in the process.  That didn’t work, because the thread inside the ioctl was in the middle of something uninterruptible, and the gdb thread trying to stop it became uninterruptible too while waiting.
It is going to be hard to debug this program like this.
The fix, however, is fairly clear:  use fork(2) instead of pthread_create() to create a separate process to call the ioctl.  It will be isolated from the part of the program hitting the assertion.
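Something like this instead (again a sketch; in real code you would also reap the child):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nbd.h>

static void start_do_it_process(int nbd_fd)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the kernel borrows this whole process, so an
           assertion failure (or gdb) in the parent can no longer
           wedge it. */
        ioctl(nbd_fd, NBD_DO_IT);
        _exit(0);
    }
    /* Parent returns and keeps serving its end of the socketpair. */
}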
Older and wiser,
Larry
By the way, when you are trying to figure out where processes are stuck, look at the “wchan” field of ps axl.  It will be a kernel symbol that will give you a clue about what the thread is waiting for.
UPDATE
Experience is what lets you recognize a mistake when you make it again.
The underlying bug was sending too much data on the wire.  Like this:
struct network_request_header {
    uint64_t offset;
    uint32_t size;
};
write(fd, net_request, sizeof(struct network_request_header));
Well, no.  sizeof(struct network_request_header) turns out to be 16, rather than, say, 12.  If you think about it, this makes perfect sense, because otherwise an array of these things would have unaligned uint64_t’s every other time.  You can’t do network I/O this way, especially if the program on the other end uses a different language or different compiler.
gcc, it turns out, has a feature, __attribute__((packed)), that makes this work, but it is not portable to other compilers.
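The portable fix is to marshal the fields explicitly, so the wire format never depends on struct padding.  Something like this (a sketch; send_request_header is my name, and htobe64/htobe32 are glibc extensions from <endian.h>):

#include <endian.h>     /* htobe64, htobe32 (glibc) */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int send_request_header(int fd, uint64_t offset, uint32_t size)
{
    unsigned char buf[12];           /* exactly 12 bytes on the wire */
    uint64_t off_be = htobe64(offset);
    uint32_t size_be = htobe32(size);

    memcpy(buf, &off_be, 8);         /* network byte order, no padding */
    memcpy(buf + 8, &size_be, 4);
    return write(fd, buf, sizeof buf) == sizeof buf ? 0 : -1;
}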

Home Networking Troubleshooting

Sometimes a technological scramble is triggered by the most mundane events.  In this case, the season finale of “X Factor”.
Last night, there was a special church choir rehearsal for the Christmas Eve services, and all seven of Win’s and my kids went.  Since the rehearsal would overlap the broadcast finale of X Factor, Erica asked Win to record it.  Maybe the appearance of 1 Direction had something to do with it as well.
We used to have Replay TVs to solve things like this, and cable TV to deliver the bits, but the conversion to digital TV and the crazy anti-customer behavior of Comcast has changed all that.  We don’t get cable, and the TV is hooked up to an antenna.  We’ve also got a Silicon Dust HDHomeRun network tuner connected to the antenna on my front porch, so we can watch TV on any computer as well.  Win has the copy of EyeTV that came with the HDHomeRun, and he planned to record the show.
About an hour before air time, he called to ask me about video artifacts and bad audio.   I said I’d take a look.
I used hdhomerunner (a now lost Google Code project to develop an open source HDHomeRun control program) and directed the video to VLC running on my Macbook Pro.  Indeed, the video was blocky and the audio spotty.
I power cycled the HDHomeRun, replaced the ethernet cable, and plugged it into a different switch port on the 16-port gigE switch.  No change.  I looked for firmware upgrades, and found the device running 4-year old firmware.  The upgrade went smoothly, but there was no change in video quality.
After sitting and swiveling back and forth for a while, I went back downstairs and plugged the device into the 100 Mbps switch instead of the 1000 Mbps switch.  I had some vague memory that the negotiation doesn’t always work right.  This fixed the problem and I was able to watch good video and audio with VLC.
Win called back to report his video was still breaking up.  This suggested some other networking problem between the houses.
Background.  Win and I are neighbors, and we have a conduit between the houses with a couple of outdoor-rated Cat 5 cables and a 6-fiber multimode cable.  One pair of fibers is connected to 1000BASE-SX media converters at the two ends and plugged into the house gigE switches.
I remembered once setting up netperf on the home servers, and indeed it was still installed.  Win’s house to mine reported 918 Mbps, but mine to Win’s reported 16! At this point, there wasn’t much time to debug the networking, and X Factor was about to start.
I remembered that VLC can record an input video stream, and set that up to record the program on my Macbook.  (I had 45 GB free on disk, and the program was running at 2 Megabytes/second, so it would take about 14 GB for the two hours.  No doubt there is a way to transcode, but not enough time to learn how to do it!)
The VLC recording froze once, at about the one hour point, but I only missed a couple of minutes.  I copied the files to an external USB drive for sneakernet delivery.
This morning, Win and I started taking a look at the networking.  First, we got netperf running on our respective Macbook and iMacs, in order to figure out if the link was bad or one of the home servers.  I was able to talk both ways to my server at about 600 Mbps, and Win to his at about 95 Mbps.  Win’s results are explained by a fast Ethernet hop somewhere, but all these rates are way above the pitiful 16 Mbps across the fiber.
Next Win wiggled his connectors, dropping the path to about 6 Mbps.  We swapped the transmit and receive fibers at both ends, and the direction of the problem did not change.  It was looking more and more like a bad media converter.
I was staring at the wiring in my basement, wondering if we could use the copper link as backup while waiting for parts.  It never worked very well, but we did use it to cross connect our DMZs before the firewalls at one point.  I found the cable, and found it plugged into the ethernet switch on the back of my FIOS router – with LINK active!  Huh?  What was it plugged into at Win’s end?  He reported it plugged into a small switch, but that it wasn’t easy to tell what else was plugged in.
As an experiment, we unplugged the copper link and … Win lost Internet access.  Evidently (a) his routes were set to use the Serissa business FIOS rather than his home Comcast, and (b) the traffic was going over this moldy waterlogged Cat 5 instead of our supposedly shiny gigabit fiber.  Now the gears are turning.  If we did have a loop in the switch topology, then it was entirely possible that one direction between the houses would use the fiber while the other direction would use the copper.  I don’t know much about how these cheap switches figure out things like that.  We tried unplugging the fiber, forcing all traffic onto copper, but the netperf results were much worse.  ping seemed to work, and ping -s 1000 gave fairly good results, but ping -s 1500 had a lot of trouble with the larger packets.  That would explain why, generally, ping and ssh seemed to work but netperf gave bad results.
We unplugged the copper and plugged the fiber back in, and after a few seconds, the asymmetrical performance resumed.  I’ve placed an order for another media converter, and we’ll see if that fixes it.  At least they now cost half as much as when we got the first pair!
So, there was a lot going on here.
The HDHomeRun was plugged into a gigabit switch, and working poorly.  Changing to fast Ethernet fixed that.
The topology loop was routing off-site traffic over a poor copper link, but it was working well enough that we didn’t notice.
The media converter is probably bad, working well in one direction but not the other, and that probably explains the remaining poor video quality.
And Erica gets to watch 1 Direction.
How are just plain folks supposed to figure this stuff out?
UPDATE
The new media converter arrived… and didn’t fix the problem.  Well, we have a spare now!  The actual problem was a bad 8-port switch in Win’s basement, which we belatedly figured out once we ruled out the fiber.  We could have tested the link standalone by plugging computers into both ends, but we didn’t think of it.  Does gigE need crossover cables to do that?  Or does the magic echo cancellation make crossover cables unnecessary?

A Debugging Story

I’ve been working on fos at MIT CSAIL in recent months.  fos is a factored operating system, in which the parts of the OS communicate by sending messages to each other, rather than through shared memory with locks and traps and so forth.  The idea of fos is to make an OS for manycore chips that is more scalable than existing systems.  It also permits system services to be elastic – to grow and shrink with demand, and it permits the OS to span more than one box, if you want.
The fos messaging system has several implementations.  When you haven’t sent a message to a particular remote mailbox, you send it to the microkernel, which delivers it.  If you keep on sending messages to the same place, then the system creates a shared page between the source and destination address spaces and messages can flow in user mode, which is faster.  Messages that cross machine boundaries are handled by TCP/IP between proxy servers on each end.
I’ve been making the messaging system a bit more object oriented, so that in particular you can have multiple implementations of the user-space shared-page message transport, with different properties.  After I got this to pass the regression tests, I checked it in and went on to other stuff.
Charles Gruenwald, one of the grad students, started using my code in the network stack, as part of a project to eliminate multiple copies of messages.  (I had added iovec support, which makes it easier to prepend headers to messages.)  His tests were hanging.  Charles was kind enough to give me a repeatable test case, so I was able to find two bugs.  (And yes, I need to fix the regression tests so that they would have found these!)
Fine.
Next, Chris Johnson, another one of the grad students, picked up changes from Charles (and me) and his test program for memcached started to break.
All the above is just the setup.  Chris and I spent about two days tracking this down…
Memcached is a multithreaded application that listens for IP connections, stores data, and gives it back later.  It is used by some large scale websites like facebook.com to cache results that would be expensive to recompute.
When a client sends a data object to memcached for storage, memcached replies on the TCP connection with “STORED\r\n”.  On occasion, this 8 character message would get back to Chris’s client as “”, namely all binary 0’s.  Since the git commits between working and not working were associated with my messaging code and the new iovec support, it seemed pretty likely that the problem was there.  However, the problem occurred with <both> of the new implementations of shared page messaging, so it couldn’t really be anything unique to one or the other.  That left changes in the common code or in the iovec machinery.
Now fos is a research OS, and is somewhat lacking in modern conveniences, such as a debugger, even for library code in user mode.  However, we have printf, and all the sources.
First, we added…  When I say “we” I really mean Chris, because he is a vi/cscope user, and I am emacs/etags.  I think he types faster too.
First we added a strncmp(“STORED”…) inside the message libraries to locate the case.  When the string matched, we set a new global variable to indicate a case of interest.  We couldn’t add printf to all the messaging code, because it is used all over the place by many system services; there would be too much output and general flakiness.  Now, with the new global, we could effectively trace down into the messaging libraries, watching the “STORED” go by and printing if it disappeared…. which it did.
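In outline, the hack looked something like this (a sketch; the names are invented):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

int trace_this_message;    /* the new global: set when "STORED" goes by */

/* in the messaging send path: */
static void note_interesting(const void *buf, size_t len)
{
    if (len >= 6 && strncmp(buf, "STORED", 6) == 0)
        trace_this_message = 1;
}

/* deeper in the libraries, printfs guarded by the flag: */
static void trace_point(const char *where, const void *buf, size_t len)
{
    if (trace_this_message)
        printf("%s: %p len %zu starts %.8s\n",
               where, buf, len, (const char *)buf);
}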
However, we got lots of disappearance messages, many due to other messages being sent.  Since we also suspected the iovec machinery, we added printfs to print the number and sizes of the iovecs, and their contents.  One of the places we came across was in the fos dispatch library, which is an RPC mechanism that prepends a header on an existing message.  The iovec form of this does something like
struct iovec new_iovec[in_iovcnt + 1];
to allocate a variable length array on the stack. Now this is a feature added to the C language as part of ISO C99, and supported in GCC in C90 or C99 mode, but it makes me nervous.  Just in case, we changed the declaration to
struct iovec new_iovec[10];
but it made no difference.
Eventually we found that the “STORED” was there on entry to a function called “sendData”, but had vanished before the sending.  And there were no references to the buffer in the interim.  This suggested that someone was using a pointer after freeing it, that the space had been reallocated to our data buffer, and that the buffer was then clobbered by someone else.  All that separated the “STORED” from the “” was a check of the fos name cache to see if the destination mailbox was still valid.  More printfs established that the data vanished in exactly the case that the name cache entry had expired, requiring a fos message send to the name server to get a refreshed copy.
A search of the name server library revealed no obvious problem, but there was storage allocation in there, which might be relevant, if in fact the heap had gotten scrambled.
Overnight, I looked at all uses of malloc and free in the messaging library and they all seemed OK, but I thought this was an unlikely idea anyway because the failure happened with both implementations of shared page messaging.
This morning Chris and I had the idea of printing the region around the “STORED” to try to figure out whether only our data was changed or whether the change covered some larger area.  This was difficult to tell, because the local region of memory was mostly 0’s already.  There was an ASCII string, “suffix”, a little before our buffer that was also clobbered.  We didn’t know what that was, but cscoping and grepping through the entire source tree located it as a name attached to a memcached data structure.  It came to be near the “STORED” because memcached did a strdup of a procedure argument, which malloc’d space for the string out of the same general area of the heap.  This clue meant that a larger region of the heap was being clobbered, but we still didn’t know how much.
One aspect, incidentally, of this whole affair was that the problem always happened at the same virtual address: 0x709080.  No idea why, but having a stable address makes it much easier to track.
Next, Chris added code to fill the 1024 bytes centered on 0x709080 with 0xFF, and printed what it looked like after the disappearance.  Now this is just gutsy.  We had no idea what data was there, or who was using it, and we just overwrote it with the 0xFF pattern, hoping the system would survive long enough to print the “after” pattern.  In fact it crashed immediately, but by changing the size of the 0xFF region, we learned that the clobber affected exactly 136 bytes, all 0’s except the first, which was 0x20.
Well, 136 is an odd size.  We grepped the whole code base looking for any 136s, but did not find any.
Next, we wondered if the clobber might be made by someone calling memcpy or memset. Since the address was stable, we were able to add code to the memcpy library routine something like this:
if ((uintptr_t)dst <= 0x709080 && (uintptr_t)dst + size > 0x709080)
    printf("memcpy(%p, %p, %zu)\n", dst, src, size);
But we didn’t get any printfs <at all>… including from our own initialization of that space.  We realized that gcc includes an “intrinsic” implementation of memcpy, which it will use when the actual arguments make it convenient, such as knowledge that the pointers are 8 byte aligned and the length is a constant, or the like.  Now it is possible to turn off the compiler intrinsic by using the -fno-builtin flag to the compiler, so we dug into the fos Makefiles to add this to CFLAGS.
Now we got printfs from memcpy, and a nearly immediate page fault caused by running out of stack space.  It turns out that some variants of printf call memcpy internally, and we had managed a recursive loop.  We also got way too much printout, because we had added printing to the library copy of memcpy, used by all applications and services.  We got out of that by having the memcpy test code check the magic global variable to see if we were inside the code region of interest, as well as a second magic variable set only in the memcached application.  We also added a call to print the return address of the caller of memcpy, so we could identify who was making the call.
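Roughly the shape of the instrumented memcpy (a sketch; the two globals are the magic variables just described, and the reentrancy guard is my own embellishment to dodge the printf recursion):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

extern int trace_this_message;   /* set when "STORED" was seen */
extern int in_memcached;         /* set only by the memcached app */
static int in_trace;             /* don't recurse if printf calls memcpy */

void *memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (trace_this_message && in_memcached && !in_trace
        && (uintptr_t)dst <= 0x709080
        && (uintptr_t)dst + n > 0x709080) {
        in_trace = 1;
        printf("memcpy(%p, %p, %zu) called from %p\n",
               dst, src, n, __builtin_return_address(0));
        in_trace = 0;
    }
    while (n--)
        *d++ = *s++;
    return dst;
}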
We didn’t find any useful memcpy calls, so we added the same logic to memset.
Widening the test for addresses to cover the entire page containing 0x709080, we found two 8 byte memset calls to the region right before 0x709080, but not including it.  These calls came from inside the libevent library used by memcached to dispatch work.  libevent was preparing a call to select(2).  The nearby code was realloc’ing the file descriptor bit masks and then using memset to zero them out before calling select.  This seemed unrelated to our bug, since the memsets didn’t overlap our “STORED” buffer.
Now what?  This could be a storage allocator usage problem, with someone using heap storage after calling free on it, or it could be a buffer overflow problem, with someone writing off the end of an array, but these things are difficult to find.  We thought about replacing malloc with one that carefully checked for some error cases, by putting sentinels around allocated storage.  Even worse, the problem could be that the page of memory had become shared with some other address space, at entirely different virtual addresses.  After all, the suspect messaging code does things like that.
Someone said, “If we had a debugger, we could just use a watchpoint”.  A watchpoint is a way of saying “let me know when this memory location is changed”.  But we had no debugger.  I thought, well, these x86 processors we are using have hardware to support watchpoints; how does it work?
Some work with Google and the Linux kernel cross-reference website showed that gdb implements watchpoints by using the Linux ptrace system call, which in turn, through some elaborate machinery, eventually sets some debug registers deep in the x86 processor chip.  At that point, once any program touches the watched location, the chip generates a debug interrupt, at which point the OS returns control to gdb, letting it explain to the user what happened.
Now we didn’t have gdb, and fos doesn’t have ptrace, and we’re not even running directly on x86 hardware, we’re running inside of a Xen virtual machine hosted by a Linux OS, but how hard could it be?
We decided to implement support for hardware watchpoints in fos.
We added a new system call “set debug register”, with no security whatever.  The user program just does this new syscall, passing raw bit values for the debug register.  The microkernel takes the argument, and calls HYPERVISOR_set_debugreg(), which Xen thoughtfully supplies to do the heavy lifting.  We added a second system call to read back the register.
A careful reading of the fos interrupt handlers seemed to say that the debug interrupt, while not expected to be used, did have a default handler in place that would print the machine registers and then crash.
Now, we called this new function to set a hardware watchpoint at 0x709080, and another to turn on the watchpoint control register.  Nothing happened.  We read back the registers, and they seemed to be set to the right bit values, according to Wikipedia (and the Intel x86_64 processor reference manual).  Now this could happen because we got the code wrong, or because Xen didn’t in fact implement this functionality, or who knows.  So we added another call to memcpy to overwrite the “STORED” ourselves, and we got an immediate crash dump.
This meant that the mechanism was working, but it wasn’t finding the clobber.  That probably meant that whoever was doing the clobber was running on a different processor core, and each core has its own debug registers.
Now the right way to handle this is for the set_debugreg system call to send messages to all the other cores on the machine to set their debug registers, using inter-processor interrupts.  fos doesn’t have any IPI, and in fact has no way to communicate to different cores in the microkernel.  The only place that needs to do this is the scheduler, which works by locking and then enqueuing processes onto the scheduler data structures of other cores.  No help to us.
But, all cores are running timer interrupts!  So inside our “set debug register” system call, we copied the arguments into microkernel global variables, and set up an array of flags, one per core.  The system call set all the flags to “true”.  Now in the timer interrupt, every core would check the flag for itself, and if set, copy the values in the global to the core’s local debug registers, then clear the flag.
The system call would spin on the flags until they were all clear again, then return to user mode.  This is a really hacky way of having all cores load the 0x709080 into their debug registers at the right moment.
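In outline (a sketch; MAX_CORES, num_cores, and the hook name are mine, and HYPERVISOR_set_debugreg is the Xen call mentioned above):

#define MAX_CORES 64    /* an assumption; fos knows its real core count */

extern long HYPERVISOR_set_debugreg(int reg, unsigned long value);
extern int num_cores;

static unsigned long dbg_addr, dbg_ctrl;    /* values every core loads */
static volatile int dbg_pending[MAX_CORES];

long sys_set_debugreg(unsigned long addr, unsigned long ctrl)
{
    int i;

    dbg_addr = addr;    /* park the values in microkernel globals */
    dbg_ctrl = ctrl;
    for (i = 0; i < num_cores; i++)
        dbg_pending[i] = 1;
    for (i = 0; i < num_cores; i++)    /* spin until every core's */
        while (dbg_pending[i])         /* timer tick has run */
            ;
    return 0;
}

/* called from the timer interrupt, on whichever core it fires on */
void timer_tick_debug_hook(int core)
{
    if (dbg_pending[core]) {
        HYPERVISOR_set_debugreg(0, dbg_addr);    /* DR0: watched address */
        HYPERVISOR_set_debugreg(7, dbg_ctrl);    /* DR7: enable bits */
        dbg_pending[core] = 0;
    }
}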
Now this was a little bit of a hail mary. The x86 debug registers work by virtual addresses, so if the clobber were happening because the page was shared, and shared with a different VA, then we would not catch it.
But we did!  We ran the test, which waited until the “STORED” was there, then set the debug registers for 0x709080, and proceeded.  We got a crash dump, and the return address on the exception stack was… libevent’s implementation of support for select(2), running in memcached, but in a different thread, on a different processor core than the thread sending “STORED”.
Now all we had was the program counter.  We could identify the function by using “nm” to print the symbol table for the memcached executable, but getting to the source line of code is harder.  We found useful switches in objdump, -d -S, which print a disassembled listing of the binary executable code, interspersed with the source code, provided the file was compiled with the -g flag.  That took another spin through the fos Makefiles, which were using -g3, evidently a slightly different version of -g that is not compatible with objdump.  Now we were able to see the offending source line as…
FD_ZERO(readset);
or similar.  This is code that is zeroing the file descriptor bit vector about to be used in a call to select.  This was not found by our instrumentation of memset because FD_ZERO was still apparently using a compiler intrinsic, just a straight-line set of movq instructions to zero 128 bytes, in the middle of which was our “STORED” buffer.  I’m not sure if -fno-builtin didn’t work for this, or it was controlled by a different makefile’s CFLAGS, or what.
… FD_ZERO was zeroing 128 bytes of a buffer that had recently been allocated with only 8 bytes of memory.
Now here is another bit of Unix/Linux history, I think.  When select was first defined, I think by the BSD folks at UC Berkeley, the sizes of the file descriptor bitmaps were variable, and needed to be only large enough to hold the maximum number of file descriptors under consideration.  At some point Linux, blessed later by the POSIX standards committee, made the size of select descriptor arrays fixed, with a system-specific constant.  In our case, the version of libevent we had was BSD derived, with variable size descriptor arrays, but it was calling into a select implementation that was POSIX derived and expected the (larger) fixed size.
Incidentally, the 136 byte clobber was also now explained: the select code was FD_ZEROing both the readfd and the writefd arrays, which were 8 bytes apart in memory, leading to two overlapping 128 byte clobbers covering 136 bytes in all.
The fix for this bug was updating the libevent select client to use fixed size descriptor arrays.  This bug had nothing at all to do with the iovec or messaging code.  We just happened to run into it there because of the chance coincidence of our messaging buffer containing “STORED” being allocated right after the select descriptor arrays that were too short.
-Larry
Followup:  My colleague Matteo Frigo reports:

FD_ZERO is written in assembly (the most misguided "optimization" ever?):
[from <bits/select.h> on glibc/amd64:]
# define __FD_ZERO(fdsp) \
  do {                                                                      \
    int __d0, __d1;                                                         \
    __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS                       \
                          : "=c" (__d0), "=D" (__d1)                        \
                          : "a" (0), "0" (sizeof (fd_set)                   \
                                          / sizeof (__fd_mask)),            \
                            "1" (&__FDS_BITS (fdsp)[0])                     \
                          : "memory");                                      \
  } while (0)


Order of Operations – Evil and Pernicious

Back in November, my son came home with a 6th grade math test on which he lost a point because he put in parentheses that were not strictly necessary, according to the order of operations.
Here’s the note I sent to the math teacher:
I’ve been meaning to write about this, but not getting around to it.  I am moved to write because on Alex’s recent math test, he lost a point because he put in parentheses that were not necessary due to the order of operations.
I’m not going to argue about the grading, which is fine given the syllabus, but rather I want to express my view that teaching order of operations at all is evil and pernicious.
The only correct way to handle math is to always put in all the parentheses.  Here’s why.
In 6th grade math, the order of operations is pretty simple, multiply and divide are “stronger” than addition and subtraction.  Once you get to the rest of mathematics, and then to programming languages, the situation becomes impossible.
I hate to cite wikipedia, but this article is relevant.
http://en.wikipedia.org/wiki/Order_of_operations
Just look at the page and its examples; the visual impression alone is one of vast complexity.
It is beyond dangerous to teach these things <and expect folks to remember them>. They won’t remember the details, but <will think they know>.  Smart, capable engineers will write expressions, thinking they understand what they mean, and <they will be wrong>.
A few years ago I was working at SiCortex, and we built a custom chip with about 150 million transistors, as part of a supercomputer.  The logic is expressed in the VHDL programming language, which, like many, has a defined order of operations.  An engineer did something quite innocuous, confusing the order of operations of logical OR and bitwise AND, and in consequence the expression meant something quite different than intended.  This was caught quite by accident, but had it gone through, the cost would have been a half-million-dollar replacement chip mask and about 3 months of schedule.
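The same trap is easy to demonstrate in C, where == binds more tightly than &:

#include <stdio.h>

int main(void)
{
    int x = 4, mask = 3;

    /* Parses as x & (mask == 0), not (x & mask) == 0. */
    printf("%d\n", x & mask == 0);      /* prints 0 */
    printf("%d\n", (x & mask) == 0);    /* prints 1, as intended */
    return 0;
}

(gcc with -Wall will warn about the unparenthesized version, which says something about how often it bites.)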
I very strongly feel that order of operations is a quaint, dated idea that we really need to stop teaching and stop depending on.  If you always specify exactly what you mean by grouping operations with parentheses, you and the computer will always agree about what the math means.
This also means that putting in the parentheses, even if not needed, is a good idea: it makes the meaning of the expression clear without any risk.  This sort of care should be applauded, not penalized!
Some programming languages, like LISP, get this right – they don’t allow chained operations at all, and have no need for order of operations.  Of course they don’t even use infix operators.  In LISP, one says (+ 3 4) or (+ (* 2 4) (* 5 6)) and there is never any confusion about it.
-Larry
PS  Don’t get me started about mean, median, and mode.  After 4th grade, has anyone actually used mode?

Type conversion run wild

Many languages have the idea that if you assign a value of one type to a variable of another type, then the value will be converted to the same type as the variable.  So in C, for example
float x = 2;
converts the integer “2” to a floating point “2.0” before assignment.
So far so good.
Today I received an email with the following header field:
From: java.lang.NullPointerException@248257-web11.element115.net
This is just outstanding! My best idea of how this happened is that a function intended to return a value of type email-address instead threw an exception, which was faithfully type-converted to an email address.
I will send a reply, just to see what happens.

The Trouble with Multicore

David Patterson has written a nice article about the advent of multicore processors.

See http://spectrum.ieee.org/computing/software/the-trouble-with-multicore.  Patterson is right that multicore systems are hard to program, but that isn’t the biggest problem with multicore processor chips.  The real problem is architectural: their memory bandwidth doesn’t scale with the number of cores.
I’ve been programming multiprocessor systems since 1985 or so.  At the Digital Systems Research Center we built a series of multiprocessor workstations with up to 11 VAX cores. Later I worked at SiCortex where we built machines with up to 5832 cores.
By the way.  I know that people say “core” when they mean “processor” and they say “processor” when they mean “chip”.  I find this confusing.  My answer is to use “chip” and “core” and avoid the overloaded “processor”.
At Digital, we thought multiple threads in a shared memory were the right way to code parallel programs.  We were wrong.  Threads and shared memory are a seductive and evil idea.  The problem is that it is nearly impossible to write correct programs.  There are people who can, like Leslie Lamport, Butler Lampson, and Maurice Herlihy, but they can only explain what they do to people who are almost as smart as they are.  This leaves the rest of us on the other side of a large chasm.  Worse, some of us <think> we understand it, and the result is a large pool of programs that work by luck, typically only on the platform that the programmer happens to have on their desk.
Threads and locks are a failed model.  Their proponents have had 25 years to explain it, and it is too hard.  Let us try something else.
The something else is distributed memory – a cluster.  Lots of cores, each with its own memory, connected by a fast network.  This model, in the last 15 years, has been spectacularly successful in the High Performance Computing community.  Ordinary scientists and engineers manage to write useful parallel programs using programming models like MPI and OpenMP without necessarily being wizard programmers, although that does help. Distributed memory parallel programs tend to be easier to write and easier to get right than the complicated mess of threads and locks from the SMP (symmetric multiprocessing) community.
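For flavor, here is a minimal MPI program in this style; each rank owns its own memory, and data moves only by explicit messages.  (A sketch, not from any real application.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (nranks >= 2) {
        if (rank == 0) {
            token = 42;    /* explicit send: no locks, no sharing */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}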
The other huge advantage of clusters over shared memory machines is that the model scales without heroics.  The memory bandwidth and memory capacity scale with the number of cores.  It is possible to build fast low-latency interconnect fabrics that scale fairly well.  Clusters are <also> a good match for programs in another way.  Patterson cites the sorry history of failed parallel processor companies, but he didn’t mention that every one of them had a unique and idiosyncratic architecture, for which programs had to be rewritten.  The lifetime of a successful application is 10 or 15 years.  It cannot be rewritten for every new computer architecture, or every new company that comes along.  The programming model for clusters has not required so much rewriting.  A cluster that runs Unix or Linux, and supports C, Fortran, and MPI, can run the existing applications.
So my modest suggestion to Intel is to not bother with larger SMP multicore chips.  No one knows how to program them.  Instead, give us larger and larger clusters on a chip, with good communications off-chip so we can build super-clusters.  Don’t bother with shared memory, it is hard anyway.  Give us distributed memory that scales with the number of cores.  I too am waiting for breakthroughs in programming models that will let the rest of us more easily write parallel programs, but we already have a model that works just fine for systems up to around 1000 cores.  No need to rewrite the world.
Side notes:
Someone is going to point out how wonderful shared memory is, because you can communicate by simply passing a reference.  Um.  The underlying hardware is going to copy the data <anyway> to move it to the cache of a different core.  If you just do the copy, you are not really doing much extra work, and you get the performance benefits of not having to lock everything.
Yes, I dislike MPI.  Its chief benefit is that it actually works really well, at least as long as you stick to a reasonable subset.  I really like SHMEM, and would prefer even a subset of that.

Chuck Thacker wins ACM Turing Award

This week Chuck Thacker won the ACM Turing Award.  This is good.
The best article I’ve seen is this one from Microsoft:
Microsoft Press Release on Chuck Thacker
I had the privilege of working near Chuck at Xerox and for him at Digital, back in the day.
I started at Xerox as a grad student intern in 1977, working for Ted Strollo in the Systems Sciences Lab (home of Smalltalk, Alan Kay, Chuck Geschke, and John Warnock).  My first project was a power line carrier communications modem.  I got to <use> an Alto, which was by itself a transforming experience.  Technically this project wasn’t that interesting, but it came out well enough that I was able to talk the lab into letting me stick around.
My next project was with John Shoch on the DARPA Bay Area Packet Radio network, a network of packet radios around the San Francisco Bay Area running at 100 to 400 Kbps.  This was in 1978, mind you, a few years before WiFi.  PARC’s part of the project was to provide packet switching experience, and my part was to design the hardware to interface the Alto to the packet radios.
Rudyard Kipling said “An engineer can do for 10 cents what any fool can do for a dollar.”  Chuck Thacker, the engineer’s engineer, could do for 10 cents what a mortal engineer couldn’t do at all.  With the Alto, in the mid ’70s, that meant building a six MIPS minicomputer, 128 Kbytes of memory, 5 MB disk, and million pixel display, for $20,000 or so.  I got to know the innards of the machine fairly well, designing the BBN-1822 interface for the packet radio, and writing the microcode and device driver for it.  The Alto had extreme economy of design.  The CPU executed a 32 bit microinstruction every 170 nanoseconds, and “hyper-threaded” between 16 micro tasks.  The lowest priority task ran an emulator for whatever high level instruction set you wanted: Nova-like for BCPL, bytecodes for Mesa, and different bytecodes for Smalltalk. The other micro tasks were responsible for the Ethernet, the disk, the display, and whatever else got plugged in, like a laser printer controller, or a packet radio.
This capability let you design I/O device controllers that were much simpler than they had any right to be.  The 1822 interface turned into a couple of shift registers and a couple of PROM-based state machines, plus a modest handful of microinstructions.  (Dave Boggs of Ethernet fame taught me how to build the state machines.)
I was pretty young then, and I didn’t immediately realize how amazing this was.  I had designed things like a tape drive interface for an Interdata and a color display controller that took up entire boards, but this stuff was <tiny>.  The whole Alto was like that.  The disk controller was essentially the same, all datapath and no control.  The disk microcode would wake up once a sector and ask itself “is this the right place to start transferring data?”
Alto: A personal computer
In 1981 I graduated, and landed a full time job in the PARC Computer Science Lab.  My project was building the Etherphone, and by then Chuck was busy building the Dorado, an ECL based personal super-mini.  I started slowly picking up Chuck’s design ethic:
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away” – Antoine de Saint-Exupery
In 1984 I followed Chuck and Bob Taylor to the Systems Research Center of Digital Equipment Corporation.  We had 24×80 dumb terminals hooked up to a VAX-785 time sharing system.  This was not the same thing as a personal Dorado at all.  The first main project was the Firefly (I think the name suggestion was mine, actually), a multiprocessor workstation built out of commodity processors.  The first version used the Motorola 68010. Chuck designed the hardware.  I wrote the boot rom code.  Around then, however, Digital came out with the MicroVAX chip, and we immediately started a redesign. The Firefly used a coherent memory system we called the “snoopy cache”, where each processor “snooped” on the bus traffic of the others to maintain a consistent view of memory.  This scheme and its variations became the standard way to build small scale multiprocessors.  I designed the MicroVAX CPU modules, Chuck designed the memory system and the display controller.  The display controller was another minimalist creation – replacing the “standard” Digital display controller with one that did more and took half the space.  He also threw in audio I/O, with, I think, two extra chips and some microcode.  A typical Thacker design element for the CPU modules was his choice of a two-phase non-overlapping clock system. This let us use Earle latches implemented in 15 nanosecond 16L8 PALs for all the control logic, without needing any edge triggered registers or causing much heartache about timing.
Firefly: a multiprocessor workstation
After the Firefly, Chuck turned to networking, building, in 1987 or so, a 100 Mbps local area network called Autonet. I didn’t work on Autonet much beyond design discussions, but I came away as coinventor with Chuck on a routing patent.  How cool is that?
Chuck’s next big idea, around 1988 or so, was to build a liquid cooled minimalist 200 MHz computer in a single ECL gate array.  Bill Hamburgen of the Digital Western Research Lab knew how to do the liquid immersion cooling. Phil Petit of SRC worked on the CPU design.  My piece was the level-1 cache modules, designed using 1K bit Gallium Arsenide SRAMs.  This was a lot of fun.  We never built it, because the project was overtaken by Alpha.
Digital’s Alpha chip was a technical tour de force.  Chuck’s idea was to build multiprocessor development systems around the chip, to speed Digital’s time to market, and just maybe, to encourage a bit more minimalism among the Digital engineering community.  At that time, the spec for Digital’s “BI” bus design for multiprocessors ran to some hundreds of pages.  Chuck’s design for the coherent memory bus for the Alpha Demonstration Unit was 13 pages.
The Alpha demonstration unit: a high-performance multiprocessor
I designed the I/O system for the ADU.  It was built with ECL100K, power dissipation no object, but very fast, and very clean signals.  Chuck designed the memory system, and Dave Conroy designed the CPU module.  Dave also embodies Chuck’s minimalist spirit.  He kept a copy of the classic 5 tube AM receiver circuit on the wall of his cube, with the caption “If you want a job here, remove a part from this design”.
All American Five radio design
Wikipedia Article on All American Five
The ADU project probably saved a year of time to market for Alpha products, and accelerated around a billion dollars in revenue.
As others have noted, while Chuck is primarily a hardware designer, “a humble purveyor of cycles,” he’s also an architect and programmer.  One time at Digital he got an early laptop, with TurboPascal, and immediately started writing CAD tools for himself.
I’ve gone off in different directions in the last 15 years, but I was able to visit Chuck in his lab at Microsoft last spring.  I have to say not much has changed; he was busy designing logic for an Ethernet controller, only this one runs at a gigabit and fits in a corner of the BEE3 FPGA system.
I am very pleased at the ACM’s recognition of Chuck’s contributions. Now go back to your offices and delete some logic or code that doesn’t really add anything. You will be one step closer to perfection.
-Larry

John Mucci

I am greatly saddened to report the passing of John Mucci this past Sunday.
As some of you know, I have an affinity for people when I cannot predict what they are going to say. I am not talking about randomness, but about folks who have an approach to thinking that is unlike my own and a depth of insight I rarely reach.  John was one of those people.
I first met John in 2004, when I started talking to Matt Reilly and Jud Leonard about SiCortex.  John was perhaps foremost a salesman, but he worked not only to understand the technology but to understand it well enough to see how it would apply in new situations.  At first I was surprised that the CEO wanted to interview every prospective team member, but I came to treasure the hiring meetings afterwards.  Not only could I not predict his opinion, but his assessments made sense.
As far as I could tell, John knew everyone involved with High Performance Computing and usually on a first name basis.  Walking the floor with John at Supercomputing was an experience.  The mean free path between people he knew was about 10 feet.
Prior to SiCortex, John had worked at Digital Equipment and at Thinking Machines, where our paths almost intersected: I worked elsewhere at Digital, and later Open Market took over space in Cambridge previously occupied by TMC.  I mention this because of another similarity between Thinking Machines and SiCortex – both were well liked by their customers.  I know that John Mucci was responsible for the good regard folks have for SiCortex, and I like to think the same was true at Thinking Machines.
He will be missed.

Excellent Ad Placement

Ad placement is the problem of putting an advertisement in exactly the right spot so that the people you are trying to reach will see it.
This week I was at the Supercomputing conference in Portland.  The density of iPhones at the conference is very high.  One morning I fired up my AccuWeather app to find out if I should bring the umbrella.  It is Portland, after all!

iPhone screen shot

This is outstanding ad placement.  You are looking for supercomputer geeks.  They congregate in Portland, they have iPhones. They are going to check the weather.  Score.

iPhone tweaks

Overall I am very pleased with the iPhone 3G after a year.  Like everyone else, I detest AT&T.
There are things that could be improved.
UNDO!  In the mail application, I sometimes accidentally delete a message.  I notice right away, but it is a pain to recover.  My several mail apps use different naming schemes for deleted messages.  I’m sure I could fix that, but right now I have to check each of the Trash/Deleted Messages folders to locate the one the iPhone uses.  Then I have to scan down the list of deleted messages, because the table of contents is kept sorted.  I delete so much spam that there can easily be dozens of spams ahead of the message I want.  Once I find it, I have to refile it back to the inbox.  All this could be fixed with one undo button.
MARKING – It is a common event that I scan arriving mail while on the go.  The iPhone keyboard is not so wonderful that I respond on the spot.  Instead, I’d like to mark the messages I want to deal with later.  I could refile them to a todo folder, but I’d rather mark or flag them in the inbox.  I’m not much for sorting email into folders.  I just create a sequentially numbered archive folder with a few thousand old messages every few months.  I’ve been doing this since 1978, so it is kind of set.
The iPhone could use the edit dialog for this; right now, all it does is let you mark mail for deletion.  Instead, you could flag messages with the edit dialog.  To flag a single message, you might swipe it to the left.
Other iPhone apps.  I use Google Reader on the iPhone.  Whenever my daughter borrows the phone, she logs into her account on Google Reader.  When I get the phone back, I have to retype my own login data.  The Google Reader app could make it easier to select from multiple sets of credentials.  Or maybe the whole phone could have a switcher so multiple people could easily share it.
The iPhone needs a way to turn off all phone functionality while leaving WiFi running. This is for airplanes with WiFi.
UPDATE Sept 12.
I forgot the most important improvement to Google Reader.  “Mark all items as read” should also return to the feeds view, rather than staying on the particular feed looking at an empty screen.  My feed reading style is to scan the entries, reading the ones that look interesting, then “Mark all as read” and move on.  I always have to click again to return to the feed list.