A Debugging Story

I’ve been working on fos at MIT CSAIL in recent months. fos is a factored operating system, in which the parts of the OS communicate by sending messages to each other, rather than through shared memory with locks and traps and so forth. The idea of fos is to make an OS for manycore chips that is more scalable than existing systems. It also permits system services to be elastic – to grow and shrink with demand – and it permits the OS to span more than one box, if you want.
The fos messaging system has several implementations. When you haven’t sent a message to a particular remote mailbox before, you send it to the microkernel, which delivers it. If you keep on sending messages to the same place, the system creates a shared page between the source and destination address spaces, and messages can flow in user mode, which is faster. Messages that cross machine boundaries are handled by TCP/IP between proxy servers on each end.
I’ve been making the messaging system a bit more object oriented, so that in particular you can have multiple implementations of the user space shared memory message transport, with different properties. After I got this to pass the regression tests, I checked it in and went on to other stuff.
Charles Gruenwald, one of the grad students, started using my code in the network stack, as part of a project to eliminate multiple copies of messages. (I added iovec support, which makes it easier to prepend headers to messages.) His tests were hanging. Charles was kind enough to give me a repeatable test case, so I was able to find two bugs. (And yes, I need to fix the regression tests so that they would have found these!)
Next, Chris Johnson, another one of the grad students, picked up changes from Charles (and me) and his test program for memcached started to break.
All the above is just the setup.  Chris and I spent about two days tracking this down…
Memcached is a multithreaded application that listens for IP connections, stores data, and gives it back later.  It is used by some large scale websites like facebook.com to cache results that would be expensive to recompute.
When a client sends a data object to memcached for storage, memcached replies on the TCP connection with “STORED\r\n”. On occasion, this 8 character message would get back to Chris’s client as “”, namely all binary 0’s. Since the git commits between working and not working were associated with my messaging code and the new iovec support, it seemed pretty likely that the problem was there. However, the problem occurred with both of the new implementations of shared page messaging, so it couldn’t really be anything unique to one or the other. That left changes in the common code or in the iovec machinery.
Now fos is a research OS, and is somewhat lacking in modern conveniences, such as a debugger, even for library code in user mode.  However, we have printf, and all the sources.
First, we added…  When I say “we” I really mean Chris, because he is a vi/cscope user, and I am emacs/etags.  I think he types faster too.
First we added a strncmp(“STORED”…) inside the message libraries to locate the case. When the string matched, we set a new global variable to indicate a case of interest. We couldn’t add printf to all the messaging code, because it is used all over the place by many system services; there would be too much output and general flakiness. Now, with the new global, we could effectively trace down into the messaging libraries, watching the “STORED” go by and printing if it disappeared… which it did.
However, we got lots of disappearance messages, many due to other messages being sent. Since we also suspected the iovec machinery, we added printfs to print the number and sizes of the iovecs, and their contents. One of the places we came across was the fos dispatch library, an RPC mechanism that prepends a header to an existing message. The iovec form of this does something like
struct iovec new_iovec[in_iovcnt + 1];
to allocate a variable length array on the stack. Now this is a feature added to the C language as part of ISO C99, and supported by GCC in C90 or C99 mode, but it makes me nervous. Just in case, we changed the declaration to
struct iovec new_iovec[10];
but it made no difference.
Eventually we found that the “STORED” was there on entry to a function called “sendData”, but had vanished before the actual send. And there were no references to the buffer in the interim. This suggested that someone was using a pointer after freeing it, that the space had been reallocated to our data buffer, and that it was then clobbered by someone else. All that separated the “STORED” from the “” was a check of the fos name cache to see if the destination mailbox was still valid. More printfs established that the data vanished in exactly the case that the name cache entry had expired, requiring a fos message send to the name server to get a refreshed copy.
A search of the name server library revealed no obvious problem, but there was storage allocation in there, which might be relevant, if in fact the heap had gotten scrambled.
Overnight, I looked at all uses of malloc and free in the messaging library and they all seemed OK, but I thought this was an unlikely idea anyway because the failure happened with both implementations of shared page messaging.
This morning Chris and I had the idea of printing the region around the “STORED” to try to figure out whether only our data was changed or whether the damage covered some larger area. This was difficult to tell, because the local region of memory was mostly 0’s already. There was an ASCII string a little before our data, “suffix”, that was also clobbered. We didn’t know what that was, but cscoping and grepping through the entire source tree located it as a name attached to a memcached data structure. It came to be near the “STORED” because memcached did a strdup of a procedure argument, which malloc’d space for the string out of the same general area of the heap. This clue meant that a larger region of the heap was being clobbered, but we still didn’t know how much.
One aspect of this whole affair, incidentally, was that the problem always happened at the same virtual address: 0x709080. No idea why, but having a stable address makes it much easier to track.
Next, Chris added code to fill the 1024 bytes centered on 0x709080 with 0xFF, and printed what it looked like after the disappearance.  Now this is just gutsy.  We had no idea what data was there, or used by who, and we just overwrote it with the 0xFF pattern, hoping the system would survive long enough to print the “after” pattern.  In fact it crashed immediately, but by changing the size of the 0xFF region, we learned that the clobber affected exactly 136 bytes, all 0’s except the first, which was 0x20.
Well, 136 is an odd size. We grepped the whole code base, looking for any occurrences of 136, but did not find any.
Next, we wondered if the clobber might be made by someone calling memcpy or memset. Since the address was stable, we were able to add code to the memcpy library routine something like this:
if ((uintptr_t)ptr <= 0x709080 && (uintptr_t)ptr + size > 0x709080)
    printf(/* the arguments to memcpy */);
But we didn’t get any printfs at all… not even from our own initialization of that space. We realized that gcc includes an “intrinsic” implementation of memcpy, which it will use when the actual arguments make it convenient, such as knowledge that the pointers are 8 byte aligned and the length is a constant, or the like. It is possible to turn off the compiler intrinsic with the -fno-builtin flag, so we dug into the fos Makefiles to add this to CFLAGS.
Now we got printfs from memcpy, and a nearly immediate page fault caused by running out of stack space. It turns out that some variants of printf call memcpy internally, and we had managed a recursive loop. We also got way too much printout, because we had added printing to the library copy of memcpy, used by all applications and services. We got out of that by having the memcpy test code check the magic global variable to see if we were inside the code region of interest, as well as a second magic variable set only in the memcached application. We also added a call to print the return address of the caller of memcpy, so we could identify who was making the call.
We didn’t find any useful memcpy calls, so we added the same logic to memset.
Widening the test for addresses to cover the entire page containing 0x709080, we found two 8 byte memset calls to the region right before 0x709080, but not including it. These calls came from inside the libevent library, which memcached uses to dispatch work. libevent was preparing a call to select(2). The nearby code was realloc’ing the file descriptor bit masks and then using memset to zero them out before calling select. This seemed unrelated to our bug, since the memsets didn’t overlap our “STORED” buffer.
Now what?  This could be a storage allocator usage problem, with someone using heap storage after calling free on it, or it could be a buffer overflow problem, with someone writing off the end of an array, but these things are difficult to find.  We thought about replacing malloc with one that carefully checked for some error cases, by putting sentinels around allocated storage.  Even worse, the problem could be that the page of memory had become shared with some other address space, at entirely different virtual addresses.  After all, the suspect messaging code does things like that.
Someone said, “If we had a debugger, we could just use a watchpoint.” A watchpoint is a way of saying “let me know when this memory location is changed”. But we had no debugger. I thought, well, these x86 processors we are using have hardware to support watchpoints; how does it work?
Some work with Google and the Linux kernel cross reference website showed that gdb implements watchpoints by using the Linux ptrace system call, which in turn, through some elaborate machinery, eventually sets some debug registers deep in the x86 processor chip. From that point, once any program touches the watched location, the chip generates a debug interrupt, and the OS returns control to gdb, letting it explain to the user what happened.
Now we didn’t have gdb, and fos doesn’t have ptrace, and we’re not even running directly on x86 hardware, we’re running inside of a Xen virtual machine hosted by a linux OS, but how hard could it be?
We decided to implement support for hardware watchpoints in fos.
We added a new system call, “set debug register”, with no security whatever. The user program just makes this new syscall, passing raw bit values for the debug register. The microkernel takes the argument and calls HYPERVISOR_set_debugreg(), which Xen thoughtfully supplies to do the heavy lifting. We added a second system call to read back the register.
A careful reading of the fos interrupt handlers seemed to say that the debug interrupt, while not expected to be used, did have a default handler in place that would print the machine registers and then crash.
Now we called this new function to set a hardware watchpoint at 0x709080, and called it again to turn on the watchpoint enable bits in the debug control register. Nothing happened. We read back the registers, and they seemed to be set to the right bit values, according to Wikipedia (and the Intel x86_64 processor reference manual). This could happen because we got the code wrong, because Xen didn’t in fact implement this functionality, or who knows. So we added another call to memcpy to overwrite the “STORED” ourselves, and we got an immediate crash dump.
This meant that the mechanism was working, but it wasn’t finding the clobber.  That probably meant that whoever was doing the clobber was running on a different processor core, each of which has their own debug registers.
Now the right way to handle this is for the set_debugreg system call to send messages to all the other cores on the machine to set their debug registers, using inter-processor interrupts. fos doesn’t have any IPI support, and in fact the microkernel has no way to communicate between cores. The only place that needs to do this is the scheduler, which works by locking and then enqueuing processes onto the scheduler data structures of other cores. No help to us.
But, all cores are running timer interrupts!  So inside our “set debug register” system call, we copied the arguments into microkernel global variables, and set up an array of flags, one per core.  The system call set all the flags to “true”.  Now in the timer interrupt, every core would check the flag for itself, and if set, copy the values in the global to the core’s local debug registers, then clear the flag.
The system call would spin on the flags until they were all clear again, then return to user mode.  This is a really hacky way of having all cores load the 0x709080 into their debug registers at the right moment.
Now this was a little bit of a Hail Mary. The x86 debug registers work on virtual addresses, so if the clobber were happening because the page was shared, and shared at a different VA, then we would not catch it.
But we did! We ran the test, which waited until the “STORED” was there, then set the debug registers for 0x709080, and proceeded. We got a crash dump, and the return address on the exception stack was… libevent’s implementation of support for select(2), running in memcached, but in a different thread, on a different processor core than the thread sending “STORED”.
Now all we had was the program counter. We could identify the function by using “nm” to print the symbol table for the memcached executable, but getting to the source line of code is harder. We found useful switches in objdump, -d -S, which print a disassembled listing of the binary executable code, interspersed with the source code, provided the file was compiled with the -g flag. That took another spin through the fos Makefiles, which were using -g3, evidently a slightly different version of -g that is not compatible with objdump. Now we were able to see the offending source line as…
or similar. This is code that is zeroing the file descriptor bit vector about to be used in a call to select. It was not found by our instrumentation of memset because FD_ZERO was still apparently using a compiler intrinsic, just a straight line sequence of movq instructions zeroing 128 bytes, in the middle of which was our “STORED” buffer. I’m not sure if -fno-builtin didn’t work for this, or if it was controlled by a different Makefile’s CFLAGS, or what.
… FD_ZERO was zeroing 128 bytes of a buffer that had recently been allocated with only 8 bytes of memory.
Now here is another bit of unix/linux history, I think. When select was first defined, I think by the BSD folks at UC Berkeley, the sizes of the file descriptor bitmaps were variable, and needed only to be large enough to hold the maximum file descriptor under consideration. At some point Linux, blessed later by the POSIX standards committee, made the size of select descriptor arrays fixed, with a system specific constant. In our case, the version of libevent we had was BSD derived, with variable size descriptor arrays, but it was calling into a select implementation that was POSIX derived and expected the (larger) fixed size.
Incidentally, the 136 byte clobber was also now explained: the select code was FD_ZEROing both the readfd and the writefd arrays, which were 8 bytes apart in memory, leading to two overlapping 128 byte clobbers whose union covered 136 bytes.
The fix for this bug was updating the libevent select client to use fixed size descriptor arrays. This bug had nothing at all to do with the iovec or messaging code. We just happened to run into it there because of the chance coincidence of our messaging buffer containing “STORED” being allocated right after the select descriptor arrays that were too short.
Followup:  My colleague Matteo Frigo reports:

FD_ZERO is written in assembly (the most misguided "optimization" ever?):
[from <bits/select.h> on glibc/amd64:]
# define __FD_ZERO(fdsp)                                                    \
  do {                                                                      \
    int __d0, __d1;                                                         \
    __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS                       \
                          : "=c" (__d0), "=D" (__d1)                        \
                          : "a" (0), "0" (sizeof (fd_set)                   \
                                          / sizeof (__fd_mask)),            \
                            "1" (&__FDS_BITS (fdsp)[0])                     \
                          : "memory");                                      \
  } while (0)


One thought on “A Debugging Story”

  1. Great sleuthing! I seem to remember the bit vectors were once variably sized; it caused some hassles when DEC Unix upped the max from 1024 to 4096 FDs. Scary how BSD and POSIX haven’t converged on simple things like select(2) after 25 years.
