A Debugging Story

I’ve been working on fos at MIT CSAIL in recent months. fos is a factored operating system, in which the parts of the OS communicate by sending messages to each other, rather than by communicating by shared memory with locks and traps and so forth.  The idea of fos is to make an OS for manycore chips that is more scalable than existing systems.  It also permits system services to be elastic – to grow and shrink with demand, and it permits the OS to span more than one box, if you want.
The fos messaging system has several implementations.  When you haven’t sent a message to a particular remote mailbox, you send it to the microkernel, which delivers it.  If you keep on sending messages to the same place, then the system creates a shared page between the source and destination address spaces and messages can flow in user mode, which is faster.  Messages that cross machine boundaries are handled by TCP/IP between proxy servers on each end.
I’ve been making the messaging system a bit more object oriented, so that in particular you can have multiple implementations of the user space shared message transport, with different properties. After I got this to pass the regression tests, I checked it in and went on to other stuff.
Charles Gruenwald, one of the grad students, started using my code in the network stack, as part of a project to eliminate multiple copies of messages (I added iovec support, which makes it easier to prepend headers to messages), and his tests were hanging.  Charles was kind enough to give me a repeatable test case, so I was able to find two bugs.  (And yes, I need to fix the regression tests so that they would have found these!)
Fine.
Next, Chris Johnson, another one of the grad students, picked up changes from Charles (and me) and his test program for memcached started to break.
All the above is just the setup.  Chris and I spent about two days tracking this down…
Memcached is a multithreaded application that listens for IP connections, stores data, and gives it back later.  It is used by some large scale websites like facebook.com to cache results that would be expensive to recompute.
When a client sends a data object to memcached for storage, memcached replies on the TCP connection with “STORED\r\n”.  On occasion, this 8 character message would get back to Chris’s client as eight bytes of binary 0’s.  Since the git commits between working and not working were associated with my messaging code and the new iovec support, it seemed pretty likely that the problem was there.  However, the problem occurred with both of the new implementations of shared page messaging, so it couldn’t really be anything unique to one or the other. That left changes in the common code or in the iovec machinery.
Now fos is a research OS, and is somewhat lacking in modern conveniences, such as a debugger, even for library code in user mode.  However, we have printf, and all the sources.
First, we added…  When I say “we” I really mean Chris, because he is a vi/cscope user, and I am emacs/etags.  I think he types faster too.
First we added a strncmp(“STORED”…) inside the message libraries to locate the case. When the string matched, we set a new global variable to indicate a case of interest. We couldn’t add printf to all the messaging code because it is used all over the place, by many system services. There would be too much output and general flakiness.  Now, with the new global, we could effectively trace down into the messaging libraries, watching the “STORED” go by and printing if it disappeared…. which it did.
However, we got lots of disappearance messages, many due to other messages being sent. Since we also suspected the iovec machinery, we added printfs to print the number and sizes of the iovecs, and their contents.  One of the places we came across was in the fos dispatch library, which is an rpc mechanism that prepends a header on an existing message. The iovec form of this does something like
struct iovec new_iovec[in_iovcnt + 1];
to allocate a variable length array on the stack. Now this is a feature added to the C language as part of ISO C99, and supported in GCC in C90 or C99 mode, but it makes me nervous.  Just in case, we changed the declaration to
struct iovec new_iovec[10];
but it made no difference.
Eventually we found that the “STORED” was there on entry to a function called “sendData”, but had vanished before the sending.  And there were no references to the buffer in the interim.  This suggests that someone is using a pointer after freeing it, and the space has been reallocated to our data buffer, but then clobbered by someone else.  All there was separating the “STORED” from the zeros was a check of the fos name cache to see if the destination mailbox was still valid. More printfs established that the data vanished in exactly the case that the name cache entry had expired, requiring a fos message send to the name server to get a refreshed copy.
A search of the name server library revealed no obvious problem, but there was storage allocation in there, which might be relevant, if in fact the heap had gotten scrambled.
Overnight, I looked at all uses of malloc and free in the messaging library and they all seemed OK, but I thought this was an unlikely idea anyway because the failure happened with both implementations of shared page messaging.
This morning Chris and I had the idea of printing the region around the “STORED” to try and figure out if only our data was changed or if the change was some larger area. This was difficult to tell, because the local region of memory was mostly 0’s already. There was an ascii string a little before our code “suffix” that was also clobbered. We didn’t know what that was, but cscoping and grepping through the entire source tree located it as a name attached to a memcached data structure.  It came to be nearby the “STORED” because memcached did a strdup of a procedure argument, which malloc’d space for the string out of the same general area of the heap.  This clue meant that a larger region of the heap was being clobbered, but we still didn’t know how much.
One aspect, incidentally, of this whole affair was that the problem always happened at the same virtual address: 0x709080.  No idea why, but having a stable address makes it much easier to track.
Next, Chris added code to fill the 1024 bytes centered on 0x709080 with 0xFF, and printed what it looked like after the disappearance.  Now this is just gutsy.  We had no idea what data was there, or used by who, and we just overwrote it with the 0xFF pattern, hoping the system would survive long enough to print the “after” pattern.  In fact it crashed immediately, but by changing the size of the 0xFF region, we learned that the clobber affected exactly 136 bytes, all 0’s except the first, which was 0x20.
Well 136 is an odd size.  We grepped the whole code base, to look at any 136s, but did not find any.
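The fill-and-inspect trick can be sketched roughly as follows. This is an illustration only, not the code Chris wrote: the names fill_sentinel, find_clobber and struct clobber_extent are hypothetical, and it works on an ordinary buffer rather than a fixed virtual address.

```c
#include <stddef.h>
#include <string.h>

#define FILL_BYTE 0xFF   /* sentinel pattern, as in the text */

/* Result of scanning a sentinel-filled region for damage. */
struct clobber_extent {
    size_t first;   /* offset of the first changed byte */
    size_t length;  /* bytes from first to last changed byte, 0 if intact */
};

/* Fill the region with the sentinel pattern before the suspect code runs. */
static void fill_sentinel(unsigned char *region, size_t size)
{
    memset(region, FILL_BYTE, size);
}

/* Afterwards, find the span of bytes that no longer hold the sentinel. */
static struct clobber_extent find_clobber(const unsigned char *region, size_t size)
{
    struct clobber_extent ext = { 0, 0 };
    size_t i, last = 0;
    int found = 0;

    for (i = 0; i < size; i++) {
        if (region[i] != FILL_BYTE) {
            if (!found) { ext.first = i; found = 1; }
            last = i;
        }
    }
    if (found) ext.length = last - ext.first + 1;
    return ext;
}
```

One limitation of this approach: a clobber that happens to write the sentinel value itself is invisible, so the reported extent is only a lower bound.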
Next, we wondered if the clobber might be made by someone calling memcpy or memset. Since the address was stable, we were able to add code to the memcpy library routine something like this:
if (ptr <= 0x709080 && (ptr + size) > 0x709080) printf(arguments to memcpy)
But we didn’t get any printfs at all… including our own initialization of that space.  We realized that gcc includes an “intrinsic” implementation of memcpy, which it will use when the actual arguments make it convenient, such as knowledge that the pointers are 8 byte aligned and the length is a constant, or the like.  Now it is possible to turn off the compiler intrinsic by using the -fno-builtin flag to the compiler, so we dug into the fos Makefiles to add this to CFLAGS.
Now we got printfs from memcpy, and a nearly immediate page fault caused by running out of stack space.  It turns out that some variants of printf call memcpy internally, and we had managed a recursive loop.  We also got way too much printout, because we had added printing to the library copy of memcpy, used by all applications and services. We got out of that by having the memcpy test code check the magic global variable to see if we were inside the code region of interest as well as a second magic variable set only in the memcached application.  We also added a call to print the return address of the caller of memcpy so we could identify who was making the call.
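A rough sketch of that kind of instrumented copy routine is below. It is not the fos library code: the names traced_memcpy, covers, trace_enabled, in_trace and trace_hits are made up for illustration, and __builtin_return_address is a GCC/Clang extension rather than standard C.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define WATCH_ADDR ((uintptr_t) 0x709080)  /* the stable address from the text */

static int trace_enabled;        /* "magic global": set only in the region of interest */
static int in_trace;             /* recursion guard: printf may itself call memcpy */
static unsigned long trace_hits; /* number of calls that covered WATCH_ADDR */

/* True if the range [dst, dst+n) covers addr. */
static int covers(uintptr_t dst, size_t n, uintptr_t addr)
{
    return dst <= addr && addr - dst < n;
}

/* Byte-at-a-time copy that reports any write covering WATCH_ADDR. */
void *traced_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    if (trace_enabled && !in_trace && covers((uintptr_t) dst, n, WATCH_ADDR)) {
        in_trace = 1;
        trace_hits++;
        fprintf(stderr, "memcpy(%p, %p, %zu) called from %p\n",
                dst, (const void *) s, n, __builtin_return_address(0));
        in_trace = 0;
    }
    for (i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```

The guard variables here are plain globals, so this sketch is not thread safe; in a multithreaded program like memcached they would need to be per-thread or atomic.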
We didn’t find any useful memcpy calls, so we added the same logic to memset.
Widening the test for addresses to cover the entire page containing 0x709080 we found two 8 byte memset calls to the region right before 0x709080 but not including 0x709080.  These calls came from inside the libevent library used by memcached to dispatch work. libevent was preparing a call to select(2). The nearby code was realloc’ing the file descriptor bit masks and then using memset to zero them out before calling select. This seemed unrelated to our bug, since the memsets didn’t overlap our “STORED” buffer.
Now what?  This could be a storage allocator usage problem, with someone using heap storage after calling free on it, or it could be a buffer overflow problem, with someone writing off the end of an array, but these things are difficult to find.  We thought about replacing malloc with one that carefully checked for some error cases, by putting sentinels around allocated storage.  Even worse, the problem could be that the page of memory had become shared with some other address space, at entirely different virtual addresses.  After all, the suspect messaging code does things like that.
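For reference, a malloc wrapper with sentinels might look something like the sketch below. We did not actually build this; the names guarded_malloc, guarded_check and guarded_free, the guard size, and the guard byte are all arbitrary choices for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define HDR_LEN    16    /* holds the requested size; 16 keeps user data aligned */
#define GUARD_LEN  16    /* sentinel bytes on each side of the user data */
#define GUARD_BYTE 0xA5

/* Layout: [size header][front guard][user data][back guard] */
void *guarded_malloc(size_t size)
{
    unsigned char *base = malloc(HDR_LEN + GUARD_LEN + size + GUARD_LEN);

    if (base == NULL) return NULL;
    memcpy(base, &size, sizeof size);
    memset(base + HDR_LEN, GUARD_BYTE, GUARD_LEN);
    memset(base + HDR_LEN + GUARD_LEN + size, GUARD_BYTE, GUARD_LEN);
    return base + HDR_LEN + GUARD_LEN;
}

/* Returns 1 if both guards are intact, 0 if something wrote outside the block. */
int guarded_check(const void *p)
{
    const unsigned char *user = p;
    const unsigned char *base = user - GUARD_LEN - HDR_LEN;
    size_t size, i;

    memcpy(&size, base, sizeof size);
    for (i = 0; i < GUARD_LEN; i++) {
        if (base[HDR_LEN + i] != GUARD_BYTE) return 0;
        if (user[size + i] != GUARD_BYTE) return 0;
    }
    return 1;
}

void guarded_free(void *p)
{
    if (p != NULL) free((unsigned char *) p - GUARD_LEN - HDR_LEN);
}
```

This only detects writes that land in the guard bytes next to a block, and only when guarded_check is called, so it helps with small overruns but not with use after free or with writes that skip past the guards.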
Someone said, “If we had a debugger, we could just use a watchpoint”.  A watchpoint is a way of saying “let me know when this memory location is changed”.  But we had no debugger.  I thought, well, these x86 processors we are using have hardware to support watchpoints, how does it work?
Some work with google and the Linux kernel cross reference website showed that gdb implements watchpoints by using the linux ptrace system call, which in turn, through some elaborate machinery, eventually sets some debug registers deep in the x86 processor chip.  At that point, once any program touches the watched location, the chip generates a debug interrupt, at which point the OS returns control to gdb, letting it explain to the user what happened.
Now we didn’t have gdb, and fos doesn’t have ptrace, and we’re not even running directly on x86 hardware, we’re running inside of a Xen virtual machine hosted by a linux OS, but how hard could it be?
We decided to implement support for hardware watchpoints in fos.
We added a new system call “set debug register”, with no security whatever.  The user program just does this new syscall, passing raw bit values for the debug register.  The microkernel takes the argument, and calls HYPERVISOR_set_debugreg(), which Xen thoughtfully supplies to do the heavy lifting.  We added a second system call to read back the register.
A careful reading of the fos interrupt handlers seemed to say that the debug interrupt, while not expected to be used, did have a default handler in place that would print the machine registers and then crash.
Now, we called this new function to set a hardware watchpoint to 0x709080, and another to turn on the watchpoint control register.  Nothing happened.  We read back the registers, and they seemed to be set to the right bit values, according to wikipedia (and the Intel x86_64 processor reference manual). Now this could happen because we got the code wrong, because Xen didn’t in fact implement this functionality, or who knows. So we added another call to memcpy to overwrite the “STORED” ourselves, and we got an immediate crash dump.
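For the curious, the “right bit values” for the control register (DR7) can be computed along these lines. This is a sketch from the documented x86 register layout, not the fos code; the function name dr7_enable and the enum names are hypothetical. The watched address itself goes into one of DR0 through DR3 and should be aligned to the watched length.

```c
#include <stdint.h>

/* R/W field: which kind of access triggers the breakpoint. */
enum dr7_rw  { DR7_RW_EXEC = 0, DR7_RW_WRITE = 1, DR7_RW_READWRITE = 3 };

/* LEN field: width of the watched region (the 8 byte encoding is for 64-bit mode). */
enum dr7_len { DR7_LEN_1 = 0, DR7_LEN_2 = 1, DR7_LEN_8 = 2, DR7_LEN_4 = 3 };

/* Build the DR7 bits that enable breakpoint `slot` (0..3). */
uint64_t dr7_enable(unsigned slot, enum dr7_rw rw, enum dr7_len len)
{
    uint64_t dr7 = 0;

    dr7 |= (uint64_t) 1   << (slot * 2);        /* L<slot>: local enable bit */
    dr7 |= (uint64_t) rw  << (16 + slot * 4);   /* R/W<slot> field */
    dr7 |= (uint64_t) len << (18 + slot * 4);   /* LEN<slot> field */
    return dr7;
}
```

With this encoding, an 8 byte write watchpoint in slot 0 comes out as 0x90001. The sketch sets only the local enable bit; whether the local or global enable bit is more appropriate depends on how the OS switches tasks.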
This meant that the mechanism was working, but it wasn’t finding the clobber.  That probably meant that whoever was doing the clobber was running on a different processor core, and each core has its own debug registers.
Now the right way to handle this is for the set_debugreg system call to send messages to all the other cores on the machine to set their debug registers, using inter-processor interrupts.  fos doesn’t have any IPI, and in fact has no way to communicate to different cores in the microkernel.  The only place that needs to do this is the scheduler, which works by locking and then enqueuing processes onto the scheduler data structures of other cores.  No help to us.
But, all cores are running timer interrupts!  So inside our “set debug register” system call, we copied the arguments into microkernel global variables, and set up an array of flags, one per core.  The system call set all the flags to “true”.  Now in the timer interrupt, every core would check the flag for itself, and if set, copy the values in the global to the core’s local debug registers, then clear the flag.
The system call would spin on the flags until they were all clear again, then return to user mode.  This is a really hacky way of having all cores load the 0x709080 into their debug registers at the right moment.
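The handoff looks roughly like the sketch below. It is a single-threaded model of the idea, not the microkernel code: NCORES, the function names, and the core_dr0/core_dr7 arrays standing in for the real per-core debug registers are all assumptions for illustration.

```c
#include <stdint.h>

#define NCORES 4   /* assumed core count for this sketch */

static volatile uint64_t pending_dr0, pending_dr7;  /* values every core should load */
static volatile int reload_flag[NCORES];            /* one "please reload" flag per core */
static uint64_t core_dr0[NCORES], core_dr7[NCORES]; /* stand-ins for real debug registers */

/* System call side, step 1: publish the values, then raise every flag. */
void request_reload(uint64_t dr0, uint64_t dr7)
{
    int c;

    pending_dr0 = dr0;
    pending_dr7 = dr7;
    for (c = 0; c < NCORES; c++)
        reload_flag[c] = 1;
}

/* Timer interrupt side: each core services only its own flag. */
void timer_tick(int core)
{
    if (reload_flag[core]) {
        core_dr0[core] = pending_dr0;
        core_dr7[core] = pending_dr7;
        reload_flag[core] = 0;
    }
}

/* System call side, step 2: spin until this returns true. */
int all_reloaded(void)
{
    int c;

    for (c = 0; c < NCORES; c++)
        if (reload_flag[c]) return 0;
    return 1;
}
```

On real hardware the stores to the pending values must be visible to other cores before the flags are, so a proper version would need memory barriers or atomics rather than plain volatile variables.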
Now this was a little bit of a hail mary. The x86 debug registers work by virtual addresses, so if the clobber were happening because the page was shared, and shared with a different VA, then we would not catch it.
But we did!  We ran the test, which waited until the “STORED” was there, then set the debug registers for 0x709080, and proceeded.  We got a crash dump, and the return address on the exception stack was…libevent’s implementation of support for select(2), running in memcached, but in a different thread, on a different processor core than the thread sending “STORED”.
Now all we had was the program counter. We could identify the function by using “nm” to print the symbol table for the memcached executable, but getting to the source line of code is harder.  We found useful switches in objdump, -d -S, which print a disassembled listing of the binary executable code, interspersed with the source code, provided the file was compiled with the -g flag.  That took another spin through the fos Makefiles, which were using -g3, which is evidently some slightly different version of -g that is not compatible with objdump.  Now we were able to see the offending source line as…
FD_ZERO(readset);
or similar.  This is code that is zeroing the file descriptor bit vector about to be used in a call to select.  This was not found by our instrumentation of memset because FD_ZERO was still apparently using a compiler intrinsic, just a straight line set of movq instructions to zero 128 bytes, in the middle of which was our “STORED” buffer. I’m not sure if -fno-builtin didn’t work for this, or it was controlled by a different makefile for CFLAGS or what.
… FD_ZERO was zeroing 128 bytes of a buffer that had recently been allocated with only 8 bytes of memory.
Now here is another bit of unix/linux history, I think.  When select was first defined, I think by the BSD folks at UC Berkeley, the sizes of the file descriptor bitmaps were variable, and needed to be only large enough to hold the maximum number of file descriptors under consideration.  At some point Linux, blessed later by the POSIX standards committee, made the size of select descriptor arrays fixed, with a system specific constant.  In our case, the version of libevent we had was BSD derived, with variable size descriptor arrays, but it was calling into a select implementation that was POSIX derived and expected a (larger) fixed size.
Incidentally, the 136 byte clobber was also now explained: the select code was FD_ZEROing both the readfd and the writefd arrays, which were 8 bytes apart in memory, leading to two overlapping 128 byte clobbers adding to 136 bytes.
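The arithmetic is easy to check. The sketch below uses hypothetical helper names (union_len, fdset_bytes_needed, fixed_fdset_bytes), and the 64-bit word size in fdset_bytes_needed is an assumption about how a variable-size bitmap would be rounded up, not something taken from libevent.

```c
#include <stddef.h>
#include <sys/select.h>

/* Bytes covered by two ranges [a, a+alen) and [b, b+blen) that overlap or touch. */
size_t union_len(size_t a, size_t alen, size_t b, size_t blen)
{
    size_t lo = a < b ? a : b;
    size_t hi = (a + alen > b + blen) ? a + alen : b + blen;

    return hi - lo;
}

/* Bytes a variable-size bitmap needs for descriptors 0..maxfd,
 * rounded up to whole 64-bit words (the word size is an assumption). */
size_t fdset_bytes_needed(int maxfd)
{
    size_t words = ((size_t) maxfd + 64) / 64;

    return words * 8;
}

/* Bytes that FD_ZERO clears: always the full fixed-size fd_set. */
size_t fixed_fdset_bytes(void)
{
    return sizeof(fd_set);
}
```

Two 128 byte ranges starting 8 bytes apart give union_len(0, 128, 8, 128) == 136, and a bitmap for a handful of descriptors needs only 8 bytes, far less than the fixed fd_set that FD_ZERO clears.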
The fix to this bug was updating the libevent select client to use fixed size descriptor arrays.  This bug had nothing at all to do with the iovec or messaging code. We just happened to run into it there because of the chance coincidence of our messaging buffer containing “STORED” being allocated right after the select descriptor arrays that were too short.
-Larry
 
Followup:  My colleague Matteo Frigo reports:

FD_ZERO is written in assembly (the most misguided "optimization" ever?):
[from <bits/select.h> on glibc/amd64:]
# define __FD_ZERO(fdsp) \
 do {                                                                       \
   int __d0, __d1;                                                          \
   __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS                        \
                         : "=c" (__d0), "=D" (__d1)                         \
                         : "a" (0), "0" (sizeof (fd_set)                    \
                                         / sizeof (__fd_mask)),             \
                           "1" (&__FDS_BITS (fdsp)[0])                      \
                         : "memory");                                       \
 } while (0)

 

iovec to messagelet ring

I’ve been working on fos, the Factored Operating System.  fos is a project at MIT CSAIL. It uses messages to communicate between applications and services, in the same way a standard operating system uses system calls and function calls.
One thing I’ve been doing is adding iovec support to the message API.  This is equivalent to the difference between write(2), which writes a single buffer of data to a file, and writev(2), which uses an iovec data structure to gather pieces of the data into a single write to the file.  An iovec is an array of structures, each of which contains a pointer and length.
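As a reminder of what the gather form looks like, here is a small sketch using the standard POSIX writev call. The wrapper name write_with_header is made up for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write a header and a body with one writev call, without first copying
 * them into a single contiguous buffer. Returns bytes written or -1. */
ssize_t write_with_header(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *) header;   /* iov_base is not const-qualified */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *) body;
    iov[1].iov_len  = strlen(body);
    return writev(fd, iov, 2);
}
```

Prepending a header this way costs one extra iovec entry instead of a copy of the whole message, which is the property that makes iovec attractive for a messaging API.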
One of the types of high performance message transports in fos is a ring of cache-line sized messagelets.  Each messagelet has an 8 byte header and 56 bytes of data.  To send a message, you wait until the next messagelet in the ring is free (as shown by flags in the header), then you write an 8 byte message length field, and then copy the rest of the message into the messagelet.  If it doesn’t all fit, then you mark the first messagelet as filled, and wait for the next one to be free, and continue writing the message.
This is a slow design, because the sender must copy a longer message in 56 byte chunks, but it is also a rather fast method, because the receiver can be draining the head of the message while the sender is writing the tail.  The idea comes from a communications method in the Barrelfish research operating system.
With iovec, the sender has a bigger problem. In order to know the total length of the message, you have to add up the lengths in all the iovec entries. Then, you have to step through the iovec, and copy each one into a sequential series of messagelets.  An iovec entry may end in the middle of a messagelet.
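The bookkeeping can be sketched like this. The helper names iovec_total and messagelets_needed are hypothetical, and the sketch assumes that the 8 byte length field occupies the start of the first messagelet’s 56 byte data area and that an empty message sends nothing.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define ML_DATA  56                              /* data bytes per messagelet, from the text */
#define ML_FIRST (ML_DATA - sizeof(uint64_t))    /* payload room left in the first one */

/* Total message length: the sum of all iovec entry lengths. */
size_t iovec_total(const struct iovec *iov, int iovcnt)
{
    size_t total = 0;
    int i;

    for (i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    return total;
}

/* Number of messagelets a message of `total` payload bytes occupies. */
size_t messagelets_needed(size_t total)
{
    if (total == 0) return 0;            /* assumption: empty messages send nothing */
    if (total <= ML_FIRST) return 1;
    return 1 + (total - ML_FIRST + ML_DATA - 1) / ML_DATA;
}
```

Under these assumptions a 48 byte message fits in one messagelet and a 49 byte message needs two, regardless of how the bytes are split across iovec entries.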
How would you write this?  I’ve just started thinking about it, and will post my code here when I figure it out.
UPDATE
Here’s my version.
 

/* iovec_to_messagelet_ring.c
 * L. Stewart
 * 2011-12-29
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>
#include <string.h>
/* Messagelet functions */
typedef void CHANNEL; /* placeholder */
#define ML_SIZE 56
/* Returns a pointer to the data area of a messagelet.
 * The header is -8 bytes offset
 */
void *getfreemessagelet(CHANNEL *ch);
/* sets ready flag in messagelet header, turning it over to the receiver */
void postmessagelet(CHANNEL *ch, void *m);
void send(CHANNEL *ch, struct iovec *in_iov, int in_iovcnt)
{
  size_t total_size = 0; /* bytes of message remaining to send */
  int iov_index;  /* current iovec entry */
  void *m = NULL; /* current messagelet */
  void *mp; /* current pointer into messagelet */
  size_t ml_len; /* space left in current messagelet */
  size_t copy_length; /* amount to copy this time around the loop */
  struct iovec iov; /* working iovec entry */
  /* calculate total size of message by adding the lengths of the iovec entries */
  for (iov_index = 0; iov_index < in_iovcnt; iov_index += 1)
    total_size += in_iov[iov_index].iov_len;
  if (total_size == 0) return; /* nothing to do */
  m = getfreemessagelet(ch);
  *((uint64_t *) m) = total_size; /* set length of message */
  mp = (void *) ((uintptr_t) m + sizeof(uint64_t));
  ml_len = ML_SIZE - sizeof(uint64_t);
  iov_index = 0;
  iov.iov_len = 0;
  while (total_size > 0) {
    if (ml_len == 0) {
      m = mp = getfreemessagelet(ch);
      ml_len = ML_SIZE;
    }
    if (iov.iov_len == 0) {
      iov = in_iov[iov_index];
      iov_index += 1;
    }
    copy_length = (iov.iov_len < ml_len) ? iov.iov_len : ml_len;
    memcpy(mp, iov.iov_base, copy_length);
    ml_len -= copy_length;
    iov.iov_len -= copy_length;
    mp = (void *) ((uintptr_t) mp + copy_length);
    iov.iov_base = (void *) ((uintptr_t) iov.iov_base + copy_length);
    if (ml_len == 0) postmessagelet(ch, m);
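    total_size -= copy_length;
  }
  /* post a final, partially filled messagelet; full ones were posted in the loop */
  if (ml_len != 0) postmessagelet(ch, m);
}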

 

Tetrahedron

At work in the aftermath of the Halloween snow storm, one of my colleagues brought in his son because school was closed.  I joined a math discussion between the boy and my boss Steve Heller on the subject of ways to think about products of the form (x + a) (x – a). Afterwards, Steve happened to mention that it was possible to inscribe a tetrahedron inside a cube, and a cube inside a dodecahedron.
The dodecahedron sounds difficult, but I decided to build a tetrahedron inside a cube.  The tetrahedron is cut out of a manila folder, and the cube is made from a sheet protector.

Tetrahedron inscribed in a cube

Black Friday Report: Target

Abstract: Mixed
Wednesday evening around 9:15PM I drove my daughter to the Target in Framingham to look for boots.  They were closed.  This was surprising because their newspaper ad said “Open until 11,” and their phone message said “Open until 11.”  In fact, Cathy had spoken to the store operator earlier in the day just to make sure and was told “yes, we are open until 11.”
Thursday night, my daughter and I went with my neighbor to the same store to look at Black Friday doorbusters.  The newspaper ad said they would open at midnight.  They were closed.  The line wrapped halfway around the building.  Eventually some workers came down the line handing out maps.  They said that Massachusetts law wouldn’t let them open at midnight, so they would open at 1AM.  By this time it was around 33 degrees, and still 45 minutes to wait.  We went home.
I looked into this question of law, and found an article dated about 10 days ago which said that Massachusetts Blue Laws forbid employees from working before midnight on Thanksgiving, in order to let them have a holiday.  So evidently, staff could report at midnight, but it took them an hour to unlock the doors.
I think this is one of those situations in which Target, at least this store, doesn’t get it.  They seem honestly puzzled that the public might expect them to be open when their ads say, and expect that staff give correct information about hours, or that anyone might not be grateful for the chance to stand around in freezing weather for an hour in the middle of the night in order to come into their store.
So why do I say “mixed”?  Because I was gullible enough to go back at 6:30AM Friday to the same store that locked me out twice in two days.  And you know? They did a really good job.  All the workers were there. Everyone seemed to know where everything in the store was located. They had adequate stock. They were friendly.
I should add, however, that the store was recently remodelled, turning a once open layout with long sight-lines into the sort of place where you can’t see where you are trying to get to.  The interior is now about halfway between reasonable and Walmart.
 

Driving Practice

My daughter now has a learner’s permit.  For her first outing, she went with my wife to the local elementary school parking lot on a weekend.  Evidently the only casualties were two traffic cones and a portapotty.
Actually they made up the part about the portapotty, but it was a good story.
Later, my wife was talking to a friend about this, and the friend suggested that after parking lot proficiency is attained, the next level is driving in the town cemetery. Nice empty winding roads.  The friend finished with “and the best thing is you can’t kill anyone.”

Kilauea

This is a bit out of order, reporting on our trip to Hawaii Volcanoes National Park on the Big Island in Hawaii.  This was before Hurricane Irene, but I am just getting to it now.
We flew from Maui to Kona on Pacific Wings airline, which the kids now call “Best Airplane Ride Ever”. We flew on a 9 passenger Cessna 208B (a Cessna Caravan single engine turboprop).  The pilot was also the counter agent, baggage handler, and ground crew.  Thinking about it afterwards, it is no wonder that it was a little tricky getting a reservation through Travelocity: our party accounted for 6 of the 9 seats!
We rented a minivan and drove around to the Hilo area, to a rental in Hawaiian Beaches.  This is pretty much at the end of the road in nowhere.  No ATT cell coverage, and no Verizon either. We looked at local attractions for a day and then went to Hawaii Volcanoes National Park to visit the Kilauea volcano, which has been erupting, more or less, since 1983.
The current lava flows are from the Pu’u O’o crater, which is in the east fissure zone, and more or less inaccessible without a several hour hike.  Not clear it is a good idea to go there anyway, since the sulphur dioxide concentrations can be lethal within a mile or so if you get downwind.
The main caldera of Kilauea is about 400 feet deep and 2.5 miles across.  Towards the southwest side, there is a smaller crater called Halema’uma’u, which is about 250 feet deep.  Inside Halema’uma’u there is a vent about 500 feet across, and inside that, there is a lava lake whose height fluctuates with volcanic activity. The day we were there the lake level was about 550 feet below the top of the vent.
Overlooking Halema’uma’u there is the Volcano Observatory, and the Jaggar Museum, from the patio of which you can watch events.  Here is a photo I took around 7:15 PM.

Halema'uma'u at dusk

Earlier in the day we drove down the chain of craters road until the end:
Road Closed due to Lava

Across the street there is a sign that is worth reading:
Warning sign

And a short walk to the cliff is  worthwhile as well:
Lava Bridge

Our trip to the volcano was delayed by a couple of hours because the car wouldn’t start.  The dashboard merely said, helpfully, “badkey”.  The remote controls still worked, but the car wouldn’t recognize the RFID chip or whatever is inside these newfangled Chrysler keys.  Alamo rentals was full of warnings not to get the key wet, but we hadn’t.  Alamo sent a local towing company to our out of the way house with a new minivan and took away the old one.  Probably a replacement key would have been sufficient, but we had rented in Kona which is three hours away, rather than from the Hilo office.  Thank you Alamo for taking  care of us, but I guess I am old fashioned.  I’ve never had a mechanical key break and I don’t understand the attraction of the electronic version.

Haleakala

We’ve recently returned from a family vacation to Hawaii.  Cathy and I went to Maui and the Big Island for our honeymoon, and we returned to those islands with the kids, 20 years later.
On August 14, we drove up to the top of Haleakala (“House of the Sun”). This is the 10,000 foot volcano on Maui, and the sunrise is reputed to be spectacular.  We got everyone up at 2:30 AM and got to the top at 5AM, in time to get a parking space in preparation for the sunrise at 6AM.
It is cold up there, even in August.

Bundled up on Haleakala

Before sunrise, the sky is quite interesting:
Sky above Haleakala

Then, just as the sun rises, the domes of nearby Science City light up, but not yet the ground.
Science City on Haleakala, first rays of the sun

And here is the sunrise itself:
Sunrise on Haleakala

And for those who keep track of such things, there is no cell coverage by ATT at the top of Haleakala, but Verizon works just fine.

Networking during Hurricane Irene

Hello from within our modest tropical storm Irene.  Here it is just windy and rainy.  The power went off about 4 hours ago, right in the middle of the coffee maker cycle.  I dumped the rest of the water in the reservoir into a pan and brought it to a boil on the gas stove, then poured it into the basket. Worked fine.  Without power you have to start the gas stove with a match, and the exhaust fan doesn’t work, but that is OK for minor cooking.
After about 15 minutes, the little UPS on the ethernet switches and FIOS router stopped working.  The FIOS optical network terminal kept running on its own battery.
I suspect this little neighborhood in Wayland is pretty low on NStar’s list of power problems, so I wheeled out the generator to the garage entrance. This is a 6KW electric start machine.  We haven’t needed it for several years, since a round of tree limb trimming in town dramatically improved power reliability.  Unfortunately, the generator battery is ten years old, and hasn’t worked for the last five.  I’ve never been successful in pull starting it unless it was already working, so I gave it a jump start from the DR field mower.
The generator plugs into the house via a 30′ piece of 10-4 cable with 30 Amp connectors.  The house connector in turn is wired to a manual transfer switch that moves 10 circuits from line to generator.  When the house was built, we thought pretty carefully about what to power:
* boiler controls, to permit hot water to work
* refrigerator
* freezer
* kitchen outlets
* outlet near the TV in the family room
* outlets in master bedroom
* outlets near the computer equipment in the basement
* outlet in the study (for my computer!)
* … and I don’t remember where the other two circuits are.  Note to self: find out.
Plus there is 300 feet of 12 gauge extension cord running across the lawn to the neighbor’s house to power their freezer.
This all made sense, but things change, and the house wiring hasn’t.  The FIOS ONT is in the utility room, and there is no generator outlet in there.  So now there is a 25 foot extension cord connecting it to the server outlets.  Similarly, we moved the freezer so now there is another extension cord connecting it to a powered outlet.
The little UPS is a problem. When the power came back on, the UPS didn’t switch back. It just beeps fitfully. Note to self: a cheap UPS from Best Buy is probably worth every penny!
My son Alex was so offended by the lack of power for the family iMac that he’s moved it to the floor of the MBR and figured out which outlet is live. He also moved the Time Capsule that supports upstairs WiFi, and then I had to show him how to interpret the patch panel diagram to get it plugged into a live network port.  Cathy doesn’t approve of kids using the internet during a power outage,  but I figure I should reward initiative.
The home server had been up for 242 days, but it hasn’t restarted.  I will have to go troubleshoot.  The only difficulty with this is that we don’t have DNS service for the inside machines.  For talking to the world, we can just switch to Google’s DNS at 8.8.8.8, which is easy to remember.
The roof is leaking, but it is the place that just happens to drip into the kitchen sink.  Is that good planning or just luck?
I don’t know whether to expect FIOS to stay up long term or not.  The fiber goes to the local CO, which has lots of batteries, but I don’t know if there are active components between here and there, and I don’t know what is upstream from the local CO.
Updates:
The home server came up fine, and if you wait long enough, ssh to it works.  The problem is that its upstream DNS is the server in Win’s basement, which is down right now.
One of the smoke detectors is unhappy about the lack of AC power.  It probably needs a new battery, but it is the one about 14 feet off the ground in the loft.  I can reach it with the extension ladder, but that is out in the rain behind the house.  Ah well.
So far the chicken coop hasn’t blown over, and the run is still standing.  The chickens, sensibly, are staying inside.

Connected-only devices

I write this on a Google Chromebook while flying to San Francisco on Virgin America.
I am happy that Google is trying out this concept, but it is on the wrong side of technology and it’s not what I want.

  • Storage is cheap, communications are not
  • Storage is low power, communications are not
  • Local storage always works, communications do not
  • My use of local storage is private, in the cloud there are watchers
  • Local operations have predictable performance, remote does not

The key issue is that storage is really inexpensive and getting more so.  My three year old phone has 16GB of space. My iPad has 64GB.  The Macbook Air I covet has, well, who knows?  Removing the storage from the device solves a non-problem by introducing serious new problems.  I don’t get it.
My laptop (yes, a Macbook Pro) has a 500 GB drive. When I am disconnected, I can write, I can read, I can watch the movie backlog, I can program. I can learn. I can tag photos. I can do quite a lot. I have pretty much my entire working set with me.  There are a couple of terabytes of other stuff lying around at home, but I don’t need that very often.
The pressing problem with mobile devices is power, not storage.  Why replace a low power storage device, that has predictable and good performance, with a slow, unreliable, communications channel that has a variable cost structure?
There are important roles for cloud storage:  backup, search, bulk processing, but it doesn’t make sense to move active storage to the other side of a high latency low bandwidth channel.  Let’s imagine for a moment that the communications channel is actually reliable and has zero variable costs.  It still has, say, 40 millisecond latency and a megabit or so of bandwidth.  This is going to work fine for email, chat, and so forth. But it cannot be a good video editor, or image browser.  I’ve had the experience of using Aperture to browse a few thousand photos on a local SSD. It is a surreal experience – the closest we’ve yet come to Minority Report.
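To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 40 millisecond latency and one megabit bandwidth are the figures from the paragraph above; the photo count (3000) and the photo size (2 MB) are my own assumptions for illustration, as is the simplification that each photo costs one latency delay plus its transfer time, fetched one at a time.

```python
def fetch_seconds(num_items, bytes_per_item,
                  latency_s=0.040, bandwidth_bps=1_000_000):
    """Estimate the time to fetch items one at a time over a slow link.

    Each item is charged one latency delay plus its transfer time.
    This is a simplified model: no pipelining, caching, or compression.
    """
    transfer_s = bytes_per_item * 8 / bandwidth_bps
    return num_items * (latency_s + transfer_s)


# Assumed workload (not from the post): 3000 photos of 2 MB each.
photos = 3000
photo_bytes = 2_000_000

total = fetch_seconds(photos, photo_bytes)
print(f"{total:.0f} seconds, or about {total / 3600:.1f} hours")
```

Under these assumptions the full set takes about 13.4 hours to pull across the link, almost all of it bandwidth rather than latency. Thumbnails and caching would help, but that is just local storage creeping back in.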
The chromebook is a decent effort. I like the keyboard. The screen is nice, the weight is nice, the battery life is nice, but the lack of storage and a real local filesystem is just silly.

Back in the saddle

I started a new job a couple of months ago, working part time at Quanta Research Cambridge.  I’ll say more about that later, but this post is about bicycles.
My new boss, Steve Heller, mentioned that one could park in downtown Lexington MA for two dollars a day, and take the bike path to Cambridge. From Lexington to the Alewife T station in Cambridge is about 6.5 miles, along a very nice bike trail, then it is another 3.5 miles to Kendall Square, part path to Davis Square, and then down Hampshire Street.
This is a very fun ride, inbound is slightly downhill, 200 feet over 10 miles, with no particular hills.  Outbound is a little uphill, and mostly upwind in the afternoon, but fun.
Now there is another fellow at the lab, Willie Walker, who sometimes bicycle commutes in from New Hampshire, and that is a different matter altogether.  For some reason I thought he lived in the Western suburbs somewhere, so I thought I would try biking in from Wayland to Cambridge, which is about 18 miles each way.
I am not certain of the best route for this, but so far I take Route 20 to the old Boston Post Road to Weston center (4.4 miles) then Church Street up to 117, and 117 back to Route 20 in Waltham. Just past Prospect Park there is the Blue Heron trail that runs along the Charles River, from Waltham to Newton Corner.  From there you can go on the south side of the river along Nonantum Road to the Soldiers Field area, or you can go on Charles River Road and Greenough Drive along the North side of the river.  Both have bike lanes, although Nonantum is under construction.  At JFK Drive, I head in towards Harvard Square, but turn right on Mt Auburn Street and follow it to Central Square, then take Bishop Allen, Ellis, Harvard, and whatever else seems handy over to the office at 1 Kendall Square.
Inbound is easier than outbound: the Waltham hills are steeper on the East side, it is hotter in the afternoon, and it is still upwind. I now look forward to this and try to do it twice a week. When I can also do the Lexington route once a week I am a happy boy with another 100 miles.
It should be straightforward to beat my old SiCortex bicycling goals of 1000 miles a year.  But remember Will?  As of mid July, he’s already at 4000 miles for the year.
Oh yes, along the Blue Heron trail, about a mile and a bit from the Western end, is this beautiful bicycle and pedestrian bridge.

Blue Heron trail bridge