Email Disaster Recovery and Travel Adventures

Cathy is off to China for a few weeks. She wanted email access, but not with her usual laptop.
She uses Windows Vista on a plasticky HP laptop from, well, the Vista era. It is quite heavy, and these days quite flaky. It has a tendency to shut down for no obvious reason, other maybe than age, and being Vista running on a plasticky HP laptop.
I set up the iPad, but Cathy wanted a more familiar experience, and needed IE in order to talk to a secure webmail site, so we dusted off an Asus EEE netbook running Windows XP.
I spent a few hours trying to clear off several years of accumulated crapware, such as three different search toolbars attached to Internet Explorer, then gave up and re-installed XP from the recovery partition. 123 Windows Updates later, it seemed fine, but still wouldn’t talk to the webmail site. It turns out that Asus thoughtfully installed the open source local proxy server Privoxy, with no way to uninstall it. If you run the Privoxy uninstall, it leaves you with no web access at all. I finally found Interwebs advice to also uninstall the Asus parental controls software, and that fixed it.
Next, I installed Thunderbird, and set it up to work with Cathy’s account on the family compound IMAP server. I wanted it to work offline, in case of spotty WiFi access in China, so I “unsubscribed” from most of the IMAP folders and let the rest download. Now Cathy’s inbox had 34,000 messages in it, and I got to thinking “what about privacy?” After all, governments, especially the United States, claim the right to search all electronic devices at the border, and it is also commonly understood that any electronic device you bring to China can be pwned before you come back.
Then I found a setting that tells Thunderbird to download only the last so many days of mail for offline use. Great! But it had already downloaded all 6 years of back traffic. Adjacent to it is a setting to “delete mail more than 20 days (or whatever) old.”
You know what happens next! I turned that on, and Thunderbird started deleting all Cathy’s mail, both locally and on the server. Now there is, farther down the page, fine print that explains this will happen, but I didn’t read it.
Parenthetically, this is an awful design.  It really looks like a control associated with how much mail to keep for offline use, but it is not.  It is a dangerous, unguarded, unconfirmed command that does irreversible damage.
I thought this was taking too long, but by the time I figured it out, it was way too late.
So, how to recover?
I have been keeping parallel feeds of Cathy’s email, but only since March or so, since I’ve been experimenting with various spam suppression schemes.
I had made a copy of Cathy’s .maildir on the server, but it was from 2011.
But wait! Cathy’s laptop was configured for offline use, and had been turned off.  Yes!  I opened the lid and turned off WiFi as quickly as possible, before it had a chance to sync.  (Actually, the HP has a mechanical switch to turn off WiFi, but I didn’t know that.)  I then changed the username/password on her laptop Thunderbird to stop further syncing.
Next, since the horse was well out of the barn, I made a snapshot of the server .maildir, and of the HP laptop’s Thunderbird profile directories. Now, whatever I did, right or wrong, I could do again.
Time for research!
What I wanted to do seemed clear:  restore the off-line saved copies of the mail from the HP laptop to the IMAP server.  This is not a well-travelled path, but there is some online advice:
http://www.fahim-kawsar.net/blog/2011/01/09/gmail-disaster-recovery-syncing-mail-app-to-gmail-imap/
https://support.mozillamessaging.com/en-US/kb/profiles
The general idea is:

  1. Disconnect from the network
  2. Make copies of everything
  3. While running in offline mode, copy messages from the cached IMAP folders to “Local” folders
  4. Reconnect to the network and sync with the server. This will destroy the cached IMAP folders, but not the new Local copies
  5. Copy from the Local folders back to the IMAP server folders

Seems simple, but in my case, there were any number of issues:

  • Not all server folders were “subscribed” by Thunderbird, and I didn’t know which ones were
  • The deletion was interrupted at some point
  • I didn’t want duplicated messages after recovery
  • INBOX was 10.3 GB (!)
  • The Thunderbird profile altogether was 23 GB (!)
  • The HP laptop was flaky
  • Cathy was about to leave town, and needed last-minute access to working email

One thing at a time.
Tools
I found out about “MozBackup” and used it to create a backup copy of the HP laptop’s profile directory.
MozBackup
MozBackup creates a zip file of the contents of a Thunderbird profile directory, and can restore them to a different profile on a different computer, making configuration changes as appropriate. This is much better than hand-editing the various Thunderbird configuration files.
Hardware problems
As I mentioned, the HP laptop is sort of flaky. I succeeded in copying the Thunderbird profile directory, but 23 GB worth of copying takes a long time on a single 5400 rpm laptop disk. I tried copying to a MyBook NAS device, but it was even slower. What eventually worked, not well, but adequately, was copying to a 250 GB USB drive.
I decided to leave the HP out of it, and to do the recovery on the netbook, the only other Windows box available.  I was able to create a second profile on the netbook, and restore the saved profile to it, slowly, but I realized Cathy would leave town before I finished all the steps, taking the netbook with her.  Back to the HP.
First I tried just copying the mbox and .msf files from the IMAPMail subfolder to Local Folders. This seemed to work, but Thunderbird got very confused about it. It said there were 114,000 messages in Inbox, rather than 34,000. This shortcut is a dead end.
I created a new profile on the HP, restored the backup using MozBackup (which took 2 hours), and started it in offline mode. I then tried to “select all” in Inbox to copy the messages to a local folder. Um. No. I couldn’t even get control back. Thunderbird really cannot select 34,000 messages and do anything with them.
Because I was uncertain about the state of the data, I restored the backup again (another 2 hours).
This time, I decided to break up Inbox into year folders, each holding about 7,000 messages. The first one worked, but then the HP did an unexpected shutdown during the second, and when it came back, Inbox was empty! The Inbox mbox file had been deleted.
I did another restore, and managed to create backup files for 2012 and 2011 messages before it crashed again. (And Inbox was gone AGAIN.)
The technique seemed likely to eventually work, but it would drive me crazy.  Or crazier.
I was now accumulating saved Local Folder files representing 3 specific years of Inbox.  I still had to finish the rest, deal with Sent Mail, and audit about 50 other subfolders to see if they needed to be recovered.
I wasn’t too worried about all the archived subfolders, since they hadn’t changed in ages and were well represented by my 2011 copy of Cathy’s server .maildir.
Digression
What about server backups? Embarrassing story here! Back in 2009, Win and I built some nice mini-ATX Atom-based servers with dual 1.5T disks run in mirrored mode for home servers. Win’s machine runs the IMAP server, and mine mostly has data storage. Each machine has the mirrored disks for reliability and a 1.5T USB drive for backup. The backups are irregularly kept up to date, and in the IMAP machine’s case, not recently.
About 6 months ago, I got a family pack of CrashPlan for cloud backup, and I use it for my Macbook and for my (non-IMAP) server, but we had never gotten around to setting up CrashPlan for either Cathy’s laptop or the IMAP server.
A few months ago, we got a Drobo 5N, and set it up with three 3T disks, for 6T usable storage, but we haven’t gotten it working for backup either. (I am writing another post about that.)
So, no useful server backups for Cathy’s mail.
Well now what?
I have a nice Macbook Pro; unfortunately, the 500 GB SSD has 470 GB of data on it, not enough free space for one copy of Cathy’s cached mail, let alone two. I thought about freeing up space, and copied a 160 GB Aperture photo library to two other systems, but it made me nervous to delete it from the Macbook.
I then tried using Mac Thunderbird to set up a profile on that 250 GB external USB drive, but it didn’t work: the FAT filesystem couldn’t handle Mac Thunderbird’s need for fancy filesystem features like ACLs. This triggered an idea, though!
First, I was nervous about using Mac Thunderbird to work on backup data from a PC. I know that Thunderbird profile directories are supposed to be cross-platform, but the config files like profiles.ini and prefs.js are littered with PC pathnames.
Second, the USB drive would be slow even when it did work.
Up until recently, I’d been using a 500 GB external Firewire drive for Time Machine backups of the Macbook. It was still full of Time Machine data, but I’ve switched to using a 1T partition on the Drobo for Time Machine, and I also have the CrashPlan backup. So I reformatted the Firewire drive to HFS, and plugged it in as extra storage.
Also on the Macbook is VMWare Fusion, and one of my VMs is a 25 GB instance running XP Pro.
I realized I should be able to move the VM to the Firewire drive, and expand its storage by another 50 GB or so to have room to work on the 23 GB Thunderbird data.
To the Bat Cave!
It turns out to be straightforward to copy a VMWare image to another place, and then run the copy.  Rather than expand the 25GB primary disk, I just added a second virtual drive and used XP Disk management to format it as drive E.  I also used VMWare sharing to share access to the underlying Mac filesystem on the Firewire drive.

  1. Copy VMWare image of XP to the Firewire drive
  2. Copy the MozBackup save file of the cached IMAP data, and the various Local Folders mailbox files, to the drive
  3. Create second disk image for XP
  4. Run XP under VMWare Fusion on the Macbook, using the Firewire drive for backing store
  5. Install Thunderbird and MozBackup
  6. Use MozBackup to restore Cathy’s cached local copies of her mail from the flaky HP laptop
  7. Copy the Local Folders mailbox files for 2013, 2012, and 2011 into place.
  8. Use XP Thunderbird running under VMWare to copy the rest of the cached IMAP data into Local Folders.
  9. By hand, compare message counts of all 50 or so other IMAP folders in the cached copy with those still on the server, and determine they were still correct.
  10. Go online, letting Thunderbird sync with the server, deleting all the locally cached IMAP data.
  11. Create IMAP folders for 2007 through 2013, plus Sent Mail, and copy the roughly 40,000 emails back to the server.

Notes
During all of this, new mail continued to arrive at the IMAP server, and remained accessible via the instance of Thunderbird on the netbook.
A copy of Cloudmark Desktop One was running on the Macbook, using Mac Thunderbird to do spam processing of arriving email in Cathy’s IMAP account.
My psyche is scarred, but I did manage to recover from a monstrous mistake.
Lessons

  • RAID IS NOT BACKUP

The IMAP server was reliable, but it didn’t have backups that were useful for recovery.

  • Don’t think you understand what a complex email client is going to do

Don’t experiment with the only copy of something! I should have made a copy of the IMAP .maildir in a new account, and then futzed with the netbook Thunderbird to get the offline-use storage the way I wanted.

  • Quantity has a quality all its own.

This quote is usually about massive armies, but in this case the very large mail store (23 GB) made the simplest operations slow, and some things (like selecting all 34,000 messages in a folder) impossible. I had to go through a lot of extra work because various machines didn’t have enough free storage, and had other headaches because the MTBF of the HP laptop was less than the time needed to complete tasks.
-Larry

Hypervisor Hijinks

At my office, we have a rack full of Tilera 64-core servers, 120 of them. We use them for some interesting video processing applications, but that is beside the point. Having 7680 of something running can magnify small failure rates to the point that they are worth tracking down. Something that might take a year of runtime to show up can show up once an hour on a system like this.
Some of the things we see tend, with some slight statistical flavor, to occur more frequently on some nodes than on others. That just might make you think that we have some bad hardware. Could be. We got to wondering whether running the systems at slightly higher core voltages would make a difference, and indeed, one can configure such a thing, but basically you have to reprogram the flash bootloaders on 120 nodes. The easiest thing to do was to change both the frequency and the voltage together, which isn’t the best thing to do, but it was easy. The net effect was to reduce the number of already infrequent faults on the nodes where they occurred, but to cause, maybe, a different sort of infrequent fault on a different set of nodes.
Yow. That is NOT what we wanted.
We were talking about this, and I said about the stupidest thing I’ve said in a long time. It was, approximately:

I think I can add some new hypervisor calls that will let us change the core voltage and clock frequency from user mode.

This is just a little like rewiring the engines of an airplane while flying, but if it were possible, we could explore the infrequent fault landscape much more quickly.
But really, how hard could it be?
Tilera, to their great credit, supplies a complete Multicore Development Environment, which includes the Linux kernel sources and the hypervisor sources.
The Tilera version of Linux has a fairly stock kernel which runs on top of a hypervisor that manages physical chip resources and such things as TLB refills. There is also a hypervisor “public” API, which is really not that public: it is available only to the OS kernel. The Tilera chip has 4 protection rings. The hypervisor runs in kernel mode, the OS runs in supervisor mode, and user programs can run in the other two. The hypervisor API has things like “load this page table context” or “flush this TLB entry”, and so forth.
As part of the boot sequence, one of the things the hypervisor does is to set the core voltage and clock frequency according to a little table it has. The voltage and frequency are set together, and the controls are not accessible to the Linux kernel or to applications. Now it is obviously possible to change the values while running, because that is what the boot code does. What I needed to do was to add some code to the hypervisor to get and set the voltage and frequency separately, while paying attention to the rules implicit in the table. There are minimum and maximum voltages and frequencies beyond which the chip will stop working, and there are likely values that will cause permanent damage. There is also a relation between the two – generally higher frequencies will require higher voltages. Consequently it is not OK to set the frequency too high for the current voltage, or to set the voltage too low for the current frequency.
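To make those rules concrete, here is a minimal sketch of the kind of guarded setters this implies inside the hypervisor. Everything here is hypothetical: the limits, the required_mv() relation, and the set_pll()/set_vrm() pokes all stand in for Tilera’s real table and register writes.

/* Sketch of guarded voltage/frequency setters.  All numbers and the
 * set_pll()/set_vrm() helpers are made-up stand-ins. */

#define MIN_MHZ  500
#define MAX_MHZ  900
#define MIN_MV   850
#define MAX_MV  1200

static int cur_mv  = 1000;   /* current core voltage, millivolts */
static int cur_mhz =  700;   /* current clock frequency, MHz */

static void set_pll(int mhz) { (void)mhz; }   /* hardware poke (stub) */
static void set_vrm(int mv)  { (void)mv;  }   /* hardware poke (stub) */

/* Minimum voltage needed at a given frequency (stand-in for the table). */
static int required_mv(int mhz)
{
  return 850 + (mhz - MIN_MHZ) / 4;
}

int syscall_set_frequency(int mhz)
{
  if (mhz < MIN_MHZ || mhz > MAX_MHZ)
    return -1;                  /* outside the table entirely */
  if (required_mv(mhz) > cur_mv)
    return -1;                  /* too fast for the current voltage */
  set_pll(mhz);
  cur_mhz = mhz;
  return 0;
}

int syscall_set_voltage(int mv)
{
  if (mv < MIN_MV || mv > MAX_MV)
    return -1;                  /* risks stopping or damaging the chip */
  if (mv < required_mv(cur_mhz))
    return -1;                  /* too low for the current frequency */
  set_vrm(mv);
  cur_mv = mv;
  return 0;
}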
Fine. Now I have subroutine calls inside the hypervisor. In order to make them available to a user mode program running under Linux, I have to add hypervisor calls for the new functions, and then add something like a loadable kernel module to Linux to call them and to make the functionality available to user programs.
The kernel piece is sort of straightforward. One can write a loadable kernel module that implements something called sysfs. These are little text files in a directory like /sys/kernel/tilera/ with names like “frequency” and “voltage”. Through the magic of sysfs, when an application writes a text string into one of these files, a piece of code in the kernel module gets called with the string. When an application reads one of these files, the kernel module gets called to provide the text.
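Here is a minimal sketch of such a module, assuming trampolines named qrc_hv_get_voltage() and qrc_hv_set_voltage() (the trampolines appear further down; the set_voltage signature is my guess), showing only the voltage file and eliding most error handling:

#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Trampolines into the hypervisor (see below). */
extern int qrc_hv_get_voltage(void);
extern int qrc_hv_set_voltage(int mv);   /* hypothetical signature */

static ssize_t voltage_show(struct kobject *kobj,
                            struct kobj_attribute *attr, char *buf)
{
        /* Called when an application reads the sysfs file. */
        return sprintf(buf, "%d\n", qrc_hv_get_voltage());
}

static ssize_t voltage_store(struct kobject *kobj,
                             struct kobj_attribute *attr,
                             const char *buf, size_t count)
{
        /* Called with the string an application wrote to the file. */
        int mv;
        if (sscanf(buf, "%d", &mv) != 1)
                return -EINVAL;
        qrc_hv_set_voltage(mv);
        return count;
}

static struct kobj_attribute voltage_attr =
        __ATTR(voltage, 0644, voltage_show, voltage_store);

static struct kobject *tilera_kobj;

static int __init qrc_init(void)
{
        /* Creates /sys/kernel/tilera/voltage */
        tilera_kobj = kobject_create_and_add("tilera", kernel_kobj);
        if (!tilera_kobj)
                return -ENOMEM;
        return sysfs_create_file(tilera_kobj, &voltage_attr.attr);
}

static void __exit qrc_exit(void)
{
        kobject_put(tilera_kobj);
}

module_init(qrc_init);
module_exit(qrc_exit);
MODULE_LICENSE("GPL");

After insmod, cat /sys/kernel/tilera/voltage reads the current value, and echo 1100 > /sys/kernel/tilera/voltage sets it.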
Now, with the kernel module at the top, and the new subroutines in the hypervisor at the bottom, all I need to do is wire them together by adding new hypervisor calls.
Hypervisor calls made by Linux go through hypervisor glue. The glue area starts at 0x10000 above the base of the text area, and each possible call has 0x20 bytes of instructions available.
Sometimes, as with “nanosleep”, the call is implemented inline in those 0x20 bytes. Mostly, though, the code in the glue area loads a register with a call number and does a software interrupt.
The code that builds the glue area is in hv/tilepro/glue.S.
For example, the nanosleep code is

hv_nanosleep:
        /* Each time through the loop we consume three cycles and
         * therefore four nanoseconds, assuming a 750 MHz clock rate.
         *
         * TODO: reading a slow SPR would be the lowest-power way
         * to stall for a finite length of time, but the exact delay
         * for each SPR is not yet finalized.
         */
        {
          sadb_u r1, r0, r0
          addi r0, r0, -4
        }
        {
          add r1, r1, r1 /* force a stall */
          bgzt r0, hv_nanosleep
        }
        jrp lr
        fnop
while most others are

GENERIC_SWINT2(set_caching)
or the like, where GENERIC_SWINT2 is a macro:

#define GENERIC_SWINT2(name) \
        .align ALIGN ;\
hv_##name: \
        moveli TREG_SYSCALL_NR_NAME, HV_SYS_##name ;\
        swint2 ;\
        jrp lr ;\
        fnop
The glue.S source code is written in a positional way, like
GENERIC_SWINT2(get_rtc)
GENERIC_SWINT2(set_rtc)
GENERIC_SWINT2(flush_asid)
GENERIC_SWINT2(flush_page)

so the actual address of the linkage area for a particular call like flush_page depends on the exact sequence of items in glue.S. If you get them out of order or leave a hole, then the linkage addresses of everything later will be wrong. So to add a hypercall, you add items immediately after the last GENERIC_SWINT2 or ILLEGAL_SWINT2.
In the case of the set_voltage calls we have:

ILLEGAL_SWINT2(get_ipi_pte)
GENERIC_SWINT2(get_voltage)
GENERIC_SWINT2(set_voltage)
GENERIC_SWINT2(get_frequency)
GENERIC_SWINT2(set_frequency)

With this fixed point, we work in both directions, down into the hypervisor to add the call and up into linux to add something to call it.
Looking back at the GENERIC_SWINT2 macro, it loads a register with the value of a symbol like HV_SYS_##name where name is the argument to GENERIC_SWINT2. This is using the C preprocessor token-pasting operator ##, which concatenates tokens. So

GENERIC_SWINT2(get_voltage)

expects a symbol named HV_SYS_get_voltage. IMPORTANT NOTE – the value of this symbol has nothing to do with the hypervisor linkage area, it is only used in the swint2 implementation. The HV_SYS_xxx symbols are defined in hv/tilepro/syscall.h and are used by glue.S to build the code in the hypervisor linkage area and also used by hv/tilepro/intvec.S to build the swint2 handler.
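Concretely, the syscall.h additions look something like this. A sketch: the get_voltage number matches the objdump output shown below, but the other three are my guesses at the next free slots.

/* hv/tilepro/syscall.h -- sketched additions.  These are the swint2
 * call numbers, NOT the dispatch numbers in hypervisor.h. */
#define HV_SYS_get_voltage    58
#define HV_SYS_set_voltage    59
#define HV_SYS_get_frequency  60
#define HV_SYS_set_frequency  61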
In hv/tilepro/intvec.S we have things like

syscall HV_SYS_flush_all,     syscall_flush_all
syscall HV_SYS_get_voltage,   syscall_get_voltage

in an area called the syscall_table with the comment

// System call table.  Note that the entries must be ordered by their
// system call numbers (as defined in syscall.h), but it's OK if some numbers
// are skipped, or if some syscalls exist but aren't present in the table.

where syscall is a Tilera assembler macro:

.macro  syscall number routine
      .org    syscall_table + ((number) * 4)
      .word   routine
      .endm

And indeed, the use of .org makes sure that the offset of the entry in the syscall table matches the syscall number. The second argument is the symbol, defined elsewhere in the hypervisor sources, of the code that implements the function.
In the case of syscall_get_voltage, the code is in hv/tilepro/hw_config.c:

int syscall_get_voltage(void)
{
  return(whatever);
}

So at this point, if something in the linux kernel manages to transfer control to text + 0x10000 + whatever the offset of the code in glue.S is, then a swint2 with argument HV_SYS_get_voltage will be made, which will transfer control in hypervisor mode to the swint2 handler, which will make a function call to syscall_get_voltage in the hypervisor.
But what is the offset in glue.S?
It is whatever you get incrementally by assembling glue.S, but in practice, it had better match the values given in the “public hypervisor interface” which is defined in hv/include/hv/hypervisor.h
hv/include/hv/hypervisor.h has things like

/** hv_flush_all */
#define HV_DISPATCH_FLUSH_ALL                     55
#if CHIP_HAS_IPI()
/** hv_get_ipi_pte */
#define HV_DISPATCH_GET_IPI_PTE                   56
#endif
/* added by QRC */
/** hv_get_voltage */
#define HV_DISPATCH_GET_VOLTAGE               57

and these numbers are similar to, but not identical to, those in syscall.h. Do not confuse them!
Once you add the entries to hypervisor.h, it is a good idea to check them against what is actually in the glue.o file. You can use tile-objdump for this:

tile-objdump -D glue.o

which generates:

...
00000700 <hv_get_ipi_pte>:
     700:	1fe6b7e070165000	{ moveli r0, -810 }
     708:	081606e070165000 	{ jrp lr }
     710:	400b880070166000 	{ nop ; nop }
     718:	400b880070166000 	{ nop ; nop }
00000720 <hv_get_voltage>:
     720:	1801d7e570165000	{ moveli r10, 58 }
     728:	400ba00070166000 	{ swint2 }
     730:	081606e070165000 	{ jrp lr }
     738:	400b280070165000 	{ fnop }
...

and if you divide hex 720 by hex 20 you get the dispatch number. I use bc for this sort of mixed-base calculating:

stewart$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
ibase=16
720/20
57
^Dstewart$

and we see that we got it right: the linkage number for get_voltage is indeed 57.
Now let’s turn to Linux. The architecture-dependent stuff for Tilera is in src/sys/linux/arch/tile.
The idea is to build a kernel module that will implement a sysfs interface to the new voltage and frequency calls.
The module get and set routines will call hv_set_voltage and hv_get_voltage.
The hypervisor call linkage is done by linker magic, via a file arch/tile/kernel/hvglue.lds, which is a linker script. In other words, the kernel has no definitions for these hv_ symbols; they are defined at link time by the linker script. For each hv call, it has a line like

hv_get_voltage = TEXT_OFFSET + 0x10740;

and you will recognize our friend 0x740 as the offset of this call in the hypervisor linkage area. Unfortunately, this doesn’t help with a separately compiled module, because a module has no way to use such a script. (When I try it, TEXT_OFFSET is undefined; presumably it is defined in the kernel’s main linker script.)
So to make a hypervisor call from a loadable module, you need a trampoline. I put them in arch/tile/kernel/qrc_extra.c, like this

int qrc_hv_get_voltage(void)
{
  int v;
  printk("Calling hv_get_voltage()\n");
  v = hv_get_voltage();
  printk("hv_get_voltage returned %d\n", v);
  return(v);
}
EXPORT_SYMBOL(qrc_hv_get_voltage);

The EXPORT_SYMBOL is needed to let modules use the function.
But where did hvglue.lds come from? It turns out it is not built by any Makefile, but rather by a perl script, sys/hv/mkgluesyms.pl, except that this perl script can write either assembler or linker script output, and I had to modify it to select the right branch. The modified version is mkgluesymslinux.pl and is invoked like this:

perl ../../hv/mkgluesymslinux.pl ../../hv/include/hv/hypervisor.h >hvglue.lds

The hints for this come from the sys/bogux/Makefile which does something similar for the bogux example supervisor.
linux/arch/tile/include/hv/hypervisor.h is a near copy of sys/hv/include/hv/hypervisor.h, but they are not automatically kept in sync.
Somehow I think that adding hypervisor calls is not a frequently exercised path.
To recap, you need to:

  • have the crazy idea to add hypervisor calls to change the chip voltage at runtime
  • edit hypervisor.h to choose the next available hv call number
  • edit glue.S to add, in just the right place, a macro call which will have the right offset in the file to match the hv call number
  • edit syscall.h to create a similar number for the SWINT2 interrupt dispatch table
  • edit intvec.S to add the new entry to the SWINT2 dispatch table
  • create the subroutine to actually be called from the dispatch table
  • run the magic perl script to transform hypervisor.h into an architecture dependent linker script to define the symbols for the new hv calls in the linux kernel
  • add trampolines for the hv calls in the linux kernel so you can call them from a loadable module
  • write a kernel module to create sysfs files and turn reads and writes into calls on the new trampolines
  • write blog entry about the above


How are non-engineers supposed to cope?

The Central Vac

Today the central vacuum system stuck ON.
The hose was not plugged in, and toggling the kick-plate outlet in the kitchen did not fix it.  That accounted for all the external controls.
The way this works is there is a big cylinder in the basement with the dust collection bin and large fan motor to pull air from the outlets, through the bin, and outside the house.  This is a great way to do vacuuming, because all the dusty air gets exhausted outside.
The control for the fan motor is low voltage that comes to two pins at each outlet. When you plug in the hose, the pins are extended through the hose by spiral wires that then connect to a switch at the handle. You can also activate the fan by shorting the pins in the outlet with a coin. Each outlet has a cover held closed by a spring. You open the cover to insert the hose. The covers generally keep all the outlets sealed except the one with the hose plugged in.
The outlets are all piped together with 1 1/2 inch PVC pipe to the inlet of the  central unit.  The contact pins at all the outlets are connected in parallel, so shorting any of them turns on the motor.
We also have a kickplate outlet in the kitchen – turn it on and sweep stuff into it.  The switch for that is activated by a lever that also uncovers the vacuum pipe.
I ran around the house to make sure nothing was shorting the terminals in the outlets.
Next, I went to the cellar to look at the central unit. Unplugging it made it stop (good!) but plugging it back in made it start again. That was not good.
I noticed that the control wires were connected to the unit via quick connects, so I unplugged them.  The unit was still ON, which meant the fault was inside the central unit.
I stood on a chair and (eventually) figured out that the top comes off, it is like a cookie tin lid.  Inside the top was the fan motor (hot!) and some small circuit board with a transformer, some diodes, and a black block with quick connect terminals.  The AC power went to the block and the motor wires went to the block.  I imagine that the transformer and the diodes produce low voltage DC for the control circuit, and the block is a relay activated by the low voltage.
Relays can stick ON if their contacts wear and get pitted, or there could be a short that applied power to the relay coil.
I blew the dust off the circuit board, and gave the block a whack with a stick.
That fixed it.
I just don’t see what a non-engineer would do in this situation, except let the thing run until the thermal overload tripped in the fan motor (I hope it has one!) and call a service person. Even if the service folk know how to fix it without replacing the whole unit, it is going to cost $80 to $100 for a service call.
I don’t have any special home-vacuum-system powers, but I have a general idea how it works, and enough comfort with electricity that I don’t mind taking the covers off things. This time it worked out well.

The Dishwasher

For completeness, I should relate the story of our Kitchenaid dishwasher.  One day something went wrong with the control panel, so I took it apart.  It wasn’t working, and I thought I couldn’t make it much worse.  I was wrong about that.
I didn’t really know the correct disassembly sequence, and I took off one too many screws. The door was open flat, and taking off the last screw let the control panel fall off, tearing a kapton flex PC board cable in two. The flex cable connected the panel to some other circuit board. I spent a couple of days carefully trying to splice the cable by scraping off the insulation and soldering jumpers to the exposed traces, but I couldn’t get the jumpers to stick. New parts would have cost about $300, and the dishwasher wasn’t that new. We eventually just bought a new Miele and that was the Right Thing To Do, because the Miele is like a zillion times better. It has a built-in hard-water softener, and doesn’t etch the glasses, and doesn’t melt plastic in the lower tray, and is generally awesome.
So OK, sometimes you can fix it yourself, and sometimes you should really just call an expert.  How are you supposed to know which is the case?

The Garage Door Opener

Every few years, the opener stopped working.  It would whirr, but not open the door.  The first time this happened, I took it apart.  Now you should be really careful around garage door openers, because there is quite a lot of energy stored in the springs, but if you don’t mess with the springs, the rest of it is just gears and motors and stuff.
On mine, the cover comes off without disconnecting anything.  Inside there is a motor which turns a worm gear, which turns a regular gear, which turns a spur chain wheel, which engages a chain, which carries a traveller, which attaches to the top of the door.  The door is mostly counterbalanced by the springs.  With the cover off, you could see that the (plastic) worm gear had worn away the plastic main gear, so the motor would just spin.  The worm also drove a screw that carries along some contacts which set the full open and full closed travel, which stops and reverses the motor.  The “travel” adjustments just move the fixed contacts so the moving contacts hit them earlier or later.
An internet search located gear kits for 1/3 or 1/4 the price of a new motor, and I was able to fix it.
Last time the opener stopped working, however, the symptoms were different – no whirring. The safety sensors appeared to be operational, because their pilot lights would blink when you blocked the light beam. I suspected the controller circuit board had failed. A replacement for that would be about 1/2 the cost of a new motor unit, and I wasn’t positive that was the trouble, so I just replaced the whole thing. The new one was nicely compatible with the old tracks, springs, and sensors.
A few weeks later, my neighbor’s opener failed in the whirring mode, so we swiped the gears from my old motor unit with the bad circuit board and fixed it for free.

Take aways

Don’t be afraid to take things apart, at least if you have a reasonable expectation that you are not going to make it worse.
Or – Good judgement comes from experience, but experience comes from bad judgement. (Mulla Nasrudin)
… and just maybe, go ahead and get service contracts for complicated things with expensive repair parts, like that Macbook Pro or HE washing machine, particularly when the most-likely-to-fail part is electronic in nature.
So I usually get AppleCare, and we have a service contract for the new minivan, and for the washing machine, but not for the clothes dryer, since it doesn’t appear to have any electronics inside. When the dryer did break, I was able to fix it by replacing the clock switch myself.
But how are non-engineers supposed to cope?

What I do

I used Splasho’s “Up-Goer Five Text Editor” to write what I do, using only the most common 1000 words in English.
In my work I tell computers what to do. I write orders for computers that tell them first to do this, and then to do that, and then to do this again.
Sometimes the orders tell the computer to listen for other orders from people. Then the orders tell the computer how to do what the people want, and then the orders tell the computer to show the people what the answer is.
I used to build computers. I would take one part, and another part, and many more parts, and put them together in just the right way so the computer would work right. Computers are all the same, they listen for an order, then do what it says, then listen for another order. We use them because they do this thing very very very very fast.

Equal Protection of the Law

I’ve been casting about for a way to follow up on my outrage at the government’s treatment of Aaron Swartz.
I wonder if the government’s conduct represents a violation of the equal protection clause of the constitution.
The 14th amendment says

…nor shall any State deprive any person of life, liberty, or property, without due process of law; nor deny to any person within its jurisdiction the equal protection of the laws.

Evidently this doesn’t apply to the federal government as written, but in Bolling v. Sharpe in 1954, the Supreme Court got to the same point via the Due Process clause of the 5th amendment.
I think all governments, state, federal, and local, are bound to provide equal protection.
In the Swartz case, we have the following mess:

  • Congress writes vague laws
  • Congress fails to update those laws as technology and society evolve
  • Prosecutors use their discretion to decide who to charge
  • Prosecutors use pre-trial plea bargaining to avoid the scrutiny of the courts

It would be nice to have a case before the Supreme Court, leading to a clear ruling that equal protection applies to the actions of prosecutors. I suspect that would also give us proportional responses to crimes, although I am not sure about that.
In the medium term, Congress needs to act.  I’d suggest a law repealing all laws more than 20 years old.  Sunset provisions need to be in all laws. The ones that make ongoing sense can be reauthorized, but it will take a new vote every time.  (Maybe laws forbidding action by the government should be allowed to stand indefinitely, while laws forbidding action by the people will have limited terms.)
In the short term, we need action by the executive branch, to provide equal protection, control of pre-trial behavior of prosecutors, and accountability of both prosecutors and law enforcement.

AT&T Hell

Summary – AT&T customer service gives you bad information, tries to fix it and can’t, then lies about how it is “impossible”.
Update summary – Twitter works!  AT&T twitter team seems to have fixed the remaining problem.
“We don’t care, we don’t have to.” – Lily Tomlin
When I worked for IBM one summer, I wore a tie every day to see if I could do it.
When I drove an RX-7 in Palo Alto, I obeyed all the speed limits, to see if I could do it.
Last month I gave up my iPhone, to see if I could do it.
My daughter wanted an iPhone, but she’s in the middle of a two year contract on T-Mobile with a Palm Pixi.  My iPhone 4S is in the middle of a two year contract with AT&T that started October 2011.  It had the grandfathered unlimited data plan, and would be up for upgrade eligibility in May 2013.
On December 26, I called AT&T to see if I could port my number out and get a new number assigned to the iPhone, so I could let my daughter use it, while I would keep the T-Mobile phone, but with my number.  My number started out life a long time ago as a Verizon landline, with the number sequential to our home phone, so I am attached to it.  It is also on all my business cards and in countless contact lists.
AT&T said “sure”, when you port the number out, we’ll assign a new number to the iPhone and the contract will remain unchanged.
Life was good!  The daughter is happy, and I have a phone that is, um, interesting.  I also have an iPad, so don’t shed any tears about that!
A week or so later, we noticed that the bill was $400ish. There was an early termination charge on there! You can’t actually figure out what the charge is from the online presentation. You have to hunt up the PDF and look at the image of the printed bill. This is a phone company: they know how to print phone bills, not how to build websites.
On the phone with customer service. “When you ported out the number, that cancels your contract, and you get an early termination fee. Then you added a new line with new contract dates.” I explained my call on the 26th, and the agent said, oh, well, I can waive the early termination fee and make the contract be as it was; the only thing I can’t do is preserve the unlimited data plan. So now the phone is on the 3 GB plan. I thought about balking; that unlimited plan made me feel like an old-time iPhone user, more privileged than the unwashed masses, but really, my usage is about 250 MB per month. The iPad has a bigger screen. So I let it slide.
A few days later, a website check showed the fees gone. I noticed that the upgrade availability wording was different for this phone than for the other iPhone line, which also started October 2011, but decided to wait to see if other changes would catch up before calling.
A few days later, no change.  Called and learned that the second agent had waived the fees, but not fixed the contract dates.  I was assured that all would be fixed, and notes put in the account.
A few days later, no change to the upgrade language.  On calling, I was told that the contract would expire October 2013, as expected, but the upgrade eligibility date was July 2014.  What does that even mean?  After the contract is over, I can just create a new line, with a new contract and phone, and port the number!  It makes no sense to have an upgrade eligibility after the contract expiration.  Anyway, this is just stupid.  I explained that I had been told “the contract would be as it was” but the agent said there was just no way to change that in his system, the upgrade eligibility is tied to the phone number, not to the contract.
[By the way, this is also a lie, because, for example, if you are being stalked, you can request a new number and get it without any such collateral damage.]
I asked for a supervisor, who said
This should never have been allowed in the first place.  You can’t port out a number and keep the contract. It is our number.  The agents who tried to “fix” it for you went way outside our policies and made it worse.  What they should have done to correct their original mistake was to port your number back in, not to try and fix the contract. It can’t be fixed, it is impossible to change an upgrade eligibility date. It is tied to the phone number.
The supervisor said there were no higher supervisors to talk to, and no physical mail address to send a complaint to.
Well.  This supervisor was certainly polite, but either was really unable to fix the problems that AT&T created, or unwilling to do so.
At the moment, I have a nice iPhone, with a pleased daughter, but I am not pleased.  I made a perfectly sensible request.  I was told “Yes, of course you can do that” and now the account is scrambled beyond belief.
Recapping

  • iPhone 4S, 14 months into a 24-month contract.
  • I ask to port out my number, and get a new number assigned to the phone, without contract changes. I’m not paying them any less, and I am not getting a new phone, just changing a few bits in a database somewhere about which number goes with which phone!
  • AT&T says “yes”
  • AT&T charges an early termination fee and an activation fee, cancels my unlimited data plan, restarts the 2 year contract, and resets the upgrade eligibility date. I am not even angry about the activation fee; they deserve some fee for the work.
  • I complain. AT&T waives the early termination fee, promises to fix the contract, but doesn’t
  • I complain. AT&T promises to fix the contract, but only fixes the contract termination dates; the upgrade date is now 9 months after the contract expires
  • I complain. AT&T says “impossible to fix”
  • AT&T supervisor says “impossible to fix, and there is no one higher than me to ask”
The only thing that an upgrade date after contract expiration might mean is that AT&T would refuse to unlock the phone until it is 2 3/4 years old.  That would piss me off, but I don’t even want to ask them right now.

And by the way, the iPhone battery doesn’t work as well as it used to, and that 18 month upgrade was starting to look pretty attractive!  Instead, I will likely have to pay Apple $79 to fix it.  At least that is cheaper than the $99 Applecare I forgot to get, if nothing else goes wrong with the phone.
Now I am not a phone company marketing person, but I think I understand the essential economics of subsidized phones.  AT&T gives a substantial discount on the phone in trade for a contract commitment.  In fact, this is still a worse deal for the customer than buying an unlocked phone on a carrier with cheaper plans, like Virgin or T-Mobile, but AT&T doesn’t discount the monthly charges if you bring your own device.  That is just another way to screw the consumer.  So with AT&T, you may as well get the subsidy if you don’t mind sticking around for two years.  And they really make their money back so quickly that they let you upgrade (and restart the two year clock) after 18 months.
This is a simple deal – AT&T discounts the phone, I promise to keep paying their (high) monthly bills for two years.  This has nothing to do with the phone number!  Changing the number has utterly no effect on the money flows.
What about that number?  AT&T says it is their number, they can attach whatever they want to it.  But that is not true.  I had the number with Verizon. I ported the number to AT&T, I ported it out.  The FCC has “local number portability”.  The numbers are managed by CLECs (I think that is the term of art for phone companies) but they really can’t be taken away from users except for some arcane technical reasons.
What has happened here?  It cannot be “impossible” to fix these sorts of problems.  There may be software limitations, but those are fixable.  Or they could merely write a note to themselves saying “Yes, the system says this contract runs until July 2014, but when the customer asks, in May 2013, for an upgrade, just waive the fees.  And when the customer cancels the contract in October 2013, waive any cancellation fee.”
Instead, they’ve spent a lot of money on customer service phone calls, which are not cheap. They’ve enraged a long-standing customer who has alternatives. They’ve provided more information to the entire internet about just how bad their service and systems are.  There is no good result for AT&T here. They’ve not gained any income. They haven’t kept control of their precious number. They may well lose me as a customer come October.  (That Nexus 4 on T-Mobile is looking pretty good, or a nice unlocked iPhone 5S or whatever.) And they are defending positions and policies that make no sense competitively or economically.
I’m not sure of the next step for me.  Probably I will tweet the URL of this blog entry to @ATTCustomerCare.  At this point, AT&T can fix the problems, or they can provide me a source of continuing amusement.  There’s a rumor that sometimes people get results by writing the CEO.  At a minimum that will cost them even more money to deal with my letter.
UPDATE – I tweeted this URL to @ATTCustomerCare and they actually answered, got me on the telephone, and fixed this, well enough.  Which is to say they can’t fix it in the database, but they’ve added a special note telling other folks to honor an upgrade request on or after the correct date.  Works for me.  (1/16/2013)
You can sort of understand how enterprise software can become unwieldy, to the point where it seems easier to correct software problems and poor specifications by adding layer upon layer of special fixes and exceptions and end-runs, but it is not good for customers or efficiency to do it that way.

Buying a lemon

Last month we got a shiny new Stop and Shop grocery store here in Wayland.  They’ve been having various grand opening specials so we have been dropping by.  I went over there Sunday evening to buy blueberries (two pints for $3! in January!) but they were out of stock.  I managed to leave the shopping list at home, so I had to go by my wits, which is really not such a good idea.
I checked out using the ScanIt! gadget, and this time I remembered to wait for the coupon accepted tone before dropping my coupon in the slot.  Last time I had to have staff fish my should-have-worked coupon out of the guts of the machine and fix it, but I digress.
After finishing, I called Cathy to see what I had forgotten and she told me to remember to get a lemon and to get a rain check for the 10/$10 frozen vegetables they had run out of.  (I already had a rain check for the blueberries).
I didn’t get another ScanIt! machine for one lemon, so I went over to produce and picked out a nice lemon. 66 cents each! Should be 50. I carefully put it on the scale, typed in the produce code, and entered my quantity. The machine printed a scannable sticker, which I stuck on the lemon.
At the self-checkout I scanned the lemon and touched “pay”. While the machine thought about it, I got exact change from my wallet and began to feed in coins. Around about 55 cents in, I noticed the amount due was $4.03. There was no cancel button. At that point I looked at the lemon, and the sticker said “7” rather than “1”. I think the produce machine must have a calculator-style keypad, with 7 at the upper left, rather than a phone keypad with 1 at the upper left.
I blame 1200 baud modem training. In those days, you typed way ahead of the computer, and since you knew what it was going to do, there was no real need to actually look at the screen when it caught up.
At this point, there was nothing to do but press the I Need Help button and look sheepish.
A nice girl with bright orange hair came over and I explained. I think this was a new one for her. She scanned her superuser card and after flipping through some screens said “I don’t think there is any way to change an order after you start paying…. But I can refund the money.”
[Side note: The machine refunded a different collection of coins that happened to add up to 55 cents, rather than returning my coins.  I suppose this lets you overload the change and the refund mechanism.]
After she left, I entered 1 lemon through the produce lookup screens, and again hit pay, and started putting my coins in.  This time, after a few coins, the machine said $5 something or other to go.  I had done it again!  Evidently the 7 virtual lemons were still on the tab, as well as the one real lemon.  I had to call for help again.
The same girl with the bright orange hair came over, and apologized to me, apparently for my being an idiot, and this time refunded the money, and deleted the 7 lemon line item, leaving only one lemon.  I successfully paid, and fled.
It is a mixed blessing that the store was essentially deserted.  No one was there to watch my performance, but neither was there any press of work to distract the staff from chuckling over the befuddled customer.
And I forgot to get the rain check for the frozen peas.

Aaron Swartz

Aaron Swartz, 26, committed suicide the other day, evidently hounded to his death by overzealous prosecutors.
I didn’t know Mr. Swartz, and I don’t condone his actions of a couple of years ago, where it is alleged that he attached equipment to the MIT computer network to steal academic articles from the JSTOR database in order to release them to the public.
However, the more I learn about the conduct of the government in prosecuting Mr. Swartz, the angrier I get.
For those lacking any context, go read what Larry Lessig had to say in

http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

or what Cory Doctorow had to say in

http://boingboing.net/2013/01/12/rip-aaron-swartz.html

Here is the letter I’ve sent to my Senator, Elizabeth Warren. I’ve sent a similar letter to Sen. John Kerry.

I call to your attention the recent suicide of Aaron Swartz.  It looks
very much to me like the US Justice Department hounded him to his
death by overzealous prosecution of a victimless “crime” if it even was
a crime.

Larry Lessig writes on the case:
http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully
I would like to know what you are doing to hold the prosecutors and
their bosses at Justice to account for this affair.
I voted for you in part for your history of representing the issues
of ordinary people against big business.  Please also represent us
against the oppressive power of government.
-Larry Stewart
I’ve sent the following email to Rafael Reif, President of MIT

I understand that the Swartz affair started before you became president of MIT, but I think you should explain to the community what happened, why it happened, and exactly what principles MIT holds.

From what I’ve heard, MIT provided the pretext necessary for the US Attorney ****** to hound Aaron Swartz to his death.

 See, for example, Larry Lessig’s account at

http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

It may well be that Mr. Swartz was guilty of something, and it may be that MIT favored prosecution, but once MIT started such a ball rolling MIT became responsible in part for the damage it caused.  At the minimum, MIT had an obligation to track the case and to speak out loudly when it began to go off the rails of proportional justice in such a dramatic way.

-Larry Stewart ’76

(name removed because I am not sure I got it right)

I don’t know what the right answers are in this case, but I am beginning to think we should handle failures of justice in the same way we handle airplane crashes.  Do we need an equivalent of the National Transportation Safety Board to investigate?  Such a group could find out what happened, why it happened, and what legal, procedural, training, and technical measures are needed to keep it from happening again.  And their reports and proceedings should be open.
We now have so many laws and crimes, and so many are ill-defined, that likely everybody is “guilty” of something.  When the full oppressive power of government can be brought to bear on anyone at the discretion of individuals or groups with their own agenda, then no one is safe.

UPDATE

About an hour after I wrote to MIT President Reif, he wrote to the community.  Obviously he’s well ahead of me on this one, since his message must have already been in progress.   Professor Hal Abelson will be leading a thorough analysis of MIT’s involvement.  I await the report with interest.
http://web.mit.edu/newsoffice/2013/letter-on-death-of-aaron-swartz.html

Another thing not to do

At the day job, I’ve been writing a new version of nbd-client. Instead of handing an open TCP socket to the kernel, it hands the kernel one end of a Unix domain socket and keeps the other end for itself. This creates a block device where the data is managed by a user-mode program on the same system.
In regular nbd-client, the last thing the program does is call ioctl(fd, NBD_DO_IT), which doesn’t return.  The thread is used by the device driver to read and write the socket without blocking other activity in the block layer.
Because I need the program around to do other work, I called pthread_create to make a thread to call the ioctl.
Then I ran my program under gdb (as root!).
In another window, I typed dd if=/dev/nbd0 bs=4096 count=1
In the gdb window I saw
nbd-userland.c:525: server_read_fn: Assertion `0' failed.
and my dd hung, and the gdb hung, and neither could be killed by ^C.
I was able to get control back by using the usual big hammer, kill -9 <gdb>
So what happened?  My user mode thread hit an assertion, and gave control to gdb, which tried to halt the other threads in the process, which didn’t work because the thread in the middle of the ioctl was in the middle of something uninterruptible, and the gdb thread trying to do this also became uninterruptible while waiting.
It is going to be hard to debug this program like this.
The fix, however, is fairly clear: use fork(2) instead of pthread_create() to make a separate process to call the ioctl. It will be isolated from the part of the program hitting the assertion.
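A minimal sketch of that arrangement, with error handling elided; nbd_fd is the already-open device file descriptor:

#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/nbd.h>

/* Run the blocking NBD_DO_IT ioctl in a child process, so an assertion
 * failure (or gdb stopping threads) in the main program cannot wedge it. */
static pid_t start_do_it(int nbd_fd)
{
        pid_t pid = fork();
        if (pid == 0) {
                /* Child: does not return until the device is disconnected. */
                ioctl(nbd_fd, NBD_DO_IT);
                _exit(0);
        }
        return pid;   /* parent keeps serving its end of the socket */
}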
Older and wiser,
Larry
By the way, when you are trying to figure out where processes are stuck, look at the “wchan” field of ps axl.  It will be a kernel symbol that will give you a clue about what the thread is waiting for.
UPDATE
Experience is what lets you recognize a mistake when you make it again.
The underlying bug was sending too much data on the wire.  Like this:
struct network_request_header {
    uint64_t offset;
    uint32_t size;
};
write(fd, net_request, sizeof(struct network_request_header));
Well, no.  sizeof(struct network_request_header) turns out to be 16, rather than, say, 12.  If you think about it, this makes perfect sense, because otherwise an array of these things would have unaligned uint64_t’s every other time.  You can’t do network I/O this way, especially if the program on the other end uses a different language or different compiler.
gcc, it turns out, has a feature, __attribute__((packed)), that makes this work, but it is not portable to other compilers.
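To make the padding visible, here is a small self-contained sketch, along with one portable fix: marshaling the fields by hand into a 12-byte wire buffer instead of trusting the struct layout. The field values are made up for the example.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

struct network_request_header {
        uint64_t offset;
        uint32_t size;
};   /* sizeof is 16 on most ABIs: 4 bytes of tail padding */

/* Marshal each field explicitly, in network byte order. */
static size_t pack_request(uint8_t buf[12],
                           const struct network_request_header *h)
{
        uint32_t hi = htonl((uint32_t)(h->offset >> 32));
        uint32_t lo = htonl((uint32_t)h->offset);
        uint32_t sz = htonl(h->size);
        memcpy(buf + 0, &hi, 4);
        memcpy(buf + 4, &lo, 4);
        memcpy(buf + 8, &sz, 4);
        return 12;   /* bytes to write(), regardless of sizeof(struct) */
}

int main(void)
{
        struct network_request_header h = { 4096, 512 };
        uint8_t wire[12];
        printf("sizeof = %zu, wire bytes = %zu\n",
               sizeof(struct network_request_header), pack_request(wire, &h));
        return 0;
}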

Home Networking Troubleshooting

Sometimes a technological scramble is triggered by the most mundane events.  In this case, the season finale of “X Factor”.
Last night, there was a special church choir rehearsal for the Christmas Eve services, and all seven of Win’s and my kids went. Since the rehearsal would overlap the broadcast finale of X Factor, Erica asked Win to record it. Maybe the appearance of One Direction had something to do with it as well.
We used to have Replay TVs to solve things like this, and cable TV to deliver the bits, but the conversion to digital TV and the crazy anti-customer behavior of Comcast has changed all that.  We don’t get cable, and the TV is hooked up to an antenna.  We’ve also got a Silicon Dust HDHomeRun network tuner connected to the antenna on my front porch, so we can watch TV on any computer as well.  Win has the copy of EyeTV that came with the HDHomeRun, and he planned to record the show.
About an hour before air time, he called to ask me about video artifacts and bad audio.   I said I’d take a look.
I used hdhomerunner (a now lost Google Code project to develop an open source HDHomeRun control program) and directed the video to VLC running on my Macbook Pro.  Indeed, the video was blocky and the audio spotty.
I power cycled the HDHomeRun, replaced the ethernet cable, and plugged it into a different switch port on the 16-port gigE switch.  No change.  I looked for firmware upgrades, and found the device running 4-year old firmware.  The upgrade went smoothly, but there was no change in video quality.
After sitting and swiveling back and forth for a while, I went back downstairs and plugged the device into the 100 Mbps switch instead of the 1000 Mbps switch.  I had some vague memory that the negotiation doesn’t always work right.  This fixed the problem and I was able to watch good video and audio with VLC.
Win called back to report his video was still breaking up.  This suggested some other networking problem between the houses.
Background: Win and I are neighbors, and we have a conduit between the houses with a couple of outdoor-rated Cat 5 cables and a 6-fiber multimode fiber. One pair of fibers is connected to 1000base-SX media converters at the two ends and plugged into the house gigE switches.
I remembered once setting up netperf on the home servers, and indeed it was still installed.  Win’s house to mine reported 918 Mbps, but mine to Win’s reported 16! At this point, there wasn’t much time to debug the networking, and X Factor was about to start.
I remembered that VLC can record an input video stream, and set that up to record the program on my Macbook. (I had 45 GB free on disk, and the program was running at 2 megabytes/second, so it would take 14 GB for the two hours. No doubt there is a way to transcode, but not enough time to learn how to do it!)
The VLC recording froze once, at about the one hour point, but I only missed a couple of minutes.  I copied the files to an external USB drive for sneakernet delivery.
This morning, Win and I started taking a look at the networking.  First, we got netperf running on our respective Macbook and iMacs, in order to figure out if the link was bad or one of the home servers.  I was able to talk both ways to my server at about 600 Mbps, and Win to his at about 95 Mbps.  Win’s results are explained by a fast Ethernet hop somewhere, but all these rates are way above the pitiful 16 Mbps across the fiber.
Next Win wiggled his connectors, dropping the path to about 6 Mbps.  We swapped the transmit and receive fibers at both ends, and the direction of the problem did not change.  It was looking more and more like a bad media converter.
I was staring at the wiring in my basement, wondering if we could use the copper link as backup while waiting for parts.  It never worked very well, but we did use it to cross connect our DMZs before the firewalls at one point.  I found the cable, and found it plugged into the ethernet switch on the back of my FIOS router – with LINK active!  Huh?  What was it plugged into at Win’s end?  He reported it plugged into a small switch, but that it wasn’t easy to tell what else was plugged in.
For an experiment, we unplugged the copper link and … Win lost Internet access. Evidently (a) his routes were set to use the Serissa business FIOS rather than his home Comcast, and (b) the traffic was going over this moldy waterlogged Cat 5 instead of our supposedly shiny gigabit fiber. Now the gears are turning. If we did have a loop in the switch topology, then it was entirely possible that one direction between the houses would use the fiber while the other direction would use the copper. I don’t know much about how these cheap switches figure out things like that. We tried unplugging the fiber, forcing all traffic onto copper, but the netperf results were much worse. ping seemed to work, and ping -s 1000 gave fairly good results, but ping -s 1500 had a lot of trouble. That would explain why, generally, ping and ssh seemed to work but netperf gave bad results.
We unplugged the copper and plugged the fiber back in, and after a few seconds, the asymmetrical performance resumed.  I’ve placed an order for another media converter, and we’ll see if that fixes it.  At least they now cost half as much as when we got the first pair!
So, there was a lot going on here.
The HDHomeRun was plugged into a gigabit switch, and working poorly. Changing to fast Ethernet fixed that.
The topology loop was routing off-site traffic over a poor copper link, but it was working well enough that we didn’t notice.
The media converter is probably bad, working well in one direction but not the other, and that probably explains the poor video quality.
And Erica gets to watch One Direction.
How are just plain folks supposed to figure this stuff out?
UPDATE
The new media converter arrived… and didn’t fix the problem. Well, we have a spare now! The actual problem was a bad 8-port switch in Win’s basement, which we belatedly figured out once we had ruled out the fiber. We could have tested the link standalone by plugging computers into both ends, but we didn’t think of it. Does gigE need crossover cables to do that? Or does the magic echo cancellation make crossover cables unnecessary?