Big Data

I propose a definition of Big Data.  Big Data is stuff that you cannot process within the MTBF of your tools.
Here’s the story about making a backup of a 1.1 Terabyte filesystem with several million files.
A few years ago, Win and I built a set of home servers out of mini-ATX motherboards with Atom processors and dual 1.5 Terabyte drives.  We built three, one for Win’s house, that serves as the compound IMAP server and such like, one for my house, which mostly has data and a duplicate DHCP server and such like, and one, called sector9, which has the master copy of the various open source SiCortex archives.
These machines are so dusty that it is no longer possible to run apt-get update, and so we’re planning to just reinstall more modern releases.  In order to do that, it is only prudent to have a couple of backups.
In the case of sector9, it has a pair of 1.5 T drives set up as RAID 1 (mirrored).  We also have a 1.5T drive in an external USB case as a backup device.  The original data is still on a 1T external drive, but with the addition of this and that, the size of sector9’s data had grown to 1.1T.
I decided to make a new backup.  We have a new Drobo5N NAS device, with 3 3T drives, set up for single redundancy, giving it 6T of storage.  Using 1.1T for this would be just fine.
There have been any number of problems.
Idea 1 – mount the Drobo on sector9 and use cp -a or rsync to copy the data
The Drobo supports only AFP (Apple Filesharing Protocol) and CIFS (Windows file sharing).  I could mount the Drobo on sector9 using Samba, except that sector9 doesn’t already have Samba, and apt-get won’t work due to the age of the thing.
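Had Samba been available on sector9, the idea was a CIFS mount like the following (the share name, mount point, and credentials here are hypothetical, just to show the shape of it):

```shell
# Mount a CIFS share from the Drobo so cp -a or rsync could write
# to it directly.  Share name and user are made up for illustration.
mkdir -p /mnt/drobo
mount -t cifs //drobo/backup /mnt/drobo -o user=admin,iocharset=utf8
```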
Idea 2 – mount the Drobo on my Macbook using AFP, and mount sector9 on the Macbook using NFS.
Weirdly, I had never installed the necessary packages on sector9 to export filesystems using NFS.
Idea 3 – mount the Drobo on my Macbook using AFP and use rsync to copy files from sector9.
This works, for a while.  The first attempt ran at about 3 MB/second, and copied about 700,000 files before hanging, for some unknown reason.  I got it unwedged somehow, but not trusting the state of everything, rebooted the Macbook before trying again.
The second time, rsync took a couple of hours to figure out where it was, and resumed copying, but only survived a little while longer before hanging again. The Drobo became completely unresponsive.  Turning it off and on did not fix it.
I called Drobo tech support, and they were knowledgeable and helpful.  After a long sequence of steps, involving unplugging the drives and restarting the Drobo without the mSata SSD plugged in, we were able to telnet to its management port, but the Drobo Desktop management application still didn’t work. That was in turn resolved by uninstalling and reinstalling Drobo Desktop (on a Mac! Isn’t this disease limited to PCs?)
At this point, Drobo tech support asked me to use the Drobo Desktop feature to download the Drobo diagnostic logs and send them in… but the diagnostic log download hung.  Since the Drobo was otherwise operational, we didn’t pursue it at the time.  (A week later, I got a followup email asking me if I was still having trouble, and this time the diagnostic download worked, but the logs didn’t show any reason for the original hang.)
By the way, while talking to Drobo tech support, I discovered a wealth of websites that offer extra plugins for Drobos (which run some variant of linux or bsd).  They include an nfs server, but using it kind of voids your tech support, so I didn’t.
A third attempt to use rsync ran for a while before mysteriously failing as well.  It was clear to me that while rsync will synchronize two filesystems, it might never finish if every run has to recheck its work from the beginning and doesn’t survive long enough to reach new territory.
I was also growing nervous about the second problem with the Drobo: it uses NTFS, not a linux filesystem.  As such, it was not setting directory dates, and rsync was spitting warnings about symbolic links.  Symbolic links are supposed to work on the Drobo.  In fact, I could use ln -s in a Macbook shell just fine, but what shows up in a directory listing is subtly different from what shows up in a small rsync of linux symbolic links.
Idea 4:  Mount the drobo on qadgop (my other server, which does happen to have Samba installed) and use rsync.
This again failed to work for symbolic links, and a variety of attempts to change the linux smb.conf file in ways suggested by the Internet didn’t fix it.  There were suggestions to root the Drobo and edit its configuration files, but again, that made me nervous.
At this point, my problems are twofold:

  • How to move the bits to the Drobo
  • How to convince myself that any eventual backup was actually correct.

I decided to create some end-to-end check data, by using find and md5sum to create a file of file checksums.
First, I got to wondering how healthy the disk drives on sector9 actually were, so I decided to try SMART. Naturally, the SMART tools for linux were not installed on sector9, but I was able to download the tarball and compile them from source.  Alarmingly, SMART told me that, for various reasons I didn’t understand, both internal drives were likely to fail within 24 hours.  It said the external USB drive was fine.  Did it really hold a current backup?  The date on the masking tape on the drive said 5/2012 or something, about a year old.
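For reference, this is roughly what the checks look like with smartmontools (the device names are illustrative):

```shell
# Overall SMART health verdict for each internal drive.
smartctl -H /dev/sda
smartctl -H /dev/sdb
# Full attribute table: reallocated sectors, pending sectors,
# power-on hours, and so on.
smartctl -A /dev/sda
# USB enclosures often need the device type spelled out.
smartctl -H -d sat /dev/sdc
```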
I started find jobs running on both the internal drives and the external:

find . -type f -exec md5sum {} \; > s9.md5
find . -type f -exec md5sum {} \; > s9backup.md5

These jobs actually ran to completion in about 24 hours each.  I now had two files, like this:

root@sector9:~# ls -l *.md5
-rw-r--r-- 1 root root 457871770 2013-07-08 01:24 s9backup.md5
-rw-r--r-- 1 root root 457871770 2013-07-07 21:39 s9.md5
root@sector9:~# wc s9.md5
3405297 6811036 457871770 s9.md5

This was encouraging: the files were the same length.  But diffing 450 MB files is not for the faint of heart, especially since find doesn’t enumerate them in the same order.  I had to sort each file, then diff the sorted files.  This took a while, but in fact the sector9 filesystem and its backup were identical.  I resolved to use this technique to check any eventual Drobo backup.  It also relieved my worries that the internal drives might fail at any moment.  I also learned that the sector9 filesystem had about 3.4 million files on it.
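The comparison itself is just sort and diff; a sketch of the check, assuming the two manifest files from the find runs above:

```shell
# find emits files in directory order, which differs between the two
# trees, so sort both manifests before diffing.
sort s9.md5       > s9.sorted.md5
sort s9backup.md5 > s9backup.sorted.md5
# No diff output (exit status 0) means every file has the same checksum.
diff s9.sorted.md5 s9backup.sorted.md5 && echo "trees identical"
```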
Idea 5: Create a container file on the Drobo, with an ext2 filesystem inside, and use that to hold the files.
This would solve the problem of putting symbolic links on the Drobo filesystem (even though it is supposed to work!) It would also fix the problem of NTFS not supporting directory timestamps or linux special files.  I was pretty sure there would be root filesystem images in the sector9 data for the SiCortex machine and for its embedded processors, and I would need special files.
But how to create the container file? I wanted a 1.2 Terabyte filesystem, slightly bigger than the actual data used on sector9.
According to the InterWebs, you use dd(1), like this:
dd if=/dev/zero of=container.file bs=1M seek=1153433 count=0
I tried it:
dd if=/dev/zero of=container.file bs=1M seek=1153433
It seemed to take a long time, so I thought probably it was creating a real file, instead of a sparse file, and went to bed.
The next morning it was still running.
That afternoon, I began getting emails from the Drobo that I should add more drives, as it was nearly full, then actually full.  Oops. I had left off the count=0.
Luckily, deleting a 5 Terabyte file is much faster than creating one!  I tried again, and the dd command with count=0 ran very quickly.
I thought that MacOS could create the filesystem, but I couldn’t figure out how.  I am not sure that MacOS even has something like the linux loop device, and I couldn’t figure out how to get DiskUtility to create a unix filesystem in an image file.
I mounted the Drobo on qadgop, using Samba, and then used the linux loop device to give device level access to the container file, and I was able to mkfs an ext2 filesystem on it.
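The sequence on qadgop was roughly this (loop device and paths are illustrative):

```shell
# Attach the container file on the mounted Drobo share to a loop
# device, then put an ext2 filesystem inside it and mount it.
losetup /dev/loop0 /mnt/drobo/container.file
mkfs -t ext2 /dev/loop0
mkdir -p /mnt/container
mount /dev/loop0 /mnt/container
```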
Idea 6: Mount the container file on MacOS and use rsync to write files into it.
I couldn’t figure out how to mount it!  Again, MacOS seems to lack the loop device.  I tried using DiskUtility to pretend my container file was a DVD image, but it seems to have hardwired the notion that DVDs must have ISO filesystems.
Idea 7: Mount the Drobo on linux, loop mount the container, USB mount the sector9 backup drive.
This worked, sort of.  I was able to use rsync to copy a million files or so before rsync died.  Restarting it got substantially further, and a third run appeared to finish.
The series of rsyncs took several days to run.  Sometimes they would run at about 3 MB/s, and sometimes at about 7 MB/sec.  No idea why.  The Drobo will accept data at 11 MB/sec using AFP, so perhaps this was due to slow performance of the USB drive.  The whole copy took on the order of 100 hours of transfer time: 1.1 T at 3 MB/sec works out to about four days.
Unfortunately, df said the container filesystem was 100% full, and the final rsync had errors “previously reported” that had scrolled off the screen. I am pretty sure the 100% is a red herring, because linux likes to reserve a chunk of space (5 to 10%) for root, and the container file was sized so the data would fill more than 90% of it.
I reran the rsync, under a script(1) to get a log file, and found many errors of the form “can’t mkdir <something or other>”.
Next, I tried mkdir by hand, and it hung.  Oops.  ps said it was stalled in state D, which I know to be disk wait.  In other words, the ext2 filesystem was damaged.  By use of kill -9 and waiting, I was able to unmount the loop device and the Drobo, and remount the Drobo.
Next, I tried using fsck to check the container filesystem image.
fsck takes hours to check a 1.2T filesystem.  Eventually, it started asking me about random problems and whether I would authorize it to fix them.  After typing “y” a few hundred times, I gave up, killed the fsck, and restarted it as fsck -p to fix problems automatically.  Recall that I don’t actually care if it is perfect, because I can rerun rsync and check the final results using my md5 checksum data.
The second attempt to run fsck didn’t work either:
root@qadgop:~# fsck -a /dev/loop0
fsck 1.41.4 (27-Jan-2009)
/dev/loop0 contains a file system with errors, check forced.
/dev/loop0: Directory inode 54583727, block 0, offset 0: directory corrupted

/dev/loop0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
Hoping that the fsck -a had fixed most of the problems, I ran it a third time, this time without -a, but I wound up typing ‘y’ a few hundred more times.  fsck took about 300 minutes of CPU time on the Atom to do this work and left 37 MB worth of files and directories in /lost+found.
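In hindsight, the hundreds of ‘y’ keystrokes can be avoided: fsck has a -y option that answers yes to every repair prompt, unlike -p/-a, which give up on anything they consider unsafe to fix automatically.

```shell
# Answer "yes" to every repair question; noisier than -p, but it
# keeps going where -p/-a abort with UNEXPECTED INCONSISTENCY.
fsck -y /dev/loop0
```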
With the container filesystem repaired, I started a fourth rsync, which actually finished, transferring another 93 MB.
Next step – are the files really all there and all the same?  I’ll run the find -exec md5sum to find out.
Um.  Well.  What does this mean?
root@qadgop:~# wc drobos9.md5 s9.md5
3526171 7052801 503407914 drobos9.md5
3405297 6811036 457871770 s9.md5
The target has 3.5 million files, while the source has 3.4 million files!  That doesn’t seem right.  An hour of running “du” and comparing the top few levels of directories shows that while rerunning rsync to finish interrupted copies works, you really have to use the same command lines.  I had what appeared to be a complete copy one level below a partial copy.  After deleting the extra directories, and using fgrep and sed to rewrite the path names in the file of checksums, I was finally able to do a diff of the sorted md5sum files:
Out of 3.4 million files, there were 8 items like this:
1402503c1402502
< 51664d59ab77b53254b0f22fb8fdb3a8 ./sicortex-archive/stash/97/sha1_97e18c8e2261b09e21b0febd75f61635d7631662_64088060.bin
---
> 127cc574dcb262f4e9e13f9e1363944e ./sicortex-archive/stash/97/sha1_97e18c8e2261b09e21b0febd75f61635d7631662_64088060.bin
and one like this:
> 8d9364556a7891de1c9a9352e8306476  ./downloads.sicortex.com/dreamhost/ftp.downloads.sicortex.com/ISO/V3.1/.SiCortex_Linux_V3.1_S_Disk1_of_2.iso.eNLXKu
The second one is easier to explain: it is a partially completed rsync, so I deleted it.  The other 8 appear to be files that were copied incorrectly!  I should have checked the lengths, because these could be copies that failed due to running out of space, but I just reran rsync on those 8 files in --checksum mode.
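A sketch of that targeted re-copy, assuming the 8 bad paths were collected into a file (badfiles.txt and the mount points are hypothetical names):

```shell
# --checksum forces rsync to compare file contents rather than
# trusting size and mtime, so silently corrupted copies get redone.
rsync -a --checksum --files-from=badfiles.txt /mnt/s9backup/ /mnt/container/
```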
Final result: 1.1 Terabytes and 3.4 million files copied.  Elapsed time, about a month.
What did I learn?

  • Drobo seems like a good idea, but systems that ever need tech support intervention make me nervous.  My remaining worry about it is proprietary hardware.  I don’t have the PC ecosystem to supply spare parts.  Perhaps the right idea is to get two.
Use linux filesystems to hold linux files.  It isn’t just filenames that vary only in capitalization; it is also the need to hold special files and symlinks. Container files and loop mounting work fine.
  • Keep machines updated. We let these get so far behind that we could no longer install new packages.
  • A meta-rsync would be nice, that could use auxiliary data to manage restarts.
  • Filesystems really should have end-to-end checksums.  ZFS and BTRFS seem like good ideas.
SMB, or CIFS, or the Drobo, or AFP are not good at metadata operations; writing large numbers of individual files on the Drobo was a fail no matter how I tried it.  SMB read/write access to a single big file, though, seems to be perfectly reliable.
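Absent the meta-rsync wished for above, a dumb restart loop gets most of the way there; a sketch, with illustrative paths:

```shell
# Keep restarting rsync until one run finishes cleanly.  --partial
# preserves partially transferred files across restarts, so each
# run makes forward progress instead of redoing interrupted files.
until rsync -a --partial /mnt/s9backup/ /mnt/container/; do
    echo "rsync exited $?; retrying in 60 seconds" >&2
    sleep 60
done
```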

 

wget

I am struggling here to decide whether the Bradley Manning prosecutors are disingenuous or just stupid.
I am reacting here to Cory Doctorow’s report, “Notes from the Ducking Stool,” that the government’s lawyers accuse Manning of using that criminal spy tool wget.
I am hoping for stupid, because if they are suggesting to the jury facts they know not to be true, then that is a violation of ethics, their oaths of office, and any concept of justice.
Oh, and wget is exactly what I used, the last time I downloaded files from the NSA.
Really.
A while back, the back issues of the NSA internal newsletter Cryptolog were declassified so I downloaded the complete set.  I think the kids are puzzled about why I never mind having to wait in the car for them to finish something or other, but it is because I am never without a large collection of fascinating stuff.
Here’s how I got them, after scraping the URLs out of the agency’s HTML:
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_01.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_02.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_03.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_04.pdf

. . .
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_132.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_133.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_134.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_135.pdf
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_136.pdf
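The whole list can be generated with a loop; the %02d format matches the two-digit-minimum numbering in the filenames above:

```shell
# Fetch all 136 issues: cryptolog_01.pdf ... cryptolog_136.pdf.
for i in $(seq 1 136); do
    wget "http://www.nsa.gov/public_info/_files/cryptologs/$(printf 'cryptolog_%02d.pdf' "$i")"
done
```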

Email Disaster Recovery and Travel adventures

Cathy is off to China for a few weeks. She wanted email access, but not with her usual laptop.
She uses Windows Vista on a plasticky HP laptop from, well, the Vista era.  It is quite heavy, and these days quite flaky.  It has a tendency to shut down, not for any obvious reason, other than maybe age, and being Vista running on a plasticky HP laptop.
I set up the iPad, but Cathy wanted a more familiar experience, and needed IE in order to talk to a secure webmail site, so we dusted off an Asus EEE netbook running Windows XP.
I spent a few hours trying to clear off several years of accumulated crapware such as three different search toolbars attached to Internet Explorer, then gave up and re-installed XP from the recovery partition.  123 Windows Updates later, it seemed fine, but still wouldn’t talk to the webmail site.  It turns out that Asus thoughtfully installed the open source local proxy server Privoxy, with no way to uninstall it.  If you run the Privoxy uninstall, it leaves you with no web access at all.  I finally found Interwebs advice to also uninstall the Asus parental controls software, and that fixed it.
Next, I installed Thunderbird, and set it up to work with Cathy’s account on the family compound IMAP server.  I wanted it to work offline, in case of spotty WiFi access in China, so after setting that up, I “unsubscribed” from most of the IMAP folders and let it download.  Now Cathy’s inbox has 34,000 messages in it, and I got to thinking “what about privacy?”  After all, governments, especially the United States, claim the right to search all electronic devices at the border, and it is also commonly understood that any electronic device you bring to China can be pwned before you come back.
Then I found a setting that tells Thunderbird to download only the last so many days of mail for offline use.  Great!  But it had already downloaded all 6 years of back traffic.  Adjacent to that setting is another: “delete mail more than 20 days (or whatever) old.”
You know what happens next!  I turned that on, and Thunderbird started deleting all Cathy’s mail, both locally and on the server.  Now there is (farther down the page), fine print that explains this will happen, but I didn’t read it.
Parenthetically, this is an awful design.  It really looks like a control associated with how much mail to keep for offline use, but it is not.  It is a dangerous, unguarded, unconfirmed command that does irreversible damage.
I thought this was taking too long, but by the time I figured it out, it was way too late.
So, how to recover?
I have been keeping parallel feeds of Cathy’s email, but only since March or so, since I’ve been experimenting with various spam suppression schemes.
I had made a copy of Cathy’s .maildir on the server, but it was from 2011.
But wait! Cathy’s laptop was configured for offline use, and had been turned off.  Yes!  I opened the lid and turned off WiFi as quickly as possible, before it had a chance to sync.  (Actually, the HP has a mechanical switch to turn off WiFi, but I didn’t know that.)  I then changed the username/password on her laptop Thunderbird to stop further syncing.
Next, since the horse was well out of the barn, I made a snapshot of the server .maildir, and of the HP laptop’s Thunderbird profile directories. Now, whatever I did, right or wrong, I could do again.
Time for research!
What I wanted to do seemed clear:  restore the off-line saved copies of the mail from the HP laptop to the IMAP server.  This is not a well-travelled path, but there is some online advice:
http://www.fahim-kawsar.net/blog/2011/01/09/gmail-disaster-recovery-syncing-mail-app-to-gmail-imap/
https://support.mozillamessaging.com/en-US/kb/profiles
The general idea is:

  1. Disconnect from the network
  2. Make copies of everything
  3. While running in offline mode, copy messages from the cached IMAP folders to “Local” folders
  4. Reconnect to the network and sync with the server. This will destroy the cached IMAP folders, but not the new Local copies
  5. Copy from the Local folders back to the IMAP server folders

Seems simple, but in my case, there were any number of issues:

  • Not all server folders were “subscribed” by Thunderbird, and I didn’t know which ones were
  • The deletion was interrupted at some point
  • I didn’t want duplicated messages after recovery
  • INBOX was 10.3 GB (!)
  • The Thunderbird profile altogether was 23 GB (!)
  • The HP laptop was flakey
  • Cathy’s about to leave town, and needs last minute access to working email

One thing at a time.
Tools
I found out about  “MozBackup” and used it to create a backup copy of the HP laptop’s profile directory.
MozBackup
MozBackup creates a zip file of the contents of a Thunderbird profile directory, and can restore it to a different profile on a different computer, making configuration changes as appropriate. This is much better than hand-editing the various Thunderbird configuration files.
Hardware problems
As I mentioned, the HP laptop is sort of flakey.  I succeeded in copying the Thunderbird profile directory, but 23 GB worth of copying takes a long time on a single 5400 rpm laptop disk.  I tried copying to a Mybook NAS device, but it was even slower.  What eventually worked, not well, but adequately, was copying to a 250GB USB drive.
I decided to leave the HP out of it, and to do the recovery on the netbook, the only other Windows box available.  I was able to create a second profile on the netbook, and restore the saved profile to it, slowly, but I realized Cathy would leave town before I finished all the steps, taking the netbook with her.  Back to the HP.
First I tried just copying the mbox and .msf files from the IMAPMail subfolder to Local Folders. This seemed to work, but Thunderbird got very confused about it.  It said there were 114,000 messages in Inbox, rather than 34,000.  This shortcut is a dead end.
I created a new profile on the HP, and restored the backup using MozBackup (which took 2 hours), and started it in offline mode.  I then tried to “select-all” in Inbox to copy them to a local folder.  Um.  No.  I couldn’t even get control back.  Thunderbird really cannot select 34000 messages and do anything.
Because I was uncertain about the state of the data, I restored the backup again (another 2 hours).
This time, I decided to break up Inbox into year folders, each holding about 7,000 messages.  The first one worked, but then the HP did an unexpected shutdown during the second, and when it came back, Inbox was empty! The Inbox mbox file had been deleted.
I did another restore, and managed to create backup files for 2012 and 2011 messages, before it crashed again. (And Inbox was gone AGAIN)
The technique seemed likely to eventually work, but it would drive me crazy.  Or crazier.
I was now accumulating saved Local Folder files representing 3 specific years of Inbox.  I still had to finish the rest, deal with Sent Mail, and audit about 50 other subfolders to see if they needed to be recovered.
I wasn’t too worried about all the archived subfolders, since they hadn’t changed in ages and were well represented by my 2011 copy of Cathy’s server .maildir.
Digression
What about server backups?  Embarrassing story here!  Back in 2009, Win and I built some nice mini-ATX Atom-based servers with dual 1.5T disks run in mirrored mode for home servers.  Win’s machine runs the IMAP server, and mine mostly has data storage.  Each machine has the mirrored disks for reliability and a 1.5T USB drive for backup.  The backups are irregularly kept up to date, and in the IMAP machine’s case, not recently.
About 6 months ago, I got a family pack of CrashPlan for cloud backup, and I use it for my Macbook and for my (non IMAP) server, but we had never gotten around to setting up CrashPlan for either Cathy’s laptop or the IMAP server.
A few months ago, we got a Drobo 5N, and set it up with 3 3T disks, for 6T usable storage, but we haven’t gotten it working for backup either.  (I am writing another post about that.)
So, no useful server backups for Cathy’s mail.
Well now what?
I have a nice Macbook Pro; unfortunately, the 500 GB SSD has 470 GB of data, not enough free space for one copy of Cathy’s cached mail, let alone two.  I thought about freeing up space, and copied a 160 GB Aperture photo library to two other systems, but it made me nervous to delete it from the Macbook.
I then tried using Mac Thunderbird to set up a profile on that 250 GB external USB drive, but it didn’t work: the FAT filesystem couldn’t handle Mac Thunderbird’s need for fancy filesystem features like ACLs.  But this triggered an idea!
First, I was nervous about using Mac Thunderbird to work on backup data from a PC. I know that Thunderbird profile directories are supposed to be cross-platform, but the config files like profile.ini and prefs.js are littered with PC pathnames.
Second, the USB drive is slow, even if it worked.
Up until recently, I’ve been using a 500 GB external Firewire drive for TimeMachine backups of the Macbook.  It still was full of Time Machine data, but I’ve switched to using a 1T partition on the Drobo for TimeMachine.  I also have the CrashPlan backup.  So I reformatted the Firewire Drive to HFS, and plugged it in as extra storage.
Also on the Macbook, is VMWare Fusion, and one of my VMs is a 25 GB instance running XP Pro.
I realized I should be able to move the VM to the Firewire drive, and expand its storage by another 50 GB or so to have room to work on the 23 GB Thunderbird data.
To the Bat Cave!
It turns out to be straightforward to copy a VMWare image to another place, and then run the copy.  Rather than expand the 25GB primary disk, I just added a second virtual drive and used XP Disk management to format it as drive E.  I also used VMWare sharing to share access to the underlying Mac filesystem on the Firewire drive.

  1. Copy VMWare image of XP to the Firewire drive
  2. Copy MozBackup save file of the cached IMAP data and the various Local Files folders to the drive
  3. Create second disk image for XP
  4. Run XP under VMWare Fusion on the Macbook, using the Firewire drive for backing store
  5. Install Thunderbird and MozBackup
  6. Use Mozbackup to restore Cathy’s cached local copies of her mail from the flakey HP laptop
  7. Copy the Local Files mailbox files for 2013, 2012, and 2011 into place.
  8. Use XP Thunderbird running under VMWare to copy the rest of the cached IMAP data into Local Folders.
  9. By hand, compare message counts of all 50 or so other IMAP folders in the cached copy with those still on the server, and determine they were still correct.
  10. Go online, letting Thunderbird sync with the server, deleting all the locally cached IMAP data.
  11. Create IMAP folders for 2007 through 2013, plus Sent Mail and copy the roughly 40000 emails back to the server.

Notes
During all of this, new mail continued to arrive into the IMAP server, and be accessible by the instance of Thunderbird on the netbook.
A copy of Cloudmark Desktop One was running on the Macbook, using Mac Thunderbird to do spam processing of arriving email in Cathy’s IMAP account.
My psyche is scarred, but I did manage to recover from a monstrous mistake.
Lessons

  • RAID IS NOT BACKUP

The IMAP server was reliable, but it didn’t have backups that were useful for recovery.

  • Don’t think you understand what a complex email client is going to do

Don’t experiment with the only copy of something!  I should have made a copy of the IMAP .maildir in a new account, and then futzed with the netbook thunderbird to get the offline use storage the way I wanted.

  • Quantity has a quality all its own.

This quote is usually about massive armies, but in this case, the very large mail store (23 GB) made the simplest operations slow, and some things (like selecting all in a folder with 34,000 messages) impossible.  I had to go through a lot of extra work because various machines didn’t have enough free storage, and had other headaches because the MTBF of the HP laptop was less than the time to complete tasks.
-Larry

Hypervisor Hijinks

At my office, we have a rack full of Tilera 64-core servers, 120 of them. We use them for some interesting video processing applications, but that is beside the point. Having 7680 of something running can magnify small failure rates to the point that they are worth tracking down. Something that might take a year of runtime to show up can show up once an hour on a system like this.
Some of the things we see tend, with some slight statistical flavor, to occur more frequently on some nodes than on others. That just might make you think that we have some bad hardware. Could be. We got to wondering whether running the systems at slightly higher core voltages would make a difference, and indeed, one can configure such a thing, but basically you have to reprogram the flash bootloaders on 120 nodes. The easiest thing to do was to change both the frequency and the voltage, which isn’t the best thing to do, but it was easy. The net effect was to reduce the number of already infrequent faults on the nodes where they occurred, but to cause, maybe, a different sort of infrequent fault on a different set of nodes.
Yow. That is NOT what we wanted.
We were talking about this, and I said about the stupidest thing I’ve said in a long time. It was, approximately:

I think I can add some new hypervisor calls that will let us change the core voltage and clock frequency from user mode.

This is just a little like rewiring the engines of an airplane while flying, but if it were possible, we could explore the infrequent fault landscape much more quickly.
But really, how hard could it be?
Tilera, to their great credit, supplies a complete Multicore Development Environment which includes the linux kernel sources and the hypervisor sources.
The Tilera version of Linux has a fairly stock kernel which runs on top of a hypervisor that manages physical chip resources and such things as TLB refills. There is also a hypervisor “public” API, which is really not that public: it is available only to the OS kernel. The Tilera chip has 4 protection rings. The hypervisor runs in kernel mode. The OS runs in supervisor mode, and user programs can run in the other two. The hypervisor API has things like load this page table context, or flush this TLB entry, and so forth.
As part of the boot sequence, one of the things the hypervisor does is to set the core voltage and clock frequency according to a little table it has. The voltage and frequency are set together, and the controls are not accessible to the Linux kernel or to applications. Now it is obviously possible to change the values while running, because that is what the boot code does. What I needed to do was to add some code to the hypervisor to get and set the voltage and frequency separately, while paying attention to the rules implicit in the table. There are minimum and maximum voltages and frequencies beyond which the chip will stop working, and there are likely values that will cause permanent damage. There is also a relation between the two – generally higher frequencies will require higher voltages. Consequently it is not OK to set the frequency too high for the current voltage, or to set the voltage too low for the current frequency.
Fine. Now I have subroutine calls inside the hypervisor. In order to make them available to a user mode program running under Linux, I have to add hypervisor calls for the new functions, and then add something like a loadable kernel module to Linux to call them and to make the functionality available to user programs.
The kernel piece is sort of straightforward. One can write a loadable kernel module that implements something called sysfs. These are little text files in a directory like /sys/kernel/tilera/ with names like “frequency” and “voltage”. Through the magic of sysfs, when an application writes a text string into one of these files, a piece of code in the kernel module gets called with the string. When an application reads one of these files, the kernel module gets called to provide the text.
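From a shell, the interface gets used like this (the exact paths and units are my sketch of such an interface, not Tilera’s actual API):

```shell
# Read the current settings exposed by the (hypothetical) module.
cat /sys/kernel/tilera/frequency
cat /sys/kernel/tilera/voltage
# Request new values; the module is expected to validate them against
# the voltage/frequency table before touching the hardware.
echo 866000 > /sys/kernel/tilera/frequency
echo 1100   > /sys/kernel/tilera/voltage
```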
Now, with the kernel module at the top, and the new subroutines in the hypervisor at the bottom, all I need to do is wire them together by adding new hypervisor calls.
Hypervisor calls made by linux go through hypervisor glue. The glue area starts at 0x10000 above the base of the text area, and each possible call has 0x20 bytes of instructions available.
Sometimes, as with “nanosleep”, the call is implemented inline in those 0x20 bytes. Mostly, the code in the glue area loads a register with a call number and does a software interrupt.
The code that builds the glue area is hv/tilepro/glue.S.
For example, the nanosleep code is

hv_nanosleep:
        /* Each time through the loop we consume three cycles and
         * therefore four nanoseconds, assuming a 750 MHz clock rate.
         *
         * TODO: reading a slow SPR would be the lowest-power way
         * to stall for a finite length of time, but the exact delay
         * for each SPR is not yet finalized.
         */
        {
          sadb_u r1, r0, r0
          addi r0, r0, -4
        }
        {
          add r1, r1, r1 /* force a stall */
          bgzt r0, hv_nanosleep
        }
        jrp lr
        fnop
while most others are

GENERIC_SWINT2(set_caching)
or the like, where GENERIC_SWINT2 is a macro:

#define GENERIC_SWINT2(name) \
        .align ALIGN ;\
hv_##name: \
        moveli TREG_SYSCALL_NR_NAME, HV_SYS_##name ;\
        swint2 ;\
        jrp lr ;\
        fnop
The glue.S source code is written in a positional way, like
GENERIC_SWINT2(get_rtc)
GENERIC_SWINT2(set_rtc)
GENERIC_SWINT2(flush_asid)
GENERIC_SWINT2(flush_page)

So the actual address of the linkage area for a particular call like flush_page depends on the exact sequence of items in glue.S. If you get them out of order or leave a hole, the linkage addresses of everything later will be wrong. So to add a hypercall, you add items immediately after the last GENERIC_SWINT2 or ILLEGAL_SWINT2.
In the case of the set_voltage calls we have:

ILLEGAL_SWINT2(get_ipi_pte)
GENERIC_SWINT2(get_voltage)
GENERIC_SWINT2(set_voltage)
GENERIC_SWINT2(get_frequency)
GENERIC_SWINT2(set_frequency)

With this fixed point, we work in both directions: down into the hypervisor to add the call, and up into Linux to add something to call it.
Looking back at the GENERIC_SWINT2 macro, it loads a register with the value of a symbol like HV_SYS_##name, where name is the argument to GENERIC_SWINT2. This uses the C preprocessor token-pasting operator ##, which concatenates tokens. So

GENERIC_SWINT2(get_voltage)

expects a symbol named HV_SYS_get_voltage. IMPORTANT NOTE – the value of this symbol has nothing to do with the hypervisor linkage area, it is only used in the swint2 implementation. The HV_SYS_xxx symbols are defined in hv/tilepro/syscall.h and are used by glue.S to build the code in the hypervisor linkage area and also used by hv/tilepro/intvec.S to build the swint2 handler.
In hv/tilepro/intvec.S we have things like

        syscall HV_SYS_flush_all,     syscall_flush_all
        syscall HV_SYS_get_voltage,   syscall_get_voltage

in an area called the syscall_table with the comment

// System call table.  Note that the entries must be ordered by their
// system call numbers (as defined in syscall.h), but it's OK if some numbers
// are skipped, or if some syscalls exist but aren't present in the table.

where syscall is a Tilera assembler macro:

.macro  syscall number routine
      .org    syscall_table + ((number) * 4)
      .word   routine
      .endm

And indeed, the use of .org makes sure that the offset of the entry in the syscall table matches the syscall number. The second argument is the symbol, elsewhere in the hypervisor sources, of the code that implements the function.
In the case of syscall_get_voltage, the code is in hv/tilepro/hw_config.c:

int syscall_get_voltage(void)
{
  return(whatever);
}

So at this point, if something in the linux kernel manages to transfer control to text + 0x10000 + whatever the offset of the code in glue.S is, then a swint2 with argument HV_SYS_get_voltage will be made, which will transfer control in hypervisor mode to the swint2 handler, which will make a function call to syscall_get_voltage in the hypervisor.
But what is the offset in glue.S?
It is whatever you get incrementally by assembling glue.S, but in practice it had better match the values given in the “public hypervisor interface” defined in hv/include/hv/hypervisor.h.
hv/include/hv/hypervisor.h has things like

/** hv_flush_all */
#define HV_DISPATCH_FLUSH_ALL                     55
#if CHIP_HAS_IPI()
/** hv_get_ipi_pte */
#define HV_DISPATCH_GET_IPI_PTE                   56
#endif
/* added by QRC */
/** hv_get_voltage */
#define HV_DISPATCH_GET_VOLTAGE               57

and these numbers are similar to, but not identical to, those in syscall.h. Do not confuse them!
Once you add the entries to hypervisor.h, it is a good idea to check them against what is actually in the glue.o file. You can use tile-objdump for this:

tile-objdump -D glue.o

which generates:

...
00000700 <hv_get_ipi_pte>:
     700:	1fe6b7e070165000	{ moveli r0, -810 }
     708:	081606e070165000 	{ jrp lr }
     710:	400b880070166000 	{ nop ; nop }
     718:	400b880070166000 	{ nop ; nop }
00000720 <hv_get_voltage>:
     720:	1801d7e570165000	{ moveli r10, 58 }
     728:	400ba00070166000 	{ swint2 }
     730:	081606e070165000 	{ jrp lr }
     738:	400b280070165000 	{ fnop }
...

If you divide hex 720 by hex 20, you get the linkage number. I use bc for this sort of mixed-base calculating:

stewart$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
ibase=16
720/20
57
^Dstewart$

and we see that we got it right: the linkage number for get_voltage is indeed 57.
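If bc isn't handy, POSIX shell arithmetic understands C-style hex constants and gives the same answer:

```shell
# 0x720 is hv_get_voltage's offset in the glue area; each slot is 0x20 bytes.
printf '%d\n' $((0x720 / 0x20))   # prints 57
```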
Now let’s turn to Linux. The architecture dependent stuff for Tilera is in src/sys/linux/arch/tile.
The idea is to build a kernel module that will implement a sysfs interface to the new voltage and frequency calls.
The module get and set routines will call hv_set_voltage and hv_get_voltage.
The hypervisor call linkage is done by linker magic, via a file arch/tile/kernel/hvglue.lds, which is a linker script. In other words, the kernel has no definitions for these hv_ symbols; they are defined at link time by the linker script. For each hv call, it has a line like

hv_get_voltage = TEXT_OFFSET + 0x10720;

and you will recognize our friend 0x720 as the offset of this call in the hypervisor linkage area. Unfortunately, this doesn’t help with a separately compiled module, because a module has no way to use such a script (when I tried it, TEXT_OFFSET was undefined; presumably it is defined by the kernel’s main linker script).
So to make a hypervisor call from a loadable module, you need a trampoline. I put them in arch/tile/kernel/qrc_extra.c, like this

int qrc_hv_get_voltage(void)
{
  int v;
  printk("Calling hv_get_voltage()\n");
  v = hv_get_voltage();
  printk("hv_get_voltage returned %d\n", v);
  return(v);
}
EXPORT_SYMBOL(qrc_hv_get_voltage);

The EXPORT_SYMBOL is needed to let modules use the function.
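With the trampolines exported, the sysfs module itself can be quite small. The sketch below uses the standard kobject/sysfs API to create /sys/kernel/tilera/voltage; it is my reconstruction, not the actual module, and it assumes the voltage is passed around as a plain int.

```c
/* Sketch of the sysfs module. qrc_hv_get_voltage/qrc_hv_set_voltage are
 * the trampolines from qrc_extra.c; in a real build these externs would
 * come from a header. */
#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

extern int qrc_hv_get_voltage(void);
extern int qrc_hv_set_voltage(int v);

static struct kobject *tilera_kobj;

/* Called when an application reads /sys/kernel/tilera/voltage. */
static ssize_t voltage_show(struct kobject *kobj,
                            struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", qrc_hv_get_voltage());
}

/* Called when an application writes a text string into the file. */
static ssize_t voltage_store(struct kobject *kobj,
                             struct kobj_attribute *attr,
                             const char *buf, size_t count)
{
        int v;

        if (kstrtoint(buf, 0, &v))
                return -EINVAL;
        qrc_hv_set_voltage(v);
        return count;
}

static struct kobj_attribute voltage_attr =
        __ATTR(voltage, 0644, voltage_show, voltage_store);

static int __init tilera_sysfs_init(void)
{
        /* Creates the /sys/kernel/tilera/ directory. */
        tilera_kobj = kobject_create_and_add("tilera", kernel_kobj);
        if (!tilera_kobj)
                return -ENOMEM;
        return sysfs_create_file(tilera_kobj, &voltage_attr.attr);
}

static void __exit tilera_sysfs_exit(void)
{
        kobject_put(tilera_kobj);
}

module_init(tilera_sysfs_init);
module_exit(tilera_sysfs_exit);
MODULE_LICENSE("GPL");
```

A “frequency” file would be a second kobj_attribute of exactly the same shape.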
But where did hvglue.lds come from? It turns out it is not in any Makefile; rather, it is made by a perl script, sys/hv/mkgluesyms.pl, except that this script optionally writes assembler or linker script output, and I had to modify it to select the right branch. The modified version is mkgluesymslinux.pl and is invoked like this:

perl ../../hv/mkgluesymslinux.pl ../../hv/include/hv/hypervisor.h >hvglue.lds

The hints for this come from the sys/bogux/Makefile which does something similar for the bogux example supervisor.
linux/arch/tile/include/hv/hypervisor.h is a near copy of sys/hv/include/hv/hypervisor.h, but they are not automatically kept in sync.
Somehow I think that adding hypervisor calls is not a frequently exercised path.
To recap, you need to:

  • have the crazy idea to add hypervisor calls to change the chip voltage at runtime
  • edit hypervisor.h to choose the next available hv call number
  • edit glue.S to add, in just the right place, a macro call which will have the right offset in the file to match the hv call number
  • edit syscall.h to create a similar number for the SWINT2 interrupt dispatch table
  • edit intvec.S to add the new entry to the SWINT2 dispatch table
  • create the subroutine to actually be called from the dispatch table
  • run the magic perl script to transform hypervisor.h into an architecture dependent linker script to define the symbols for the new hv calls in the linux kernel
  • add trampolines for the hv calls in the linux kernel so you can call them from a  loadable module.
  • write a kernel module to create sysfs files and turn reads and writes into calls on the new trampolines
  • write blog entry about the above

 

How are non-engineers supposed to cope?

The Central Vac

Today the central vacuum system stuck ON.
The hose was not plugged in, and toggling the kick-plate outlet in the kitchen did not fix it.  That accounted for all the external controls.
The way this works is there is a big cylinder in the basement with the dust collection bin and large fan motor to pull air from the outlets, through the bin, and outside the house.  This is a great way to do vacuuming, because all the dusty air gets exhausted outside.
The control for the fan motor is low voltage that comes to two pins at each outlet.  When you plug in the hose, the pins are extended through the hose by spiral wires that then connect to a switch at the handle.  You can also activate the fan by shorting the pins in the outlet with a coin.  Each outlet has a cover held closed by a spring.  You open the cover to insert the hose.  The covers generally keep all the outlets sealed except the one with the hose plugged in.
The outlets are all piped together with 1 1/2 inch PVC pipe to the inlet of the  central unit.  The contact pins at all the outlets are connected in parallel, so shorting any of them turns on the motor.
We also have a kickplate outlet in the kitchen – turn it on and sweep stuff into it.  The switch for that is activated by a lever that also uncovers the vacuum pipe.
I ran around the house to make sure nothing was shorting the terminals in the outlets.
Next, I went to the cellar to look at the central unit.  Unplugging it made it stop (good!) but plugging it back in made it start again.   That was not good.
I noticed that the control wires were connected to the unit via quick connects, so I unplugged them.  The unit was still ON, which meant the fault was inside the central unit.
I stood on a chair and (eventually) figured out that the top comes off, it is like a cookie tin lid.  Inside the top was the fan motor (hot!) and some small circuit board with a transformer, some diodes, and a black block with quick connect terminals.  The AC power went to the block and the motor wires went to the block.  I imagine that the transformer and the diodes produce low voltage DC for the control circuit, and the block is a relay activated by the low voltage.
Relays can stick ON if their contacts wear and get pitted, or there could be a short that applied power to the relay coil.
I blew the dust off the circuit board, and gave the block a whack with a stick.
That fixed it.
I just don’t see what a non-engineer would do in this situation, except let the thing run until the thermal overload tripped in the fan motor (I hope it has one!) and call a service person.  Even if the service folk know how to fix it without replacing the whole unit, it is going to cost $80 to $100 for a service call.
I don’t have any special home-vacuum-system powers, but I have a general idea how it works, and a comfort level with electricity that I don’t mind taking the covers off things.  This time it worked out well.

The Dishwasher

For completeness, I should relate the story of our Kitchenaid dishwasher.  One day something went wrong with the control panel, so I took it apart.  It wasn’t working, and I thought I couldn’t make it much worse.  I was wrong about that.
I didn’t really know the correct disassembly sequence, and I took off one too many screws.  The door was open flat, and taking off the last screw let the control panel fall off, tearing a kapton flex PC board cable in two.  The flex cable connected the panel to some other circuit board.  I spent a couple of days carefully trying to splice the cable by scraping off the insulation and soldering jumpers to the exposed traces, but I couldn’t get the jumpers to stick.  New parts would have cost about $300, and the dishwasher wasn’t that new.  We eventually just bought a new Miele and that was the Right Thing To Do, because the Miele is like a zillion times better. It has built in hard-water softeners, and doesn’t etch the glasses, and doesn’t melt plastic in the lower tray, and is generally awesome.
So OK, sometimes you can fix it yourself, and sometimes you should really just call an expert.  How are you supposed to know which is the case?

The Garage Door Opener

Every few years, the opener stopped working.  It would whirr, but not open the door.  The first time this happened, I took it apart.  Now you should be really careful around garage door openers, because there is quite a lot of energy stored in the springs, but if you don’t mess with the springs, the rest of it is just gears and motors and stuff.
On mine, the cover comes off without disconnecting anything.  Inside there is a motor which turns a worm gear, which turns a regular gear, which turns a spur chain wheel, which engages a chain, which carries a traveller, which attaches to the top of the door.  The door is mostly counterbalanced by the springs.  With the cover off, you could see that the (plastic) worm gear had worn away the plastic main gear, so the motor would just spin.  The worm also drove a screw that carries along some contacts which set the full open and full closed travel, which stops and reverses the motor.  The “travel” adjustments just move the fixed contacts so the moving contacts hit them earlier or later.
An internet search located gear kits for 1/3 or 1/4 the price of a new motor, and I was able to fix it.
Last time the opener stopped working, however, the symptoms were different – no whirring.  The safety sensors appeared to be operational, because their pilot lights would blink when you blocked the light beam.  I suspected the controller circuit board had failed.  A replacement for that would be about 1/2 the cost of a new motor unit, and I wasn’t positive that was the trouble, so I just replaced the whole thing.  The new one was nicely compatible with the old tracks, springs, and sensors.
A few weeks later, my neighbor’s opener failed in the whirring mode, so we swiped the gears from my old motor unit with the bad circuit board and fixed it for free.

Take aways

Don’t be afraid to take things apart, at least if you have a reasonable expectation that you are not going to make it worse.
Or – Good judgement comes from experience, but experience comes from bad judgement. (Mulla Nasrudin)
… and just maybe, go ahead and get service contracts for complicated things with expensive repair parts, like that Macbook Pro or HE washing machine, particularly when the most-likely-to-fail part is electronic in nature.
So I usually get AppleCare, and we have a service contract for the new minivan, and for the washing machine, but not for the clothes dryer, since it doesn’t appear to have any electronics inside.  I was able to fix that by replacing the clock switch myself.
But how are non-engineers supposed to cope?
 
 

What I do

I used Splasho’s “Up-Goer Five Text Editor” to write what I do, using only the most common 1000 words in English.
In my work I tell computers what to do. I write orders for computers that tell them first to do this, and then to do that, and then to do this again.
Sometimes the orders tell the computer to listen for other orders from people. Then the orders tell the computer how to do what the people want, and then the orders tell the computer to show the people what the answer is.
I used to build computers. I would take one part, and another part, and many more parts, and put them together in just the right way so the computer would work right. Computers are all the same, they listen for an order, then do what it says, then listen for another order. We use them because they do this thing very very very very fast.

Equal Protection of the Law

I’ve been casting about for a way to follow up on my outrage of the government’s treatment of Aaron Swartz.
I wonder if the government’s conduct represents a violation of the equal protection clause of the constitution.
The 14th amendment says

…nor shall any State deprive any person of life, liberty, or property, without due process of law; nor deny to any person within its jurisdiction the equal protection of the laws.

Evidently this doesn’t apply to the federal government as written, but in Bolling v. Sharpe in 1954, the Supreme Court got to the same point via the Due Process clause of the 5th amendment.
I think all governments, state, federal, and local, are bound to provide equal protection.
In the Swartz case, we have the following mess

  • Congress writes vague laws
  • Congress fails to update those laws as technology and society evolve
  • Prosecutors use their discretion to decide who to charge
  • Prosecutors use pre-trial plea bargaining to avoid the scrutiny of the courts

It would be nice to have a case before the Supreme Court, leading to a clear ruling that equal protection applies to the actions of prosecutors. I suspect that would also give us proportional responses to crimes, although I am not sure about that.
In the medium term, Congress needs to act.  I’d suggest a law repealing all laws more than 20 years old.  Sunset provisions need to be in all laws. The ones that make ongoing sense can be reauthorized, but it will take a new vote every time.  (Maybe laws forbidding action by the government should be allowed to stand indefinitely, while laws forbidding action by the people will have limited terms.)
In the short term, we need action by the executive branch, to provide equal protection, control of pre-trial behavior of prosecutors, and accountability of both prosecutors and law enforcement.
 

AT&T Hell

Summary – AT&T customer service gives you bad information, tries to fix it and can’t, then lies about how it is “impossible”.
Update summary – Twitter works!  AT&T twitter team seems to have fixed the remaining problem.
“We don’t care, we don’t have to.” – Lily Tomlin
When I worked for IBM one summer, I wore a tie every day to see if I could do it.
When I drove an RX-7 in Palo Alto, I obeyed all the speed limits, to see if I could do it.
Last month I gave up my iPhone, to see if I could do it.
My daughter wanted an iPhone, but she’s in the middle of a two year contract on T-Mobile with a Palm Pixi.  My iPhone 4S is in the middle of a two year contract with AT&T that started October 2011.  It had the grandfathered unlimited data plan, and would be up for upgrade eligibility in May 2013.
On December 26, I called AT&T to see if I could port my number out and get a new number assigned to the iPhone, so I could let my daughter use it, while I would keep the T-Mobile phone, but with my number.  My number started out life a long time ago as a Verizon landline, with the number sequential to our home phone, so I am attached to it.  It is also on all my business cards and in countless contact lists.
AT&T said “sure”, when you port the number out, we’ll assign a new number to the iPhone and the contract will remain unchanged.
Life was good!  The daughter is happy, and I have a phone that is, um, interesting.  I also have an iPad, so don’t shed any tears about that!
A week or so later, we notice that the bill is $400ish.  There is an early termination charge on there!  You can’t actually figure out what the charge is from the online presentation.  You have to hunt up the pdf and look at the image of the printed bill.  This is a phone company, they know how to print phone bills, not how to build websites.
On the phone with customer service.  “When you ported out the number, that cancels your contract, and you get an early termination fee. Then you added a new line with new contract dates.”  I explained my call on the 26th, and the agent said, oh, well I can waive the early termination fee and make the contract be as it was. The only thing I can’t do is preserve the unlimited data plan.  So now the phone is on the 3GB plan.   I thought about balking, that unlimited plan made me feel like an old-time iPhone user, more privileged than the unwashed masses, but really, my usage is about 250 MB per month.  The iPad has a bigger screen.  So I let it slide.
A few days later, a website check showed the fees gone.  I noticed that the upgrade availability wording was different for this phone than for the other iPhone line, which also started October 2011, but decided to wait to see if other changes would catch up before calling.
A few days later, no change.  Called and learned that the second agent had waived the fees, but not fixed the contract dates.  I was assured that all would be fixed, and notes put in the account.
A few days later, no change to the upgrade language.  On calling, I was told that the contract would expire October 2013, as expected, but the upgrade eligibility date was July 2014.  What does that even mean?  After the contract is over, I can just create a new line, with a new contract and phone, and port the number!  It makes no sense to have an upgrade eligibility after the contract expiration.  Anyway, this is just stupid.  I explained that I had been told “the contract would be as it was” but the agent said there was just no way to change that in his system, the upgrade eligibility is tied to the phone number, not to the contract.
[By the way, this is also a lie, because, for example, if you are being stalked, you can request a new number and get it without any such collateral damage.]
I asked for a supervisor, who said
This should never have been allowed in the first place.  You can’t port out a number and keep the contract. It is our number.  The agents who tried to “fix” it for you went way outside our policies and made it worse.  What they should have done to correct their original mistake was to port your number back in, not to try and fix the contract. It can’t be fixed, it is impossible to change an upgrade eligibility date. It is tied to the phone number.
The supervisor said there were no higher supervisors to talk to, and no physical mail address to send a complaint to.
Well.  This supervisor was certainly polite, but either was really unable to fix the problems that AT&T created, or unwilling to do so.
At the moment, I have a nice iPhone, with a pleased daughter, but I am not pleased.  I made a perfectly sensible request.  I was told “Yes, of course you can do that” and now the account is scrambled beyond belief.
Recapping

  •  iPhone 4S, 14 months into a 24 month contract.
  • I ask to port out my number, and get a new number assigned to the phone, without contract changes.  I’m not paying them any less, I am not getting a new phone, just changing a few bits in a database somewhere about what is the number!
  • AT&T says “yes”
  • AT&T charges an early termination fee, an activation fee, cancels my unlimited data plan, restarts the 2 year contract, and resets the upgrade eligibility data.  I am not even angry about the activation fee, they deserve some fee for the work.
  • I complain.  AT&T waives the early termination fee, promises to fix the contract, but doesn’t
  • I complain.  AT&T promises to fix the contract, but only fixes the contract termination dates, the upgrade date is now 9 months after the contract expires.
  • I complain.  AT&T says “impossible to fix”
  • AT&T supervisor says “impossible to fix, and there is no one higher than me to ask”
The only thing that an upgrade date after contract expiration might mean is that AT&T would refuse to unlock the phone until it is 2 3/4 years old.  That would piss me off, but I don’t even want to ask them right now.

And by the way, the iPhone battery doesn’t work as well as it used to, and that 18 month upgrade was starting to look pretty attractive!  Instead, I will likely have to pay Apple $79 to fix it.  At least that is cheaper than the $99 Applecare I forgot to get, if nothing else goes wrong with the phone.
Now I am not a phone company marketing person, but I think I understand the essential economics of subsidized phones.  AT&T gives a substantial discount on the phone in trade for a contract commitment.  In fact, this is still a worse deal for the customer than buying an unlocked phone on a carrier with cheaper plans, like Virgin or T-Mobile, but AT&T doesn’t discount the monthly charges if you bring your own device.  That is just another way to screw the consumer.  So with AT&T, you may as well get the subsidy if you don’t mind sticking around for two years.  And they really make their money back so quickly that they let you upgrade (and restart the two year clock) after 18 months.
This is a simple deal – AT&T discounts the phone, I promise to keep paying their (high) monthly bills for two years.  This has nothing to do with the phone number!  Changing the number has utterly no effect on the money flows.
What about that number?  AT&T says it is their number, they can attach whatever they want to it.  But that is not true.  I had the number with Verizon. I ported the number to AT&T, I ported it out.  The FCC has “local number portability”.  The numbers are managed by CLECs (I think that is the term of art for phone companies) but they really can’t be taken away from users except for some arcane technical reasons.
What has happened here?  It cannot be “impossible” to fix these sorts of problems.  There may be software limitations, but those are fixable.  Or they could merely write a note to themselves saying “Yes, the system says this contract runs until July 2014, but when the customer asks, in May 2013, for an upgrade, just waive the fees.  And when the customer cancels the contract in October 2013, waive any cancellation fee.”
Instead, they’ve spent a lot of money on customer service phone calls, which are not cheap. They’ve enraged a long-standing customer who has alternatives. They’ve provided more information to the entire internet about just how bad their service and systems are.  There is no good result for AT&T here. They’ve not gained any income. They haven’t kept control of their precious number. They may well lose me as a customer come October.  (That Nexus 4 on T-Mobile is looking pretty good, or a nice unlocked iPhone 5S or whatever.) And they are defending positions and policies that make no sense competitively or economically.
I’m not sure of the next step for me.  Probably I will tweet the URL of this blog entry to @ATTCustomerCare.  At this point, AT&T can fix the problems, or they can provide me a source of continuing amusement.  There’s a rumor that sometimes people get results by writing the CEO.  At a minimum that will cost them even more money to deal with my letter.
UPDATE – I tweeted this URL to @ATTCustomerCare and they actually answered, got me on the telephone, and fixed this, well enough.  Which is to say they can’t fix it in the database, but they’ve added a special note telling other folks to honor an upgrade request on or after the correct date.  Works for me.  (1/16/2013)
You can sort of understand how enterprise software can become unwieldy, to the point where it seems easier to correct software problems and poor specifications by adding layer upon layer of special fixes and exceptions and end-runs, but it is not good for customers or efficiency to do it that way.
 
 
 
 

Buying a lemon

Last month we got a shiny new Stop and Shop grocery store here in Wayland.  They’ve been having various grand opening specials so we have been dropping by.  I went over there Sunday evening to buy blueberries (two pints for $3! in January!) but they were out of stock.  I managed to leave the shopping list at home, so I had to go by my wits, which is really not such a good idea.
I checked out using the ScanIt! gadget, and this time I remembered to wait for the coupon accepted tone before dropping my coupon in the slot.  Last time I had to have staff fish my should-have-worked coupon out of the guts of the machine and fix it, but I digress.
After finishing, I called Cathy to see what I had forgotten and she told me to remember to get a lemon and to get a rain check for the 10/$10 frozen vegetables they had run out of.  (I already had a rain check for the blueberries).
I didn’t get another ScanIt! machine for one lemon, so I went over to produce and picked out a nice lemon.  66 cents each!  Should be 50.  I carefully put it on the scale, typed in the produce code, and entered my quantity.  The machine prints a scannable sticker, which I stuck on the lemon.
At the self-checkout I scanned the lemon and touched “pay”.  While the machine thought about it, I got exact change from my wallet and began to feed in coins.  Around about 55 cents, I noticed the amount due was $4.03.  There was no cancel button.  At that point I looked at the lemon, and the sticker said “7” rather than “1”.   I think the produce machine must have a calculator style keypad, with 7 at the upper left, rather than a phone keypad with 1 at the upper left.
I blame 1200 baud modem training.  In those days, you typed way ahead of the computer, and since you knew what it was going to do, there was no real need to actually look at the screen when it caught up.
At this point, there was nothing to do but press  the I Need Help button and look sheepish.
A nice girl with bright orange hair came over and I explained.  I think this was a new one for her.  She scanned her superuser card and after flipping through some screens said “I don’t think there is any way to change an order after you start paying…. But I can refund the money.”
[Side note: The machine refunded a different collection of coins that happened to add up to 55 cents, rather than returning my coins.  I suppose this lets you overload the change and the refund mechanism.]
After she left, I entered 1 lemon through the produce lookup screens, and again hit pay, and started putting my coins in.  This time, after a few coins, the machine said $5 something or other to go.  I had done it again!  Evidently the 7 virtual lemons were still on the tab, as well as the one real lemon.  I had to call for help again.
The same girl with the bright orange hair came over, and apologized to me, apparently for my being an idiot, and this time refunded the money, and deleted the 7 lemon line item, leaving only one lemon.  I successfully paid, and fled.
It is a mixed blessing that the store was essentially deserted.  No one was there to watch my performance, but neither was there any press of work to distract the staff from chuckling over the befuddled customer.
And I forgot to get the rain check for the frozen peas.

Aaron Swartz

Aaron Swartz, 26, committed suicide the other day, evidently hounded to his death by overzealous prosecutors.
I didn’t know Mr. Swartz, and I don’t condone his actions of a couple of years ago, where it is alleged that he attached equipment to the MIT computer network to steal academic articles from the JSTOR database in order to release them to the public.
However, the more I learn about the conduct of the government in prosecuting Mr. Swartz, the angrier I get.
For those lacking any context, go read what Larry Lessig had to say in

http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

or what Cory Doctorow had to say in

http://boingboing.net/2013/01/12/rip-aaron-swartz.html

Here is the letter I’ve sent to my Senator, Elizabeth Warren.  I’ve sent a similar letter to Sen. John Kerry

I call to your attention the recent suicide of Aaron Swartz.  It looks
very much to me like the US Justice Department hounded him to his
death by overzealous prosecution of a victimless “crime” if it even was
a crime.

Larry Lessig writes on the case:
http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully
I would like to know what you are doing to hold the prosecutors and
their bosses at Justice to account for this affair.
I voted for you in part for your history of representing the issues
of ordinary people against big business.  Please also represent us
against the oppressive power of government.
-Larry Stewart
I’ve sent the following email to Rafael Reif, President of MIT

I understand that the Swartz affair started before you became president of MIT, but I think you should explain to the community what happened, why it happened, and exactly what principles MIT holds.

From what I’ve heard, MIT provided the pretext necessary for the US Attorney ****** to hound Aaron Swartz to his death.

See, for example, Larry Lessig’s account at

http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

It may well be that Mr. Swartz was guilty of something, and it may be that MIT favored prosecution, but once MIT started such a ball rolling, MIT became responsible in part for the damage it caused.  At a minimum, MIT had an obligation to track the case and to speak out loudly when it began to go off the rails of proportional justice in such a dramatic way.

-Larry Stewart ’76

(name removed because I am not sure I got it right)

I don’t know what the right answers are in this case, but I am beginning to think we should handle failures of justice in the same way we handle airplane crashes.  Do we need an equivalent of the National Transportation Safety Board to investigate?  Such a group could find out what happened, why it happened, and what legal, procedural, training, and technical measures are needed to keep it from happening again.  And their reports and proceedings should be open.
We now have so many laws and crimes, and so many are ill-defined, that likely everybody is “guilty” of something.  When the full oppressive power of government can be brought to bear on anyone at the discretion of individuals or groups with their own agenda, then no one is safe.

UPDATE

About an hour after I wrote to MIT President Reif, he wrote to the community.  Obviously he’s well ahead of me on this one, since his message must already have been in progress.  Professor Hal Abelson will be leading a thorough analysis of MIT’s involvement.  I await the report with interest.
http://web.mit.edu/newsoffice/2013/letter-on-death-of-aaron-swartz.html