The obvious missing feature

I think there are great opportunities for sensible people to make money doing usability analyses of web-based systems.

Let me give some examples of well-intentioned systems with the obvious feature left out.

Email addresses

I have a Capital One credit card, and in my user profile, there is a place to enter an email address so they can send me stuff.  (In another post I will rant about email addresses further.)  Recently I happened to log in to set up alerts for spending and so forth.  The email notifications were disabled because, they said, the email address I had entered had been refused.  Yet the address was actually correct.

This is not unknown.  We had a crash a while back of our cloud email server, and we didn’t notice for hours, so it is possible mail was bounced.

There was no way to tell the Capital One system “test it now please”.  Instead, I had to change the address to a different one.  This made them happy even without a test.  I suppose I could then change it back, but how much time do I have to spend working around a bad design?

Phone numbers

Many sites require phone numbers.  They have no uniform way of entry.  Some have free form fields, but limited to exactly 10 characters.  Some forbid hyphens.  Some require hyphens.  Some have exactly three fields, for area code, exchange, and number.  Is it really that hard to parse a variety of formats?  Do they really think making me keypunch my number is helping their image?
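To show how little it takes to accept a variety of formats, here is a minimal sketch in Python. It assumes US-style 10-digit numbers with an optional leading 1; the function name normalize_us_phone and the sample numbers are made up for illustration, and a real site would also want to handle extensions and international numbers.

```python
import re

def normalize_us_phone(text):
    """Return the 10 digits of a US-style number, or None if it can't be parsed."""
    digits = re.sub(r"\D", "", text)      # drop spaces, hyphens, dots, parentheses, "+"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]               # strip a leading country code
    return digits if len(digits) == 10 else None

# Every one of these spellings reduces to the same ten digits.
for sample in ["508-555-0123", "(508) 555 0123", "508.555.0123",
               "+1 508 555 0123", "5085550123"]:
    print(sample, "->", normalize_us_phone(sample))
```

The form can then store or display the number however it likes, without making the customer guess the expected format.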


I have my bank account and credit cards set up to send me text notifications when there is activity. One bank only allows notifications for amounts above $100.  Why does that even make sense? They can handle small deposits, but they can’t handle sending a text for a $10 charge? At least the text on the page explains the limit.

A credit card company has the same feature, but allows texts for any transaction amount, except $0! If I want notifications on all transactions, what limit value should I use?  I telephoned, and the agent suggested $0.01.

I’m getting to be a curmudgeon when things like this offend me.


Notifications – unclear on the concept

This is a post about organizations trying to communicate with their customers but getting it wrong.

I have signed up for various notifications, typically by text or email.  Tragically, sometimes organizations manage to use these in a way that makes me think they are idiots.

  • I just received a text from my local library that a book I’ve had on hold forever has come in.  The problem is that I picked it up last night.
  • I got an email from my Honda dealer that my minivan is due for service – two days after the service was done, by them.
  • I get both emails and texts from Target that my store credit card payment due date is coming up — even though my balance is zero.

To me these seem like violations of a simple and obvious design principle:  don’t send a notification that is moot.  All it does is point out to your customer that your systems are broken.  And that means that your organization is clueless and really should not be trusted with my business.
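The principle amounts to re-checking the current state at send time rather than trusting whatever queued the message. A minimal sketch in Python, where the Hold record and both function names are hypothetical and stand in for whatever the library or card issuer actually stores:

```python
from dataclasses import dataclass

@dataclass
class Hold:
    title: str
    picked_up: bool   # current state, looked up at send time

def should_notify_hold_ready(hold):
    """A 'your book is in' text is moot once the book has been picked up."""
    return not hold.picked_up

def should_notify_payment_due(balance_cents):
    """A 'payment due' reminder is moot when nothing is owed."""
    return balance_cents > 0

print(should_notify_hold_ready(Hold("some book", picked_up=True)))
print(should_notify_payment_due(0))
```

Both checks are one line each, which is rather the point.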

Delay is also important.  I have my Bank of America profile set so that I get texts notifying me of ATM withdrawals.  I should get them when I do a withdrawal, but never at other times.  Often, these arrive within minutes, but sometimes, they take 6 hours or so to arrive.  The immediate feedback ratchets up my confidence that I would find out immediately if fraudulent activity were to occur.  The delayed feedback?  It has the opposite effect.  I obviously cannot trust BofA systems to notify me of activity in a timely way.  Should I trust them for anything else?


Cryptographic Modules

Steve Bellovin has a post The Uses and Abuses of Cryptography in which he comments on the recent Anthem data breach.  At Anthem, supposedly, the database of important stuff like customer addresses and social security numbers was not encrypted, because it was in use all the time.  Steve says, “If your OS is secure, you don’t need the crypto; if it’s not, the crypto won’t protect your data.”

His point is that the decryption keys have to be in RAM somewhere for the system to work, so if the OS is insecure, the keys can be stolen, and the encrypted database decrypted anyway.  This is not necessarily true.  IBM (see, for example IBM PCIe Cryptographic Coprocessor) and others make hardware units that provide encryption and decryption, and store the master keys.  With appropriate hardware, the keys are NOT in RAM and can’t be stolen.  This still isn’t enough, because a compromised host system can still command the crypto box to decrypt the data.  To go further, you have to have velocity checks inside the trusted part of the system, to alarm on and halt unexpected volumes of traffic.
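A velocity check of the kind described could look roughly like the following Python sketch. It is only an illustration of the idea: the VelocityCheck class, its parameters, and the sliding-window policy are assumptions, not how any particular cryptographic module works, and real hardware would enforce this inside the trusted boundary rather than in host code.

```python
import time
from collections import deque

class VelocityCheck:
    """Allow at most max_requests decryptions per window; halt for good on excess."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.halted = False
        self._clock = clock
        self._times = deque()   # timestamps of recently allowed requests

    def allow(self):
        if self.halted:
            return False        # stays halted until an operator intervenes
        now = self._clock()
        while self._times and now - self._times[0] >= self.window:
            self._times.popleft()
        if len(self._times) >= self.max_requests:
            self.halted = True  # unexpected volume: alarm and stop decrypting
            return False
        self._times.append(now)
        return True

check = VelocityCheck(max_requests=3, window_seconds=60)
print([check.allow() for _ in range(5)])
```

Halting rather than merely throttling is a design choice in this sketch: a burst of decryption requests is treated as evidence of compromise, not just load.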

The data itself also has to be carefully organized. Each record has to have its own key.  Record keys are stored in the database encrypted by a master key, which is only stored in the cryptographic hardware module.

It is probably going to be impossible to prevent theft of individual records.  An insider can always photograph data off the screen in a call center.  I think we can do much better about technical means to prevent bulk data breaches.

There is a whole new area of research on how to make cloud computing trustworthy.  How can you get anything done when your code is running on potentially compromised hardware or on a virtual machine pwned by the bad guys?  It might be possible!  Homomorphic encryption makes it possible to perform computations on encrypted data, and perhaps cloud servers will at least come with cryptographic modules that at least can limit the rate at which your data can be stolen.

Update: February 24, 2015

Steve points out via email that many kinds of tasks, such as a batch job generating annual statements, have to touch all records, so rate-limiting (velocity checks) might not be effective.

Net Neutrality

I wrote a letter to the editor of the Wall Street Journal today.  In my opinion, Internet service providers and backbone providers should be “common carriers”.  They should not be allowed to charge different rates for different bits, and they shouldn’t be allowed to even look at the traffic other than for routing.  Today I was so offended by the disingenuousness and misrepresentation of L. Gordon Crovitz’ op-ed that I felt compelled to respond:

Timothy Lemmer
Letters Editor
Wall Street Journal

Regarding “The Great Internet Power Grab” by L. Gordon Crovitz, Feb. 8, 2015.  Mr. Crovitz is misinformed or disingenuous.

The FCC proposes to reclassify broadband Internet access services – consumer access to the net – as a telecommunications service rather than as an information service.  The FCC does not propose to regulate content providers or startups providing innovative services, or end users of any sort.

Mr. Crovitz proposes we should be so afraid of unlikely future abuses by regulators that we should not move to stem current and actual abuses by the cable and telephone industries that provide the majority of internet access.

  • Verizon spies on customer communications to install tracking cookies (1)
  • Comcast demands payments from content provider Netflix merely to get access to customers (2)
  • ATT blocks customers who attempt to encrypt their own email (3)

These are actual abuses by companies exploiting their near monopoly positions to damage competition, harm innovation, and endanger customer privacy.

It would be great if Congress would get its act together to promote innovation and forbid discrimination.  Until then, the FCC appears to be doing its best to protect the public from the telecom companies who are the current unaccountable gatekeepers of the net.

Lawrence Stewart
Wayland, MA



Windows 7 Disk Upgrade

It is a mystery to me why laptop makers charge such a premium for SSDs.  Well, no, it’s not a mystery, they do it because they can.  Part of the reason is that it is such a pain, in the Windows world, to upgrade.

Cathy recently got a new HP ProBook 640 G1, replacing her ancient Vista machine.  The new laptop came with a 128 GB SSD, which served its purpose of demonstrating how dramatically faster the SSD is than a regular hard drive, but it is too small.  Her old machine, after deleting about 50 GB of duplicate stuff, was already at 128.

It is much cheaper to buy an aftermarket 256 GB SSD than to buy the same laptop with a larger SSD, so we set about an upgrade.

HP Laptops, at least this one, do not ship with install disks, instead, they come with a 12 GB “recovery partition” that soaks up even more of the precious space.  You can reinstall the OS from the recovery partition as often as you like, or you can, exactly once, make a set of recovery DVDs or a recovery USB drive.

There are two main paths to doing a disk upgrade:

  • Replace the disk, and reinstall from the recovery media
  • Replace the disk, make the old disk an external drive, and clone the old disk to the new one.

The first path is less risky, so we tried that first.  I had purchased a nice, large USB3 thumb drive for the purpose, and … the HP recovery disk creator would not create a USB!  What is this, 2004?  HP support is actually quite good, and I suppose that is part of what you pay for when you buy a “business” notebook.  They were surprised by this lack of functionality, since it is supposed to work, and eventually decided to send us recovery media.  They sent DVDs, which is not what we want, but fine.

The HP media worked fine to install onto the new 256 GB SSD, but did not restore much of the HP add-on software.  Most manufacturer add-on software is crapware, but HP’s isn’t bad.  We got most of the missing bits from the website except for the HP documentation!  You can get the PDF files for the user and service manuals, but not the online HP Documentation app.

Our plan was eventually to trickle down the 128 GB SSD to one of the kids, so we didn’t mind using up its ability to create recovery media, so we tried that next.  Rather than screw up the almost-working 256 GB drive, we installed an old 160 GB drive from Samantha’s old Macbook (replaced earlier by an SSD).

The home-created recovery media did better, installing all the HP add-ons…except the documentation!

Now with three working drives, and two sets of recovery disks, I felt confident enough to try the alternative: cloning the original drive.  I had a copy of Acronis True Image 2010, but couldn’t find the disk for it.  The new SSD came with a copy of True Image 2014, but first I read up on the accumulated wisdom of the Internet.  There’s a guy, GroverH, on the Acronis forums (see ) who has an astonishing set of howtos.

Manufacturers who use recovery partitions really don’t want you to clone drives; perhaps this is pressure from Microsoft.  It works fine if the new drive is exactly the same as the old one, but if not, unless the partition sizes are exactly the same, the result is not likely to work.  The cloning software will scale the partitions if you restore to a bigger drive, but they won’t work.  You have to manually tweak the partition arrangement.  Typically the recovery partition is at the end, the boot partition is at the beginning, and the “C:” drive uses the space in between.

Now earlier when I couldn’t find the True Image install disk on another project, I tried the Open Source CloneZilla and was quite happy with it.  It is not for the faint-hearted, but it seems reliable. I used CloneZilla to make a backup of the original drive, and then, because the recovery media had already created a working partition structure, merely restored C: to the C: of the experimental 160 GB drive.  Windows felt like it had to do a chkdsk, but after that it worked, and lo, the HP documentation was back!  (And Cathy’s new screen background.)

As the last step, we put the 256 GB SSD back in, and used CloneZilla to restore C: and the HP_TOOLS partition contents that weren’t quite the same in the original and recovered versions.


So, contrast to a disk upgrade on a Mac:  Put in new drive, restore from Time Capsule, done.  And this restores all user files and applications!

Next challenge: migrating Cathy’s data files and reinstalling applications.  Memo to Microsoft:  it is just unreasonable that in this new century we still have to reinstall applications one by one.


Hotel Internet – Hyatt French Quarter

I write from my room at the Hyatt French Quarter.

Your hotel internet service stinks.

I would rather stay in a Hampton Inn or like that than a Hyatt.  You know why?  The internet service in cheap hotels just works.  Yours does not.

You advertise “free internet”, but it costs rather a lot in the inconvenience and irritation of your customers, who are paying you quite a lot of money for a nice experience.

I have three devices with me.  A laptop, a tablet, and a phone.  On each one, every day of my stay, at (apparently) a random time, each one stops working and I have to connect again.

Here is what that takes:

  • Try to use my email.  Doesn’t work
  • Remember that I have to FIRST use a web browser.
  • Connect to hotel WiFi (ok, this step is expected, once)
  • Get browser intercept screen
  • Type in my name and room number
  • Wait
  • Read offer to pay $5 extra for “good” internet service, rather than crappy. The text says this offer “lasts as long as your current package”.  Is that per day? Per stay? What?
  • Click “continue with current package”
  • Wait
  • Get connected to FACEBOOK.

Why?  I can’t explain it.  People my age think Facebook is something kids use to share selfies.  The kids think Facebook is for, I don’t know, old people; they are all on Twitter.

Then I have to remember what I wanted to do.

Are you serious?  Do you think this process, repeated for my three devices, EVERY DAY, is going to make me recommend your hotel?

Now let us talk about privacy.

It irritates me that you want my name and room number. I do not agree that you can track my activities online.  It is none of your business.  I run an encrypted proxy server back home.  So all your logs will show is that I set up one encrypted connection to the cloud for my web access.  My email connections are all encrypted.  My remote logins to the office are all encrypted.  My IMs are encrypted.
I read the terms and conditions, by the way.  They are linked off the sign on page.   They are poorly written legalese, and there are a number of ways to read them.  One way says that you track all my connections to websites but only link them to my personally identifiable information if you need to “to enforce our terms and conditions”.  They also say that you have no obligation to keep my activities confidential.  And who or what is Roomlynx?

Even if your terms said otherwise, I wouldn’t believe you.  I don’t trust you OR your service providers.

Here’s my suggestion:

I think all this effort you’ve gone to is a waste of time, effort, and money. You do not have the technical means to monitor or control how I use the net anyway, so why make your customers jump through hoops?

If your lawyers tell you these steps are necessary, get different lawyers who have a clue.  If you still think it is necessary, have the terms and conditions be attached to the room contract!

If you seriously have a problem with non-guests soaking up your bandwidth, then by all means add a WiFi password, and hand it out at checkin.

If you seriously have a problem with bandwidth hogs, then slow down the connections of actual offenders.

Basically, try your best to make the Internet work as well as the electricity you supply to my room.  I turn on the switch, the lights go on. Done.

By the way, modern OS’s, like Apple’s MacOS Yosemite, frequently change the MAC address they use. This will likely break your login system, raising the frustration of your guests even more.  They will not blame Apple for trying to protect their privacy.  They will blame you.  I already do.

PS  I don’t like to help you debug a system that is fundamentally broken, but:

  • The hotel website still says Internet costs $9.95 per day.  Update that maybe?
  • There is no way to go back and pay the extra $5 for better service once you’ve found out how crappy the regular stuff is.
  • After you connect, you can no longer find the terms and conditions page
  • I accidentally tried to play a video, and your freaking login screen showed up in the video pane.  That just makes you look even sillier.

Random Walks

One blog I follow is GÖDEL’S LOST LETTER

In the post Do Random Walks Help Avoid Fireworks, Pip references George Polya’s proof that on regular lattices in one and two dimensions, a random walk returns to the origin infinitely many times, but in three dimensions, the probability of ever returning to the origin is strictly less than one.

He references a rather approachable paper explaining this by Shrirang Mare: Polya’s Recurrence Theorem, which explains a proof of this matter using lattices of resistors in an electrical circuit analogy.  The key is that there is infinite resistance to infinity in one or two dimensions, but strictly less than infinite resistance to infinity in three dimensions.
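The theorem is easy to poke at numerically. The Python sketch below walks a simple cubic lattice and counts how many walks get back to the origin within a fixed number of steps. The function name, the walk count, and the step cutoff are arbitrary choices for illustration, and because the walks are truncated, the figures it prints are lower bounds on the true return probabilities rather than the limits themselves.

```python
import random

def estimate_return_probability(dim, walks=2000, max_steps=200, seed=0):
    """Fraction of random walks on a dim-dimensional lattice that revisit
    the origin within max_steps steps."""
    rng = random.Random(seed)
    returned = 0
    for _ in range(walks):
        pos = [0] * dim
        for _ in range(max_steps):
            axis = rng.randrange(dim)            # pick one coordinate...
            pos[axis] += rng.choice((-1, 1))     # ...and step one unit along it
            if not any(pos):                     # all coordinates zero: back home
                returned += 1
                break
    return returned / walks

for dim in (1, 2, 3):
    print(dim, "dimensions:", estimate_return_probability(dim))
```

Raising max_steps pushes the one- and two-dimensional estimates toward 1 (slowly, in two dimensions), while the three-dimensional estimate levels off well below it.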

This is all fine, but there is another connection in science fiction. In 1959, E.E. “Doc” Smith’s The Galaxy Primes was published in Amazing Stories.

Our Heroes have built a teleporting starship, but they can’t control where it goes.  The jumps appear long and random.  Garlock says to Belle:

“You can call that a fact. But I want you and Jim to do some math. We know that we’re making mighty long jumps. Assuming that they’re at perfect random, and of approximately the same length, the probability is greater than one-half that we’re getting farther and farther away from Tellus. Is there a jump number, N, at which the probability is one-half that we land nearer Tellus instead of farther away? My jump-at-conclusions guess is that there isn’t. That the first jump set up a bias.”

“Ouch. That isn’t in any of the books,” James said. “In other words, do we or do we not attain a maximum? You’re making some bum assumptions; among others that space isn’t curved and that the dimensions of the universe are very large compared to the length of our jumps. I’ll see if I can put it into shape to feed to Compy. You’ve always held that these generators work at random—the rest of those assumptions are based on your theory?”

Garlock is right – this is a three dimensional random walk and tends not to return to its starting place, but James is wrong when he says this isn’t in any of the books.  Polya proved it in 1921.



This might be the 1000th blog posting on this general topic, but for some reason, the complexity of booting grows year over year, sort of like the tax code.

Back in 2009, Win and I built three low power servers, using Intel D945GCLF2 mini-ITX motherboards with Atom 330 processors.  We put mirrored 1.5 Terabyte drives in them, and 2 GB of RAM, and they have performed very well as pretty low power home servers.  We ran the then-current Ubuntu, and only sporadically ran apt-get update and apt-get upgrade.

Fast forward to this summer.  We wanted to upgrade the OS’s, but they had gotten so far behind that apt-get update wouldn’t work.  It was clearly necessary to reinstall.  Now one of these machines is our compound mail server, and another runs mythtv and various other services.  The third one was pretty idle, just hosting about a terabyte of SiCortex archives.  In a previous blog post I wrote about the month elapsed time it took me to back up that machine.

This post is about the adventure of installing Ubuntu 12.04 LTS on it.  (LTS is long term support, so that in principle, we will not have to do this again until 2017.  I hope so!)

Previously, SMART tools were telling us that the 2009 era desktop 1.5T drives were going bad, so I bought a couple of 3T WD Red NAS drives, like the ones in our Drobo 5N.  Alex (my 14 year old) and I took apart the machine and replaced the drives, with no problem.

I followed directions from the web on how to download an ISO and burn it to a USB drive using MacOS tools.   This is pretty straightforward, but not obvious.  First you have to convert the iso to a dmg, then use dd to copy it to the raw device:

hdiutil convert -format UDRW -o ubuntu-12.04.3-server-amd64.img ubuntu-12.04.3-server-amd64.iso
# Use diskutil list, then plug in a blank USB key >the image size, run diskutil list again to find the drive device.  (In my case /dev/disk2)
sudo dd if=ubuntu-12.04.3-server-amd64.img.dmg of=/dev/disk2 bs=1m
# notice the .dmg extension that MacOS insists on adding
diskutil eject /dev/disk2   # or whatever device diskutil list showed

Now in my basement, the two servers I have are plugged into a USB/VGA monitor and keyboard switch, and it is fairly slow to react when the video signal comes and goes.  In fact it is so slow that you miss the opportunity to type “F2” to enter the BIOS to set the boot order.  So I had to plug in the monitor and keyboard directly, in order to enable USB booting.  At least it HAS USB booting, because these machines do not have optical drives, since they have only two SATA ports.

Anyway, I was able to boot the Ubuntu installer.  Now even at this late date, it is not really well supported to install onto a software RAID environment.  It works, but you have to read web pages full of advice, and run the partitioner in manual mode.

May I take a moment to rant?  PLEASE DATE YOUR WEB PAGES.  It is astonishing how many sources of potentially valuable information fail to mention the date or versions of software they apply to.

I found various pieces of advice, plus my recollection of how I did this in 2009, and configured root, swap, and /data as software RAID 1 (mirrored disks).  Ubuntu ran the installer, and… would not reboot.  “No bootable drives found”.

During the install, there was an anomaly, in that attempts to set the “bootable” flag on the root filesystem partitions failed, and when I tried it using parted running in rescue mode, it would set the bootable flag, but clear the “physical volume for RAID” flag.

I tried 12.04.  I tried 13.04.  I tried 13.04 in single drive (no RAID).  These did not work. The single drive attempt taught me that the problem wasn’t the RAID configuration at all.

During this process, I began to learn about GPT, or GUID partition tables.

Disks larger than 2T can’t work with MBR (master boot record) style partition tables, because MBR stores sector addresses as 32-bit integers, which are too small.  Instead, there is a new GPT (GUID partition table) scheme that uses 64-bit numbers.
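The arithmetic behind the 2T limit is short enough to check directly. This Python sketch assumes the traditional 512-byte logical sector; drives that present larger sectors move the limit accordingly.

```python
SECTOR_BYTES = 512                      # assumed logical sector size

mbr_limit = 2**32 * SECTOR_BYTES        # 32-bit sector addresses (MBR)
gpt_limit = 2**64 * SECTOR_BYTES        # 64-bit sector addresses (GPT)
drive_bytes = 3 * 10**12                # a "3T" drive as marketed

print("MBR limit in bytes:", mbr_limit)
print("3T drive exceeds MBR limit:", drive_bytes > mbr_limit)
print("3T drive exceeds GPT limit:", drive_bytes > gpt_limit)
```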

Modern computers also have something called UEFI instead of BIOS, and UEFI knows about GPT partition tables.

The Ubuntu installer knows that large disks must use GPT, and does so.

Grub2 knows this is a problem, and requires the existence of a small partition flagged bios_grub, as a place to stash its code, since GPT does not have the blank space after the sector 0 boot code that exists in the MBR world (which grub uses to stash code).

So Ubuntu creates the GPT, the automatic partitioning step creates the correct mini-partition for grub to use, and it seems to realize that grub should be installed on both drives when using an MD filesystem for root. (it used the command line grub-install /dev/sda /dev/sdb) Evidently the grub install puts a first stage loader in sector 0, and the second stage loader in the bios_grub partition.

Many web pages say you have to set the “bootable” flag on the MD root, but parted will not let you do this, because in GPT, setting a “bootable” flag is forbidden by the spec.  Not clear it would work anyway because when you set it, the “physical volume for raid” flag is turned off.

The 2009 Atom motherboards do not have a UEFI compatible BIOS, and are expecting an MBR. When they don’t find one, they give up.  If they would just load the code in sector 0 and jump to it, it would work. I considered doing a BIOS update, but it wasn’t clear the 2010 release is different in this respect.

So the trick is to use FDISK to create an MBR with a null partition.  This is just enough to get past the Atom BIOS’ squeamishness and have it execute the grub loader, which then works fine using the GPT.  I got this final trick from whose final text is

boot off a live CD and run fdisk against the boot disk. It’ll give a bunch of scary warnings. Ignore them. Hit “a”, then “1”, then “w” to write it to disk. Things ought to work then.

The sequence of steps that worked is:

Run the installer
Choose manual disk partitioning
Choose "automatically partition" /dev/sda
This will create a 1 MB bios_grub partition and a 2GB swap, and make the rest root
Delete the root partition
Create a 100 GB partition from the beginning of the free space
Mark it "physical volume for RAID" with a comment that it is for root 
Use the rest of the free space (2.9T) to make a partition, mark it physical volume for raid.  Comment that it is for /data
Change the type of the swap partition to "physical volume for raid"
Repeat the above steps for /dev/sdb

Run "configure software RAID"
Create MD volume, using RAID 1 (mirrored)
Select 2 drives, with 0 spares
Choose the two swap partitions
Mark the resulting MD partition as swap 
Create MD volume, RAID 1, 2, and 0
Select the two 100 GB partitions
Mark them for use as EXT4, to be mounted on /
Create MD volume, RAID 1, 2, and 0
Select the two 2.9T partitions
Mark them for use as EXT4, to be mounted on /data 

(I considered BTRFS, but the most recent comments I could find still seem to regard it as flakey)

Save and finish installing Ubuntu

Pretend to be surprised when it won't boot.  "No bootable disks found"

Reboot from the installer USB, choose Rescue Mode
Step through it. Do not mount any file systems, ask for a shell in the installer environment.
When you get a prompt,

fdisk /dev/sda


fdisk /dev/sdb

^d and reboot. Done

Now I have a working Ubuntu 12.04 server with mirrored 3T drives.

Hypervisor Hijinks

At my office, we have a rack full of Tilera 64-core servers, 120 of them. We use them for some interesting video processing applications, but that is beside the point. Having 7680 of something running can magnify small failure rates to the point that they are worth tracking down. Something that might take a year of runtime to show up can show up once an hour on a system like this.

Some of the things we see tend, with some slight statistical flavor, to occur more frequently on some nodes than on others. That just might make you think that we have some bad hardware. Could be. We got to wondering whether running the systems at slightly higher core voltages would make a difference, and indeed, one can configure such a thing, but basically you have to reprogram the flash bootloaders on 120 nodes. The easiest thing to do was to change both the frequency and the voltage, which isn’t the best thing to do, but it was easy. The net effect was to reduce the number of already infrequent faults on the nodes where they occurred, but to cause, maybe, a different sort of infrequent fault on a different set of nodes.

Yow. That is NOT what we wanted.

We were talking about this, and I said about the stupidest thing I’ve said in a long time. It was, approximately:

I think I can add some new hypervisor calls that will let us change the core voltage and clock frequency from user mode.

This is just a little like rewiring the engines of an airplane while flying, but if it were possible, we could explore the infrequent fault landscape much more quickly.

But really, how hard could it be?

Tilera, to their great credit, supplies a complete Multicore Development Environment which includes the linux kernel sources and the hypervisor sources.

The Tilera version of Linux has a fairly stock kernel which runs on top of a hypervisor that manages physical chip resources and such things as TLB refills. There is also a hypervisor “public” API, which is really not that public, it is available to the OS kernel. The Tilera chip has 4 protection rings. The hypervisor runs in kernel mode. The OS runs in supervisor mode, and user programs can run in the other two. The hypervisor API has things like load this page table context, or flush this TLB entry, and so forth.

As part of the boot sequence, one of the things the hypervisor does is to set the core voltage and clock frequency according to a little table it has. The voltage and frequency are set together, and the controls are not accessible to the Linux kernel or to applications. Now it is obviously possible to change the values while running, because that is what the boot code does. What I needed to do was to add some code to the hypervisor to get and set the voltage and frequency separately, while paying attention to the rules implicit in the table. There are minimum and maximum voltages and frequencies beyond which the chip will stop working, and there are likely values that will cause permanent damage. There is also a relation between the two – generally higher frequencies will require higher voltages. Consequently it is not OK to set the frequency too high for the current voltage, or to set the voltage too low for the current frequency.
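The rules in that last part can be sketched as a pair of checks. This is Python rather than hypervisor C, and every number in it is hypothetical: the operating points, the limits, and the function names are invented to illustrate the shape of the check, not taken from the Tilera table.

```python
# Hypothetical operating points: (frequency in MHz, minimum voltage in mV).
OPERATING_POINTS = [(500, 900), (700, 1000), (866, 1100)]
F_MIN, F_MAX = 500, 866      # hypothetical frequency limits
V_MIN, V_MAX = 900, 1200     # hypothetical voltage limits

def min_voltage_for(freq_mhz):
    """Lowest voltage the table allows at this frequency."""
    for table_freq, table_mv in OPERATING_POINTS:
        if freq_mhz <= table_freq:
            return table_mv
    raise ValueError("frequency above table maximum")

def frequency_ok(new_freq_mhz, current_mv):
    """Not OK to set the frequency too high for the current voltage."""
    return (F_MIN <= new_freq_mhz <= F_MAX
            and current_mv >= min_voltage_for(new_freq_mhz))

def voltage_ok(new_mv, current_freq_mhz):
    """Not OK to set the voltage too low for the current frequency."""
    return (V_MIN <= new_mv <= V_MAX
            and new_mv >= min_voltage_for(current_freq_mhz))

print(frequency_ok(866, 1000))   # too fast for 1000 mV in this made-up table
print(voltage_ok(1100, 700))
```

One consequence of checking each change against the current value of the other setting is that order matters: to speed up, raise the voltage first and then the frequency; to slow down, do the reverse.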

Fine. Now I have subroutine calls inside the hypervisor. In order to make them available to a user mode program running under Linux, I have to add hypervisor calls for the new functions, and then add something like a loadable kernel module to Linux to call them and to make the functionality available to user programs.

The kernel piece is sort of straightforward. One can write a loadable kernel module that implements something called sysfs. These are little text files in a directory like /sys/kernel/tilera/ with names like “frequency” and “voltage”. Through the magic of sysfs, when an application writes a text string into one of these files, a piece of code in the kernel module gets called with the string. When an application reads one of these files, the kernel module gets called to provide the text.
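From the application side, such files are read and written like any other text file. A minimal Python sketch follows; the helper names are made up, the value format (a plain integer string) is an assumption, and the demo runs against a temporary directory so it does not need the real /sys/kernel/tilera/ files to exist.

```python
import tempfile
from pathlib import Path

def read_setting(base_dir, name):
    """Read one sysfs-style text file and strip the trailing newline."""
    return (Path(base_dir) / name).read_text().strip()

def write_setting(base_dir, name, value):
    """Write a value as a text string, the way an application would to sysfs."""
    (Path(base_dir) / name).write_text(f"{value}\n")

# Demo against a scratch directory standing in for /sys/kernel/tilera/.
with tempfile.TemporaryDirectory() as demo_dir:
    write_setting(demo_dir, "frequency", 700)
    print("frequency reads back as", read_setting(demo_dir, "frequency"))
```

On the real system, base_dir would be the sysfs directory and the write would land in the kernel module’s handler instead of on disk.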

Now, with the kernel module at the top, and the new subroutines in the hypervisor at the bottom, all I need to do is wire them together by adding new hypervisor calls.

Hypervisor calls made by linux go through hypervisor glue. The glue area starts at 0x10000 above the base of the text area, and each possible call has 0x20 bytes of instructions available.

Sometimes, such as “nanosleep”, the call is implemented inline in those 0x20 bytes. Mostly, the code in the glue area loads a register with a call number and does a software interrupt.

The code that builds the glue area is in hv/tilepro/glue.S.

For example, the nanosleep code is

        /* Each time through the loop we consume three cycles and
         * therefore four nanoseconds, assuming a 750 MHz clock rate.
         * TODO: reading a slow SPR would be the lowest-power way
         * to stall for a finite length of time, but the exact delay
         * for each SPR is not yet finalized.
         */
          sadb_u r1, r0, r0
          addi r0, r0, -4
          add r1, r1, r1 /* force a stall */
          bgzt r0, hv_nanosleep
        jrp lr
while most others are


or the like, where GENERIC_SWINT2 is a macro:

#define GENERIC_SWINT2(name) \
        .align ALIGN ; \
        moveli TREG_SYSCALL_NR_NAME, HV_SYS_##name ; \
        swint2 ; \
        jrp lr ;
The glue.S source code is written in a positional way, like

so the actual address of the linkage area for a particular call like flush_page depends on the exact sequence of items in glue.S. If you get them out of order or leave a hole, then the linkage addresses of everything later will be wrong. So to add a hypercall, you add items immediately after the last GENERIC_SWINT2 or ILLEGAL_SWINT2.
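To see why order matters, here is a small Python model of the positional layout. The 0x10000 start and 0x20 slot size come from the description above; the entry names are placeholders, and the assumption that the first item occupies slot 0 is mine.

```python
GLUE_OFFSET = 0x10000   # glue area start, relative to the base of the text area
SLOT_BYTES = 0x20       # bytes of instructions available per call

def glue_offsets(entries):
    """Map each entry to its offset, purely by its position in the list."""
    return {name: GLUE_OFFSET + index * SLOT_BYTES
            for index, name in enumerate(entries)}

entries = ["call_a", "call_b", "call_c", "call_d"]    # placeholder names
before = glue_offsets(entries)
after = glue_offsets([e for e in entries if e != "call_b"])   # one item dropped

for name in ("call_a", "call_c", "call_d"):
    print(name, hex(before[name]), "->", hex(after[name]))
```

Dropping one item leaves everything before it in place and shifts everything after it down by one slot, which is exactly the failure described: callers jumping to the old addresses would land in the wrong call.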

In the case of the set_voltage calls we have:


With this fixed point, we work in both directions, down into the hypervisor to add the call and up into linux to add something to call it.

Looking back at the GENERIC_SWINT2 macro, it loads a register with the value of a symbol like HV_SYS_##name where name is the argument to GENERIC_SWINT2. This is using the C preprocessor token-pasting operator ##, which concatenates tokens. So


expects a symbol named HV_SYS_get_voltage. IMPORTANT NOTE – the value of this symbol has nothing to do with the hypervisor linkage area, it is only used in the swint2 implementation. The HV_SYS_xxx symbols are defined in hv/tilepro/syscall.h and are used by glue.S to build the code in the hypervisor linkage area and also used by hv/tilepro/intvec.S to build the swint2 handler.
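Here is a small sketch of what ## does, next to the single # operator it is easy to confuse it with. The macro names SYSCALL_NR and SYSCALL_NAME are made up for this example, and the value 58 for HV_SYS_get_voltage is simply what shows up in the moveli instruction in the disassembly later in this post; the real definitions live in hv/tilepro/syscall.h.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the definition in hv/tilepro/syscall.h (value assumed). */
enum { HV_SYS_get_voltage = 58 };

/* ## pastes tokens: SYSCALL_NR(get_voltage) becomes the single
 * identifier HV_SYS_get_voltage, which the compiler then looks up. */
#define SYSCALL_NR(name) HV_SYS_##name

/* # stringifies: SYSCALL_NAME(get_voltage) becomes "get_voltage". */
#define SYSCALL_NAME(name) #name

/* Returns the number behind the pasted identifier. */
int pasted_number(void)
{
    return SYSCALL_NR(get_voltage);
}

/* Returns the argument as a string literal. */
const char *stringified_name(void)
{
    return SYSCALL_NAME(get_voltage);
}
```

If HV_SYS_get_voltage were not defined, the pasted identifier would fail to compile, which is the same "expects a symbol" behavior as in glue.S.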

In hv/tilepro/intvec.S we have things like

syscall HV_SYS_flush_all,       syscall_flush_all
syscall HV_SYS_get_voltage,     syscall_get_voltage

in an area called the syscall_table with the comment

// System call table.  Note that the entries must be ordered by their
// system call numbers (as defined in syscall.h), but it's OK if some numbers
// are skipped, or if some syscalls exist but aren't present in the table.

where syscall is a Tilera assembler macro:

.macro  syscall number routine
      .org    syscall_table + ((\number) * 4)
      .word   \routine
.endm

And indeed, the use of .org makes sure that the offset of the entry in the syscall table matches the syscall number. The second argument is a symbol, defined elsewhere in the hypervisor sources, for the code that implements the function.
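For readers more at home in C than in assembler, the same idea can be sketched with designated initializers: each handler lands at the index equal to its call number, and skipped numbers are left empty. Everything here is illustrative. The number for flush_all, the table size HV_SYS_MAX, the dispatch function, and the return values are all invented; only the 58 for get_voltage comes from the disassembly later in this post.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*syscall_fn)(void);

/* Assumed numbers for illustration; the real ones are in syscall.h. */
enum { HV_SYS_flush_all = 56, HV_SYS_get_voltage = 58, HV_SYS_MAX = 64 };

/* Placeholder handlers with made-up return values. */
static int syscall_flush_all(void)   { return 0; }
static int syscall_get_voltage(void) { return 1000; }

/* Each handler sits at the slot equal to its call number, which is
 * what the .org directive arranges. Unlisted slots are NULL. */
static const syscall_fn syscall_table[HV_SYS_MAX] = {
    [HV_SYS_flush_all]   = syscall_flush_all,
    [HV_SYS_get_voltage] = syscall_get_voltage,
};

/* Look up a call number and run its handler.
 * Returns -1 when the number is out of range or has no handler. */
int dispatch(unsigned nr)
{
    if (nr >= HV_SYS_MAX || syscall_table[nr] == NULL)
        return -1;
    return syscall_table[nr]();
}
```

One difference from the assembler version: designated initializers can appear in any order, while .org can only move the location counter forward, which is why the comment in intvec.S insists that the entries be ordered.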

In the case of syscall_get_voltage, the code is in hv/tilepro/hw_config.c:

int syscall_get_voltage(void)

So at this point, if something in the linux kernel manages to transfer control to text + 0x10000 + whatever the offset of the code in glue.S is, then a swint2 with argument HV_SYS_get_voltage will be made, which will transfer control in hypervisor mode to the swint2 handler, which will make a function call to syscall_get_voltage in the hypervisor.

But what is the offset in glue.S?

It is whatever you get incrementally by assembling glue.S, but in practice, it had better match the values given in the “public hypervisor interface”, which is defined in hv/include/hv/hypervisor.h.

hv/include/hv/hypervisor.h has things like

/** hv_flush_all */
#define HV_DISPATCH_FLUSH_ALL                     55

/** hv_get_ipi_pte */
#define HV_DISPATCH_GET_IPI_PTE                   56

/* added by QRC */
/** hv_get_voltage */
#define HV_DISPATCH_GET_VOLTAGE               57

and these numbers are similar to, but not identical to, those in syscall.h. Do not confuse them!

Once you add the entries to hypervisor.h, it is a good idea to check them against what is actually in the glue.o file. You can use tile-objdump for this:

tile-objdump -D glue.o

which generates:

00000700 <hv_get_ipi_pte>:
     700:	1fe6b7e070165000	{ moveli r0, -810 }
     708:	081606e070165000 	{ jrp lr }
     710:	400b880070166000 	{ nop ; nop }
     718:	400b880070166000 	{ nop ; nop }
00000720 <hv_get_voltage>:
     720:	1801d7e570165000	{ moveli r10, 58 }
     728:	400ba00070166000 	{ swint2 }
     730:	081606e070165000 	{ jrp lr }
     738:	400b280070165000 	{ fnop }

and if you divide hex 720 by hex 20 you get 57.

I use bc for this sort of mixed-base calculating:

stewart$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.

and we see that we got it right: the linkage number for get_voltage is indeed 57.
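The same check can be written down as a few lines of C, which is handy if you want to verify a whole list of offsets from tile-objdump. This is only a sketch: the macro and function names are invented for the example, and the two constants are the 0x10000 glue start and 0x20 slot size described earlier.

```c
#include <assert.h>
#include <stdint.h>

#define HV_GLUE_START     0x10000u /* glue area offset above the text base */
#define HV_GLUE_SLOT_SIZE 0x20u    /* bytes of instructions per call */

/* Offset of a call's slot from the start of the glue area. */
uint32_t dispatch_to_glue_offset(uint32_t dispatch)
{
    return dispatch * HV_GLUE_SLOT_SIZE;
}

/* Offset of a call's slot from the base of the text area. */
uint32_t dispatch_to_text_offset(uint32_t dispatch)
{
    return HV_GLUE_START + dispatch_to_glue_offset(dispatch);
}

/* Dispatch number for a symbol offset as reported by objdump.
 * Returns -1 if the offset is not on a slot boundary, which would
 * mean something in glue.S is out of place. */
int glue_offset_to_dispatch(uint32_t offset)
{
    if (offset % HV_GLUE_SLOT_SIZE != 0)
        return -1;
    return (int)(offset / HV_GLUE_SLOT_SIZE);
}
```

With the objdump output above, glue_offset_to_dispatch(0x700) gives 56 for hv_get_ipi_pte and glue_offset_to_dispatch(0x720) gives 57 for hv_get_voltage, matching the HV_DISPATCH_ values in hypervisor.h.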

Now let’s turn to Linux. The architecture-dependent stuff for Tilera is in src/sys/linux/arch/tile.

The idea is to build a kernel module that will implement a sysfs interface to the new voltage and frequency calls.

The module get and set routines will call hv_set_voltage and hv_get_voltage.

The hypervisor call linkage is done by linker magic, via a file arch/tile/kernel/, which is a linker script. In other words, the kernel has no definitions for these hv_ symbols, they are defined at link time by the linker script. For each hv call, it has a line like

hv_get_voltage = TEXT_OFFSET + 0x10720;

and you will recognize our friend 0x720 as the offset of this call in the hypervisor linkage area. Unfortunately, this doesn’t help with a separately compiled module, because it doesn’t have a way to use such a script. (When I try it, TEXT_OFFSET is undefined; presumably that is part of the kernel main linker script.)

So to make a hypervisor call from a loadable module, you need a trampoline. I put them in arch/tile/kernel/qrc_extra.c, like this

int qrc_hv_get_voltage(void)
{
  int v;
  printk("Calling hv_get_voltage()\n");
  v = hv_get_voltage();
  printk("hv_get_voltage returned %d\n", v);
  return v;
}
EXPORT_SYMBOL(qrc_hv_get_voltage);

The EXPORT_SYMBOL is needed to let modules use the function.

But where did come from? It turns out it is not in any Makefile, but rather is made by a perl script in sys/hv/, except that this perl script optionally writes assembler or linker script output and I had to modify it to select the right branch. The modified version is and is invoked like this:

perl ../../hv/ ../../hv/include/hv/hypervisor.h >

The hints for this come from the sys/bogux/Makefile which does something similar for the bogux example supervisor.

linux/arch/tile/include/hv/hypervisor.h is a near copy of sys/hv/include/hv/hypervisor.h, but they are not automatically kept in sync.

Somehow I think that adding hypervisor calls is not a frequently exercised path.

To recap, you need to:

  • have the crazy idea to add hypervisor calls to change the chip voltage at runtime
  • edit hypervisor.h to choose the next available hv call number
  • edit glue.S to add, in just the right place, a macro call which will have the right offset in the file to match the hv call number
  • edit syscall.h to create a similar number for the SWINT2 interrupt dispatch table
  • edit intvec.S to add the new entry to the SWINT2 dispatch table
  • create the subroutine to actually be called from the dispatch table
  • run the magic perl script to transform hypervisor.h into an architecture dependent linker script to define the symbols for the new hv calls in the linux kernel
  • add trampolines for the hv calls in the linux kernel so you can call them from a loadable module
  • write a kernel module to create sysfs files and turn reads and writes into calls on the new trampolines
  • write blog entry about the above


Another thing not to do

At the day job, I’ve been writing a new version of nbd-client.  Instead of handing an open TCP socket to the kernel, it hands the kernel one end of a Unix domain socket and keeps the other end for itself.  This creates a block device where the data is managed by a user mode program on the same system.

In regular nbd-client, the last thing the program does is call ioctl(fd, NBD_DO_IT), which doesn’t return.  The thread is used by the device driver to read and write the socket without blocking other activity in the block layer.

Because I need the program around to do other work, I called pthread_create to make a thread to call the ioctl.

Then I ran my program under gdb (as root!).

In another window, I typed dd if=/dev/nbd0 bs=4096 count=1

In the gdb window I saw

nbd-userland.c:525: server_read_fn: Assertion `0' failed.

and my dd hung, and the gdb hung, and neither could be killed by ^C

I was able to get control back by using the usual big hammer, kill -9 <gdb>

So what happened?  My user mode thread hit an assertion and gave control to gdb, which tried to halt the other threads in the process.  That didn’t work, because the thread in the middle of the ioctl was in the middle of something uninterruptible, and the gdb thread trying to stop it also became uninterruptible while waiting.

It is going to be hard to debug this program like this.

The fix, however, is fairly clear:  use fork(2) instead of pthread_create() to create a separate process to call the ioctl. It will be isolated from the part of the program hitting the assertion.
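A minimal sketch of the fork version looks something like this. The names spawn_worker, wait_for_worker and sample_worker are invented for the example, and sample_worker is only a stand-in: in the real program the worker would be the function that calls ioctl(fd, NBD_DO_IT) and never comes back.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*worker_fn)(void *arg);

/* Stand-in for the blocking call. It just returns the int it is given. */
int sample_worker(void *arg)
{
    return *(int *)arg;
}

/* Fork a child process that runs fn(arg) and exits with its result.
 * Returns the child's pid in the parent, or -1 if fork failed.
 * The child never returns from this function. */
pid_t spawn_worker(worker_fn fn, void *arg)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(fn(arg) & 0xff); /* _exit skips the parent's atexit handlers */
    return pid;
}

/* Wait for the child and return its exit code, or -1 if it did not
 * exit normally. */
int wait_for_worker(pid_t pid)
{
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent keeps running after spawn_worker returns, so it can go on to do the other work. By default gdb stays with the parent after a fork and detaches from the child, so an assertion in the parent no longer makes gdb try to stop the process that is stuck inside the ioctl.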

Older and wiser,


By the way, when you are trying to figure out where processes are stuck, look at the “wchan” field of ps axl.  It will be a kernel symbol that will give you a clue about what the thread is waiting for.


Experience is what lets you recognize a mistake when you make it again.

The underlying bug was sending too much data on the wire.  Like this:

struct network_request_header {
    uint64_t offset;
    uint32_t size;
};

write(fd, net_request, sizeof(struct network_request_header));

Well, no.  sizeof(struct network_request_header) turns out to be 16, rather than, say, 12.  If you think about it, this makes perfect sense, because otherwise an array of these things would have unaligned uint64_t’s every other time.  You can’t do network I/O this way, especially if the program on the other end uses a different language or different compiler.

gcc, it turns out, has a feature:  __attribute__((packed)) that makes this work, but it is not portable to other compilers.
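A portable alternative is to never write the struct itself, and instead copy each field into a byte buffer of exactly the size the wire format calls for. Here is a sketch under some assumptions: the function names are made up, the 12-byte layout is the offset followed by the size with no padding, and I have picked big-endian byte order, which may not be what the program on the other end expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIRE_HEADER_SIZE 12 /* 8-byte offset + 4-byte size, no padding */

/* Store the low nbytes of value in big-endian order,
 * most significant byte first. */
static void put_be(uint8_t *out, uint64_t value, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        out[i] = (uint8_t)(value >> (8 * (nbytes - 1 - i)));
}

/* Serialize the request header field by field.
 * Returns the number of bytes to hand to write(). */
size_t pack_request_header(uint8_t out[WIRE_HEADER_SIZE],
                           uint64_t offset, uint32_t size)
{
    put_be(out, offset, 8);
    put_be(out + 8, size, 4);
    return WIRE_HEADER_SIZE;
}
```

The caller then does write(fd, buf, pack_request_header(buf, offset, size)), and the number of bytes on the wire no longer depends on how any particular compiler pads the struct.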