Hypervisor Hijinks

At my office, we have a rack full of Tilera 64-core servers, 120 of them. We use them for some interesting video processing applications, but that is beside the point. Having 7680 of something running can magnify small failure rates to the point that they are worth tracking down. Something that might take a year of runtime to show up can show up once an hour on a system like this
Some of the things we see tend, with some slight statistical flavor, to occur more frequently on some nodes than on others. That just might make you think that we have some bad hardware. Could be. We got to wondering whether running the systems at slightly higher core voltages would make a difference, and indeed, one can configure such a thing, but basically you have to reprogram the flash bootloaders on 120 nodes. The easiest thing to do was to change both the frequency and the voltage, which isn’t the best thing to do, but it was easy. The net effect was to reduce the number of already infrequent faults on the nodes where they occurred, but to cause, maybe, a different sort of infrequent fault on a different set of nodes.
Yow. That is NOT what we wanted.
We were talking about this, and I said about the stupidest thing I’ve said in a long time. It was, approximately:

I think I can add some new hypervisor calls that will let us change the core voltage and clock frequency from user mode.

This is just a little like rewiring the engines of an airplane while flying, but if it were possible, we could explore the infrequent fault landscape much more quickly.
But really, how hard could it be?
Tilera, to their great credit, supplies a complete Multicore Development Environment which includes the linux kernel sources and the hypervisor sources.
The Tilera version of Linux has a fairly stock kernel which runs on top of a hypervisor that manages physical chip resources and such things as TLB refills. There is also a hypervisor “public” API, which is really not that public, it is available to the OS kernel. The Tilera chip has 4 protection rings. The hypervisor runs in kernel mode. The OS runs in supervisor mode, and user programs can run in the other two. The hypervisor API has things like load this page table context, or flush this TLB entry, and so forth.
As part of the boot sequence, one of the things the hypervisor does is to set the core voltage and clock frequency according to a little table it has. The voltage and frequency are set together, and the controls are not accessible to the Linux kernel or to applications. Now it is obviously possible to change the values while running, because that is what the boot code does. What I needed to do was to add some code to the hypervisor to get and set the voltage and frequency separately, while paying attention to the rules implicit in the table. There are minimum and maximum voltages and frequencies beyond which the chip will stop working, and there are likely values that will cause permanent damage. There is also a relation between the two – generally higher frequencies will require higher voltages. Consequently it is not OK to set the frequency too high for the current voltage, or to set the voltage too low for the current frequency.
Fine. Now I have subroutine calls inside the hypervisor. In order to make them available to a user mode program running under Linux, I have to add hypervisor calls for the new functions, and then add something like a loadable kernel module to Linux to call them and to make the functionality available to user programs.
The kernel piece is sort of straightforward. One can write a loadable kernel module that implements something called sysfs. These are little text files in a directory like /sys/kernel/tilera/ with names like “frequency” and “voltage”. Through the magic of sysfs, when an application writes a text string into one of these files, a piece of code in the kenel module gets called with the string. When an application reads one of these files, the kernel module gets called to provide the text.
Now, with the kernel module at the top, and the new subroutines in the hypervisor at the bottom, all I need to do is wire them together by adding new hypervisor calls.
Hypervisor calls make by linux are done by hypervisor glue. The glue area starts at 0x10000 above the base of the text area, and each possible call has 0x20 bytes of instructions available.
Sometimes, such as “nanosleep”, the call is implemented inline in those 0x20 bytes. Mostly, the code in the glue area loads a register with a call number and does a software interrupt.
The code that builds the glue area is is hv/tilepro/glue.S.
For example, the nanosleep code is

hv_nanosleep:
        /* Each time through the loop we consume three cycles and
         * therefore four nanoseconds, assuming a 750 MHz clock rate.
         *
         * TODO: reading a slow SPR would be the lowest-power way
         * to stall for a finite length of time, but the exact delay
         * for each SPR is not yet finalized.
         */
        {
          sadb_u r1, r0, r0
          addi r0, r0, -4
        }
        {
          add r1, r1, r1 /* force a stall */
          bgzt r0, hv_nanosleep
        }
        jrp lr
        fnop
while most others are

GENERIC_SWINT2(set_caching)
or the like. where GENERIC_SWINT2 is a macro:

#define GENERIC_SWINT2(name)
        .align ALIGN ;
hv_##name:
        moveli TREG_SYSCALL_NR_NAME, HV_SYS_##name ;
        swint2 ;
        jrp lr ;
        fnop
The glue.S source code is written in a positional way, like
GENERIC_SWINT2(get_rtc)
GENERIC_SWINT2(set_rtc)
GENERIC_SWINT2(flush_asid)
GENERIC_SWINT2(flush_page)

so the actual address of the linkage area for a particular call like flush_page depends on the exact sequence of items in glue.S. If you get them out of order or leave a hole, then the linkage addresses of everything later will be wrong. So to add a hypercall, you add items immediately after the last GENERIC_SWINT2 or ILLEGAL_SWINT2
In the case of the set_voltage calls we have:

ILLEGAL_SWINT2(get_ipi_pte)
GENERIC_SWINT2(get_voltage)
GENERIC_SWINT2(set_voltage)
GENERIC_SWINT2(get_frequency)
GENERIC_SWINT2(set_frequency)

With this fixed point, we work in both directions, down into the hypervisor to add the call and up into linux to add something to call it.
Looking back at the GENERIC_SWINT2 macro, it loads a register with the value of a symbol like HV_SYS_##name where name is the argument to GENERIC_SWINT2. This is using the C preprocessor stringification operator ## that concatenates. So

GENERIC_SWINT2(get_voltage)

expects a symbol named HV_SYS_get_voltage. IMPORTANT NOTE – the value of this symbol has nothing to do with the hypervisor linkage area, it is only used in the swint2 implementation. The HV_SYS_xxx symbols are defined in hv/tilepro/syscall.h and are used by glue.S to build the code in the hypervisor linkage area and also used by hv/tilepro/intvec.S to build the swint2 handler.
In hv/tilepro/intvec.S we have things like

 syscall HV_SYS_flush_all,	      syscall_flush_all
syscall	HV_SYS_get_voltage,	            syscall_get_voltage

in an area called the syscall_table with the comment

// System call table.  Note that the entries must be ordered by their
// system call numbers (as defined in syscall.h), but it's OK if some numbers
// are skipped, or if some syscalls exist but aren't present in the table.

where syscall is a Tilera assembler macroL

.macro  syscall number routine
      .org    syscall_table + ((number) * 4)
      .word   routine
      .endm

And indeed, the use of .org makes sure that the offset of the entry in the syscall table matches the syscall number. The second argument is a symbol elsewhere in the hypervisor sources of code that implements the function.
In the case of syscall_get_voltage, the code is in hv/tilepro/hw_config.c:

int syscall_get_voltage(void)
{
  return(whatever);
}

So at this point, if something in the linux kernel manages to transfer control to text + 0x10000 + whatever the offset of the code in glue.S is, then a swint2 with argument HV_SYS_get_voltage will be made, which will transfer control in hypervisor mode to the swint2 handler, which will make a function call to syscall_get_voltage in the hypervisor.
But what is the offset in glue.S?
It is whatever you get incrementally by assembling glue.S, but in practice, it had better match the values given in the “public hypervisor interface” which is defined in hv/include/hv/hypervisor.h
hv/include/hv/hypervisor.h has things like

/** hv_flush_all */
#define HV_DISPATCH_FLUSH_ALL                     55
#if CHIP_HAS_IPI()
/** hv_get_ipi_pte */
#define HV_DISPATCH_GET_IPI_PTE                   56
#endif
/* added by QRC */
/** hv_get_voltage */
#define HV_DISPATCH_GET_VOLTAGE               57

and these numbers are similar to, but not identical to thos in syscall.h. Do not confuse them!
Once you add the entries to hypervisor.h, it is a good idea to check them against what is actually in the glue.o file. You can use tile-objdump for this:

tile-objdump -D glue.o

which generates:

...
00000700 <hv_get_ipi_pte>:
     700:	1fe6b7e070165000	{ moveli r0, -810 }
     708:	081606e070165000 	{ jrp lr }
     710:	400b880070166000 	{ nop ; nop }
     718:	400b880070166000 	{ nop ; nop }
00000720 <hv_get_voltage>:
     720:	1801d7e570165000	{ moveli r10, 58 }
     728:	400ba00070166000 	{ swint2 }
     730:	081606e070165000 	{ jrp lr }
     738:	400b280070165000 	{ fnop }
...

and if you divide HEX 720 by HEX 20 you get
I use bc for this sort of mixed-base calculating:

stewart$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
ibase=16
720/20
57
^Dstewart$

and we see that we got it right, the linkage number for get_voltage is indeed 57
Now let’s turn to Linux. The archtecture dependent stuff for Tilera is in src/sys/linux/arch/tile
The idea is to build a kernel module that will implement a sysfs interface to the new voltage and frequency calls.
The module get and set routines will call hv_set_voltage and hv_get_voltage.
The hypervisor call linkage is done by linker magic, via a file arch/tile/kernel/hvglue.lds, which is a linker script. In other words, the kernel has no definitions for these hv_ symbols, they are defined at link time by the linker script. For each hv call, it has a line like

hv_get_voltage = TEXT_OFFSET + 0x10740;

and you will recognize our friend 0x740 as the offset of this call in the hypervisor linkage area. Unfortunately, this doesn’t help with a separatley compiled module because it doesn’t have a way to use such a script (when I try it, TEXT_OFFSET is undefined, presumably that is part of the kernel main linker script. )
So to make a hypervisor call from a loadable module, you need a trampoline. I put them in arch/tile/kernel/qrc_extra.c, like this

int qrc_hv_get_voltage(void)
{
  int v;
  printk("Calling hv_get_voltage()n");
  v = hv_get_voltage();
  printk("hv_get_voltage returned %dn", v);
  return(v);
}
EXPORT_SYMBOL(qrc_hv_get_voltage);

The EXPORT_SYMBOL is needed to let modules use the function.
But where did hvglue.lds come from? It turns out it is not in any Makefile, but rather is made by a perl script in sys/hv/mkgluesyms.pl, except that this perl script optionally writes assembler or linker script output and I had to modify it to select the right branch. The modified version is mkgluesymslinux.pl and is invoked like this:

perl ../../hv/mkgluesymslinux.pl ../../hv/include/hv/hypervisor.h >hvglue.lds

The hints for this come from the sys/bogux/Makefile which does something similar for the bogux example supervisor.
linux/arch/tile/include/hv/hypervisor.h is a near copy of sys/hv/include/hv/hypervisor.h, but they are not automatically kept in sync.
Somehow I think that adding hypervisor calls is not a frequently exercised path.
To recap, you need to:

  • have the crazy idea to add hypervisor calls to change the chip voltage at runtime
  • edit hypervisor.h to choose the next available hv call number
  • edit glue.S to add, in just the right place, a macro call which will have the right offset in the file to match the hv call number
  • edit syscall.h to create a similar number for the SWINT2 interrupt dispatch table
  • edit intvec.S to add the new entry to the SWINT2 dispatch table
  • create the subroutine to actually be called from the dispatch table
  • run the magic perl script to transform hypervisor.h into an architecture dependent linker script to define the symbols for the new hv calls in the linux kernel
  • add trampolines for the hv calls in the linux kernel so you can call them from a  loadable module.
  • write a kernel module to create sysfs files and turn reads and writes into calls on the new trampolines
  • write blog entry about the above