At the day job, I’ve been writing a new version of nbd-client. Instead of handing an open tcp socket to the kernel, it hands the kernel one end of a unix domain socket and keeps the other end for itself. This creates a block device where the data is managed by a user mode program on the same system.
In regular nbd-client, the last thing the program does is call ioctl(fd, NBD_DO_IT), which doesn’t return. The thread is used by the device driver to read and write the socket without blocking other activity in the block layer.
Because I need the program around to do other work, I called pthread_create to make a thread to call the ioctl.
Then I ran my program under gdb (as root!).
In another window, I typed dd if=/dev/nbd0 bs=4096 count=1
In the gdb window I saw
nbd-userland.c:525: server_read_fn: Assertion `0' failed.
and my dd hung, and the gdb hung, and neither could be killed by ^C
I was able to get control back by using the usual big hammer, kill -9 <gdb>
So what happened? My user mode thread hit an assertion and gave control to gdb, which tried to halt the other threads in the process. That didn’t work, because the thread calling the ioctl was in the middle of something uninterruptible, and the gdb thread trying to stop it also became uninterruptible while waiting.
It is going to be hard to debug this program like this.
The fix, however, is fairly clear: use fork(2) instead of pthread_create(), so that a separate process calls the ioctl. It will be isolated from the part of the program hitting the assertion.
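Here is a minimal sketch of the fork-based approach, assuming Linux with the kernel headers installed. The helper names run_in_child, nbd_do_it and child_exit_code are made up for illustration and are not from the real program. The idea is only that the call which never returns lives in a child process, so a debugger stopping the parent never has to halt a task stuck inside the ioctl.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/nbd.h>   /* NBD_DO_IT */

/* Hypothetical callback: arg points at an already configured nbd device fd.
 * On a working device this call does not return until the device is torn down. */
int nbd_do_it(void *arg)
{
    int fd = *(int *)arg;
    return ioctl(fd, NBD_DO_IT);
}

/* Hypothetical helper: run a possibly never-returning call in a child process.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t run_in_child(int (*blocking_call)(void *), void *arg)
{
    assert(blocking_call != NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only gets past this line if the call fails or the device goes away. */
        int rc = blocking_call(arg);
        _exit(rc == 0 ? 0 : 1);
    }
    return pid;  /* Parent: free to go on serving the other end of the socket. */
}

/* Hypothetical helper: reap the child and report its exit code, or -1. */
int child_exit_code(pid_t pid)
{
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the real program the parent would call run_in_child(nbd_do_it, &nbd_fd) after setting up the socket pair, then carry on with its own work. Passing a bad descriptor makes the ioctl fail immediately, which is a handy way to exercise the plumbing without root or a real /dev/nbd0.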
Older and wiser,
Larry
By the way, when you are trying to figure out where processes are stuck, look at the “wchan” field of ps axl. It will be a kernel symbol that will give you a clue about what the thread is waiting for.
UPDATE
Experience is what lets you recognize a mistake when you make it again.
The underlying bug was sending too much data on the wire. Like this:
struct network_request_header {
uint64_t offset;
uint32_t size;
};
write(fd, net_request, sizeof(struct network_request_header));
Well, no. sizeof(struct network_request_header) turns out to be 16, rather than, say, 12. If you think about it, this makes perfect sense, because otherwise an array of these things would have unaligned uint64_t’s every other time. You can’t do network I/O this way, especially if the program on the other end uses a different language or different compiler.
gcc, it turns out, has a feature: __attribute__((packed)) that makes this work, but it is not portable to other compilers.
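A portable alternative is to skip the struct layout entirely and copy each field into a byte buffer by hand. The sketch below does that, assuming a 12-byte wire format with both fields in big-endian (network) byte order. The byte order, WIRE_HEADER_SIZE and encode_request_header are my assumptions for illustration, not something the original protocol specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct network_request_header {
    uint64_t offset;
    uint32_t size;
};

/* Assumed wire size: 8 bytes of offset + 4 bytes of size, no padding. */
#define WIRE_HEADER_SIZE 12

/* Hypothetical encoder: writes the header field by field, most significant
 * byte first, so the result does not depend on compiler padding or host
 * endianness. Returns the number of bytes written, or 0 if buf is too small. */
size_t encode_request_header(const struct network_request_header *h,
                             unsigned char *buf, size_t buflen)
{
    assert(h != NULL && buf != NULL);

    if (buflen < WIRE_HEADER_SIZE)
        return 0;

    for (int i = 0; i < 8; i++)
        buf[i] = (unsigned char)(h->offset >> (56 - 8 * i));
    for (int i = 0; i < 4; i++)
        buf[8 + i] = (unsigned char)(h->size >> (24 - 8 * i));

    return WIRE_HEADER_SIZE;
}
```

The sender would then write the encoded buffer with the returned length, rather than sizeof(struct network_request_header), and the receiver would decode the same 12 bytes whatever language or compiler it was built with.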