I answered this question on Quora, but moderation deleted it. I guess that is because it references SiCortex, which has been shut down since 2009.
I am afraid I may be guilty of a little bit of pride here.
My information is also dated.
In 2007, the 5832-core, 972-node SiCortex SC5832 could boot and be ready to run jobs in 7 minutes if the system support processor (SSP), a Linux server, was already running. From power off it would take about 9 or 10 minutes, with the extra time taken for the SSP to boot.
At the time, we had heard horror stories about clusters taking “hours” to boot, such that sysadmins were very reluctant to update software because it would take so long.
Earlier, in 2004 when we started the company, John Mucci asked me and the software team how long it was going to take to boot, and we said “5 minutes” to considerable eyebrow raising from people with more experience. Honestly we were guessing, but we couldn’t think of reasons why it should take longer.
Fast forward two years and we had to deliver. The machine was 36 boards, each with 27 6-core nodes and a little embedded Coldfire processor called the module support processor. The 972 nodes had no storage at all, so we had to boot over JTAG and load small ramdisk images. Then we had to initialize the high speed network, NFS mount the real root filesystem, and bring up the job control system. After a couple months of heroic efforts, we got it down to 7 minutes.
This was extraordinary in the industry, but we still got a lot of good natured razzing from the rest of the company for missing our 5 minute estimate.
To the software team, the most amusing part of the whole affair was that the hardware and software proved so reliable that we had several systems in the field with uptime over a year. With uptime like that it doesn’t really matter whether it takes 5 minutes to boot or 7 minutes or an hour for that matter.
According to the Obama administration, between 2009 and 2015, 473 drone strikes killed about 2500 combatants and about 100 non combatants.
Last week, the Dallas Police department used a robot to kill the police shooter.
As far as I know, all of these events have had human operators, supposedly exercising human judgement.
The thing is, many reports about drones and robots leave one with the impression that these are autonomous devices, without a human in the loop. It isn’t like that.
I do not think there is a real difference between a sniper on a hilltop killing from a mile away and a drone operator killing from 10,000 miles away. Both have a human pulling the trigger. We can and should talk about ways to further reduce non-combatant deaths, but sniper rifles and drones are much safer for our guys than bayonets and hand grenades.
The real discussion ought to be about autonomous vs human-in-the-loop.
The unfortunate fact is, we already have lots of truly autonomous devices killing people on their own initiative. They are called land mines.
About two weeks a year, it gets hot enough and humid enough here in Massachusetts to push us into turning on the air conditioning.
For the first few years of the century, after the house was built and we moved in, everything was fine, but in recent years not so much. We have different AC zones, and separate systems for each. Each year, typically, one or two of the units don’t work. Not work as in blow hot air instead of cold. I then go outside around back and discover that the fan in the outside unit isn’t spinning. Until last year, I’ve always been able to fix this problem by reaching through the grille and unsticking the fan with a screwdriver, or in the worst case, by taking the fan and motor off and whaling on it with a hammer. Evidently, enough moisture gets into the motor bearings over the winter to seize them beyond what the motor’s starting torque can overcome.
Brief Digression on AC
Air conditioners work by expanding a high pressure gas or fluid like freon through a nozzle into a low pressure gas. As a consequence of the ideal gas law, the expanding gas gets cold. It is then run through a heat exchanger inside the house, where the cold gas absorbs heat from the room air. (There is usually a fan to push the room air through the radiator fins of the heat exchanger.) The expanded gas is then piped outside to a compressor. The compressor squeezes the working fluid, which according to the gas law, heats it up. Because heat was absorbed from the room, the compressed gas is now hotter than it was originally. It is then run through the outside heat exchanger, where a fan blows warm outside air past it to absorb the heat from the (hot) compressed gas. (I am using “gas” and “working fluid” interchangeably here. In fact, I think freon is one of those things that turns into a liquid at high pressure, so there is a phase change involved as well.) If the outside fan doesn’t work, then there is nothing to cool off the compressed gas, and the whole outside unit eventually gets so hot that the thermal overload switch in the compressor shuts it off. This is why fixing the outside fan fixes the whole AC.
Well last year, one unit’s fan wasn’t spinning, but wasn’t stuck either. There are only three reasons why that could be: no power, bad motor, or bad capacitor. I was able to measure that the power was present, and it was cheaper to replace the capacitor, and that fixed it. Except that my measurements seemed to indicate there was nothing wrong with the old capacitor. I had fixed a loose push-on connector, so I wrote off the experience.
This year, same problem, same unit. The motor was not stuck, but wasn’t spinning either.
Brief digression about induction motors
Electric motors work by having a spinning magnet (the rotor) driven by a stationary magnet (the field). Now the magnets are going to want to line up north pole opposite south pole, and stay that way, so there also has to be something that makes “north” spin. Some motors have the rotor or the field be a permanent magnet with the other being an electromagnet, while other motors have electromagnets for both field and rotor. If the rotor is an electromagnet, there will often be brushes to supply power to the rotor. An induction motor is kind of strange, in that both the field and the rotor are electromagnets, but the power for the rotor is supplied by induction, with no physical connection.
A three phase induction motor is fairly easy to understand. The field has three windings, fed by the three phases. They are rotated with respect to one another by 120 degrees. As the current in phase “A” dies down, the current in phase “B” is picking up, and as a consequence the direction of North in the field windings rotates by 120 degrees. With three phases, you get a nice rotating field, and the rotor follows it, with just enough lag to generate an induced current in the rotor to create the rotor magnetic field. A single phase induction motor is different: the field merely reverses 120 times a second. If the rotor is spinning, then it will keep spinning, but there is nothing to get it started! To solve this problem, single phase induction motors have a capacitor. The capacitor is connected in series with another field winding that is rotated with respect to the main winding. Due to the properties of capacitors, the current in this starting winding will be advanced with respect to the current in the main winding. This gives enough of a rotating field to get the rotor started spinning. In fact, if you have an open circuit starting capacitor, you can sometimes start the motor by hand by giving it a spin yourself.
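Here is a little sketch of why the extra winding matters. It is an idealized model, not a motor simulation: I am assuming the auxiliary winding sits exactly 90 degrees from the main winding, carries the same amplitude of current, and that the capacitor gives a full 90 degree lead. Real motors get less than that. The function names are mine.

```python
import math

def field_vector(t, freq_hz=60.0, aux_lead_rad=None):
    """Return the (x, y) field at time t, in arbitrary units.

    The main winding lies on the x axis. The auxiliary (capacitor) winding,
    if present, lies on the y axis and carries a current that leads the
    main current by aux_lead_rad (an idealized assumption).
    """
    wt = 2 * math.pi * freq_hz * t
    x = math.cos(wt)
    y = 0.0 if aux_lead_rad is None else math.cos(wt + aux_lead_rad)
    return x, y

def direction_deg(vec):
    """Direction of the field vector in degrees, from 0 up to 360."""
    return math.degrees(math.atan2(vec[1], vec[0])) % 360

# Step through one 60 Hz cycle in eighths and compare the two cases.
for k in range(8):
    t = k / (60.0 * 8)
    single = field_vector(t)
    split = field_vector(t, aux_lead_rad=math.pi / 2)
    print(f"{k}/8 cycle  main winding only: {direction_deg(single):5.1f} deg"
          f"  with capacitor winding: {direction_deg(split):5.1f} deg"
          f"  strength {math.hypot(*split):.2f}")
```

With only the main winding, the direction just flips back and forth between two opposite headings. With the phase-shifted second winding, the direction sweeps steadily around the circle at constant strength, which is what gives the rotor something to chase.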
Because it seemed really unlikely that the new capacitor failed over the winter, I resolved to replace the motor. The problem was that I could not get the fan off the motor shaft! The steel shaft was pretty well rusted together with the steel fan hub into a single glob. Repeated application of WD40 and hammers and so forth did nothing. By suitable pounding, I could move the fan axially towards the motor. By supporting the fan and pounding on the shaft, I could move it back, but hammering on the shaft was mushrooming the end of the shaft, so there would be no way to get the fan off. The usual tool for this problem is a gear puller, but a two-fingered gear puller won’t work with a three bladed fan. I have some nice pipe wrenches with which to twist the shaft against the hub, but the fan was too close to the motor for the wrench to fit, and the motor shaft didn’t come out the other end of the motor.
My solution to this is somewhat destructive! I used my angle grinder with a metal cutting wheel to take the motor apart. By grinding off six rivets I was able to get the back of the motor off, but there was nothing to grab with the wrench. I then used the cutting wheel to cut all the way around the fan end of the motor housing, at which point the field assembly came off, revealing the rotor. I could then grab the rotor with one wrench and the fan hub with the other and twist them apart.
This whole exercise was destructive and messy, and no doubt a new fan would be less trouble overall, but it sure was fun.
Fifteen years ago when we built our house, we had a home security system installed. It has the usual alarm panel with a keypad inside the door. When you come in the house, you have 30 seconds to key in your password to stop the alarm from going off.
If the alarm does go off, the monitoring company will call you to find out if it was a mistake or a real alarm. Each authorized user has a passcode to authenticate themselves to the monitoring company. You can’t have the burglar answering the phone “No problem here! False alarm…”
In fact, there are two passcodes, one authenticates you, and the other is a duress password. If the burglar is there with you, you use the duress password, and the monitoring company behaves exactly the same way, but they also call the local police for you. It is important that the burglar cannot tell the difference.
It seems to me that ATM cards should have duress PINs as well as real ones. If a criminal says “type in your ATM pin or else” then fine, you enter the duress PIN. The ATM behaves exactly the same way, but the bank alerts the police and sends them the surveillance video.
Duress passwords have a lot of other potential uses. If your school principal demands your Facebook password, you give up your duress password. What happens next could depend on which password you give. At the extreme, your whole account could be deleted. It could be archived on servers out of legal jurisdiction, or your stuff visible only to friends could seem not to exist for a week. Whatever. Options that appear not to do anything are best, because then the school admins can’t tell you have disobeyed them and suspend you.
While I am riffing, there should be a phrase you can say, like “I do not consent to this search” or a similar account setting, that makes the administrator’s access an automatic CFAA violation. (I think the CFAA should be junked, but if not, it should be used to the user’s benefit, not just the man’s.)
Finally, regarding authentication, there should also be two-factor authentication for everything, and single-use passwords for everything. Why not? Everyone has a nice computing device with them at all times. Of course your phone and the authentication app should have a duress unlock code.
So next time you are building an authentication structure, build in support for one-time passwords, two factor authentication, and a flexible set of duress passwords.
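To make the duress idea concrete, here is a minimal sketch of a login check with a real PIN and a duress PIN. Everything in it is hypothetical: the Account class, the authenticate function, and the alert callback are names I made up for illustration, and a real system would need rate limiting, storage, and so on. The point it shows is that both PINs succeed in exactly the same way as far as the caller can see, and only the duress PIN trips the silent alarm.

```python
import hashlib
import hmac
import os

def _hash_pin(pin, salt):
    """Slow salted hash so stored PINs are not kept in the clear."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode(), salt, 100_000)

class Account:
    """Hypothetical account record holding a real PIN and a duress PIN."""
    def __init__(self, real_pin, duress_pin):
        self.salt = os.urandom(16)
        self.real = _hash_pin(real_pin, self.salt)
        self.duress = _hash_pin(duress_pin, self.salt)

def authenticate(account, pin, alert):
    """Return True for either PIN; call alert() only for the duress PIN."""
    candidate = _hash_pin(pin, account.salt)
    # Do both comparisons every time, so timing does not hint at which matched.
    is_real = hmac.compare_digest(candidate, account.real)
    is_duress = hmac.compare_digest(candidate, account.duress)
    if is_duress:
        alert()  # silent side effect, e.g. notify the monitoring company
    return is_real or is_duress

alerts = []
acct = Account("1234", "4321")
print(authenticate(acct, "1234", lambda: alerts.append("call police")), alerts)
print(authenticate(acct, "4321", lambda: alerts.append("call police")), alerts)
```

The return value is the same for both PINs on purpose. Anything the person standing next to you could observe has to be identical, or the duress code is worse than useless.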
It has finally gotten cold here. Right now it is about 17F outside. Previously we had been getting by with just the heating zones for the kitchen/family room and the master bedroom turned on.
A few days ago, the boys had trouble getting to sleep while we were watching TV, because the noise from the set was keeping them up. Alex closed the door. The next morning, I noticed it was 55F in their room. Well, I reasoned, the heating zone up there is not turned on, and with the door shut, warm air from the rest of the house can’t get in so easily. I turned on the heat. The next night Alex happened to close the door again, and in the morning it was 52F. That isn’t so good.
Friday we had the neighbors over for dinner so I turned on the dining room heat. A couple hours later I went to check on it and it wasn’t any warmer.
This is our heating system. This is a gas fired hot water system. The “boiler” is the green box on the lower left. It heats water to 160F or so. From there, there are 9 heating zones. The horizontal pipe manifold in the front is the return path to the boiler. The vertical pipes with yellow shutoffs are the returns for each zone. The supply manifold is behind, along with the pumps and so forth. One zone heats water in the blue tank for domestic hot water faucets and showers. The other zones have circulating pumps that feed tubing that zigzags under the floors. This is called radiant heating.
Each zone typically has a manifold like this one that routes hot water through synthetic rubber tubes that are stapled to the underside of the floors, and insulated below that to direct their heat upwards. This lets you walk around on warm floors and actually get by with colder air temperatures. Our oldest daughter was in the habit of leaving the next day’s clothes on the floor covered with a blanket, so they would be prewarmed in the morning. Notice that one tube is turned off. That one runs underneath the kitchen pantry, which we try to keep colder.
In the main system photo, on the left, you can see electronics boxes on the wall. Here’s a closeup.
Each zone has a thermostat, which comes into one of these boxes. This is a three channel box, with three 24 volt thermostats coming in on brown wires at the top, and red wiring for three 120 volt zone circulator pumps at the bottom. The box also signals the main boiler that heat is being called by at least one zone. Each zone has a plug in relay, one of which I have unplugged.
The circulator pumps look like this
So there is a central gas water heater, which feeds a number of zones. Each zone has a water circulation pump, controlled by a thermostat. The pump feeds hot water through rubber tubes on the underside of the floors.
Individual zones have failed before. I have fixed them by replacing the circulator pump. You can get these anywhere.
The hardest part about replacing these is the electrical wiring, which is hardwired by wirenuts in the green box attached to the pump. First, turn off the power. I did this by physically pulling the relay for the appropriate zone. Then I measure the pump current using a clamp on ammeter. Then I measure the voltage. Only then do I unscrew the wirenuts protecting the wires, and without touching the bare wires, touch the end to ground. Then brush the wire with the back of your hand only. If the wire is live, the electricity will contract your arm muscles, pulling your hand away. If you can’t think of at least four ways to make sure the wires are not live, hire someone to do this for you. Really. There are old electricians, and there are bold electricians. There are no old, bold electricians. I am an old electrician.
Our system has shutoff valves immediately on both sides of the pump. By turning those off, you can swap out the pump without draining all the water out of the system. As you can see in the picture, the pump is held in place by flanges at the inlet (bottom) and outlet (top). Each flange has two stainless steel bolts, so they won’t rust. In a burst of cleverness or good design, the nuts on these bolts are 11/16 and the bolts themselves are 5/8, so you can take them apart with only one set of wrenches. Here’s the pump I removed.
Note the corrosion inside the pump. I put the new pump in place and turned this zone back on, and now the dining room was getting heat. While I was down there, I took a look at this thing.
This is an air removal valve. It is installed on top of the boiler, along with a pressure relief valve. On some intuition, I lifted the pressure relief valve toggle, and air came out, followed by water. That is not good. The water for a heating system like this comes from town water, which has dissolved gas in it. Typically this will be air, although in the Marcellus Shale areas it can be natural gas (in those areas, you can set your sink on fire). Air is bad for forced hot water systems. It corrodes the inside of the pipes, and water pumps won’t pump air, usually. If the radiant tubes get full of air, they will not be heating. By the way, these pipes are so rusty because some years ago the boiler was overheating to the point that the relief valve was opening, getting water everywhere. This was because the temperature sensor had come unstuck from the pipe it was measuring. Fixed by a clever plumber with a stainless pipe clamp. As collateral damage from rapid cycling, I had to get a new gas valve too. Separate story.
After waiting a few minutes, I tried the relief valve again and got more air. This meant that the air removal valve wasn’t working, and probably some of my zones weren’t working because of air-bound pumps or bubbles in the pipes. You might be wondering how the valve knows to let out air, but not water. Inside the cylinder is a float. When there is water inside, the float rises and closes the outlet port. When there is air inside, the float falls, opening the outlet port and letting out the air. It is pretty simple. I called a plumber friend to see if he could fix this and he said “if you can replace a zone pump, you can replace this valve too.” Basically, you turn off the system, close all the valves to minimize the amount of water that will come out, depressurize the system, and work fast. A new valve was $13 at Home Depot. The fact they had 10 in stock suggests they do go bad. Unfortunately I failed to depressurize the system as well as I thought, and I got a 3 foot high gusher of 130F water. Be careful! Heating systems run at around 10 psi. The pressure comes partly from town water pressure through a pressure regulator, and partly from the expansion of hot water. There is an expansion tank to reduce that effect.
The next day, I tried the pressure relief valve again and got water immediately. Probably this means the new valve is working.
Each zone has a temperature gauge. You can see that the two on the right are low, and the two on the left in this picture are not. The right hand zone had the pump I replaced. The next one was not turned on. The temperature gauges are there because you don’t want to run 160F water through these radiant tubes. The floors will get too hot and the tubes won’t last very long. Instead, each zone has a check valve and a mixing valve.
The check valve keeps the loop from flowing backwards, or generally keeps it from circulating by gravity. Cold water is slightly denser than hot water, so the water on the colder side of the loop will fall, pulling hot water around the loop even without the pump running. The spring in the check valve is enough to stop gravity circulation.
The mixing valve has a green adjusting knob. This valve mixes hot water from the boiler with cooler water from the return leg of the zone, and serves to adjust the temperature of the water in each zone. Some water recirculates, with some hot water added.
When I turned on the zone second from the right, it did not work. The temperature gauge stayed put at 80F (conducted heat through the copper pipes). I used my ammeter to confirm the pump was drawing power. I turned off the valves for all the other zones, so that this one would have more water. Didn’t work.
There are three reasons why a hot water zone might not work: the pump is not spinning, the pump is trying to pump air, or the pipes are clogged. I had just replaced a pump to fix a zone, but was there a second bad pump? Or something else?
I have an infrared non-contact thermometer, and I used it to measure the pump housing temperatures. The working pumps were all around 125F, the non working pump was around 175F. That might mean that it was stalled, and not spinning, or that it was pumping air, and not being cooled by the water. I had one more spare pump, but I was getting suspicious.
I got to wondering if the pump I removed was really broken. I knew that these Taco 007-F5 pumps have a replaceable cartridge inside, but since the cartridge costs almost as much as a new pump I had never bothered with it. I decided to take apart the pump I removed to see what it looks like.
The pump housing is on the left. The impeller attached to the replaceable cartridge is in the center, and the motor proper is on the right. The impeller wasn’t jammed, but I wanted to know if it was working at all. I cut the cord off a broken lamp and used it to wire up the pump.
I was careful not to touch the pump when plugged in, because you will notice there is no ground. The impeller worked fine. Probably there was never anything wrong with the pump. While I had it set up like this, I measured 0.7 Amps current when running, which is what it should be. I then held on to the (plastic) impeller and turned it on. When stalled, the motor draw rose to 1.25 Amps. I now had a way to tell if a motor was stalled or spinning! The suspect zone was drawing 0.79 Amps, which probably means it was spinning, and the high temperature meant there was no water inside.
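The stalled-or-spinning test boils down to a threshold on the ammeter reading. Here it is as a tiny sketch, using the 0.7 Amp and 1.25 Amp figures I measured on the bench. The cutoffs are my own guesses (halfway between the two readings, and half the running current for "not really powered"), not anything from the pump's data sheet.

```python
RUNNING_AMPS = 0.7    # measured on the bench with the impeller free
STALLED_AMPS = 1.25   # measured while holding the impeller still

def pump_state(amps, running=RUNNING_AMPS, stalled=STALLED_AMPS):
    """Guess what a circulator is doing from its current draw.

    The thresholds are assumptions: below half the running current we call
    it unpowered, and the midpoint between the two bench readings separates
    spinning from stalled.
    """
    if amps < running / 2:
        return "no power"
    midpoint = (running + stalled) / 2
    return "spinning" if amps < midpoint else "stalled"

print(pump_state(0.79))  # the suspect zone
```

Note that the current alone cannot tell you whether a spinning pump is moving water or just churning air. That is where the housing temperature comes in.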
Around this time, Win called to ask me to go pick up firewood. While waiting I explained all this to Cathy. She has a PhD in Chemical Engineering, and has forgotten more about pipes and fluid flow than I will ever learn. She says “are the pumps self-priming?” Priming is the process of getting water into the pump so that it has something to pump. A self priming pump will pump air well enough to pull water up the pipe from a lower reservoir. A non-self-priming pump will not. These pumps are not self-priming. They depend on something else to get started. Cathy says “are the pumps below the reservoir level?” No, they are above the boiler. Cathy says “I would design such a thing with the pumps below the reservoir level, so they prime automatically”. Um, OK, but how does that help me? Cathy says “Turn off the top valve, take off the top flange and pour water in the top.” Doh.
I didn’t quite do that, because I remembered the geyser I got taking off the air vent. If I could let air out the top, water might flow in from below. All I did was loosen the bolts on the top flange a little. After about 10 seconds, I started getting water drops out of the joint, so I tightened the bolts and turned on the pump. Success! After a few minutes, the temperature gauge started to rise.
So probably my problems were too much air in the system all along.
On the way to buy a new air vent, I stopped at Win’s house to check his air vent, but we couldn’t find it! Either it is hidden away pretty well, which seems like a bad idea, or there isn’t one, which also seems like a bad plan. We’re puzzled, but he has heat. And now, so do I!
One heating zone still doesn’t work. The temperature gauge near the pump rises to 100, and the nearby pipes are warm, but the pipes upstairs (this is a second floor zone) are cold. I replaced the cartridge of the pump with the one I took apart the other day, and it spins, but there is no change. The pump is drawing current consistent with spinning. I loosen the top flange above the pump and water comes out. These symptoms are consistent with the pump spinning, and having water, but there is no flow all the way around the loop.
I took a detour to the Taco website and looked at the pump performance curves for the 007-F5, which are at http://www.taco-hvac.com/uploads/FileLibrary/00_pumps_family_curves.pdf. A pump has a certain ability to push water uphill. The weight of water above the pump more or less pushes back on the pressure generated by the pump. This height of water is called the “head”. A pump will pump more water against a lower head, and as the head grows, the pump will deliver less and less water. Above a certain head it won’t work at all. According to the performance curves for my circulating pumps, their flow rate will drop to 0 at 10 feet of head. From the pump location to the distribution manifold in the wall behind the closet in the upstairs bedroom is about 18 feet. This pump cannot work if the pipe is not completely full of water. If both the supply pipe to the upstairs and the return pipe coming back are full of water, then because water is incompressible, the suction of the water falling down the return pipe will balance the weight of water in the supply pipe. If the pipe is full of air, as it likely is, then this pump is not powerful enough to lift water to the top.
The solution to this is to “purge” the air out of the pipes, by using some external source of pressure to push water into the supply end until all the air is pushed out of the return end. For this to work, the return end must be opened up to atmosphere, otherwise there’s no place for the air to go. (It will likely just get squeezed by the pressure, but there is no route for it to get to, for example, the air vent.) I think you need a pretty high flow rate to do this, because the return pipe is 3/4 inch, and without a high flow rate, the air bubbles will float up against the downwards flow of water.
Some systems have air vents at the high points. Mine do not. This would help, because water would flow up both the supply and return pipes, lifted by the 10PSI system pressure. Since it only takes 7.8 psi pressure to lift water 18 feet, this would completely fill the pipes. Of course there would be a potentially leaky air vent inside the walls upstairs, to cause trouble in some future year. I don’t know if the lack of vents is sloppy installation or if one is supposed to use some other method of purging.
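The arithmetic behind those numbers is just the weight of a column of water. A quick sketch, assuming roughly 0.4335 psi per foot of cold fresh water (hot water is slightly less dense, so this overstates things a little):

```python
PSI_PER_FOOT_OF_WATER = 0.4335  # approximate, cold fresh water

def psi_to_lift(feet):
    """Pressure needed to hold up a column of water this many feet tall."""
    return feet * PSI_PER_FOOT_OF_WATER

def feet_lifted_by(psi):
    """Height of water column that a given pressure can support."""
    return psi / PSI_PER_FOOT_OF_WATER

riser_feet = 18         # pump to the upstairs manifold
system_psi = 10         # typical system pressure
pump_shutoff_feet = 10  # head at which the pump curve reaches zero flow

print(f"pressure to lift {riser_feet} ft: {psi_to_lift(riser_feet):.1f} psi")
print(f"{system_psi} psi supports a column of {feet_lifted_by(system_psi):.1f} ft")
print("pump alone can fill the riser:", pump_shutoff_feet >= riser_feet)
```

So the system pressure has enough push to fill the riser if the air has somewhere to go, while the circulator by itself does not.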
My system installation has no obvious (to me anyway) purge arrangements. To purge, you shut off valves on the boiler, put a hose from a valve on the return side into a bucket of water, and turn on external water on the supply side. When the hose stops bubbling air, you are good to go.
In my system, makeup water comes from the house cold water pipes, through a backflow preventer and a pressure reduction valve to the hot water manifold. The return pipes from the zones flow to the boiler return manifold and then to the boiler. There is no master return shutoff, and no purge tap on the return manifold. There is a drain tap on the boiler itself, and there is a tap between the boiler and a valve that can isolate the boiler from the hot water supply manifold. The pressure regulator has a little lever on the top that according to its user manual will open the regulator and let more water through for purging.
I could close the valve to isolate the boiler from the supply manifold, but then the purge water has to run all the way through the boiler to get to the outlet hose. I would lose all the hot water in the boiler.
But I have a missing pump! Years ago, I borrowed the pump from the zone that heats the study, and never put it back. I closed all the supply zone valves except the bad zone, and closed all the return valves except the bad zone and the study zone. I closed the main boiler output valve. At this point, the only path through the system was from the makeup water regulator, through the broken zone, to the return manifold, backwards into the study zone return pipe, through the cold side of the study zone mixing valve, and out the bottom flange of the not-present study pump.
I put a bucket under it and opened the bottom study zone pump valve. Water came out, but after a few gallons, I only get a trickle. I can hear hissing when I open the regulator toggle, but I suspect there is not enough flow to do effective purging. The setup is complicated, so I am not completely sure. In any case, this didn’t fix the not-working zone.
Next step: test the pressure regulator flow by closing all valves except makeup water and the tap that is connected to the boiler outlet manifold. That will let me see the flow supplied by the regulator. I found an old backflow valve and regulator set on the floor. Evidently it was replaced at some point. The old one had a pretty clogged looking input screen, so perhaps that is the trouble with the current one as well. That wouldn’t affect normal operations because you don’t need makeup water unless there is a leak.
I propose a definition of Big Data. Big Data is stuff that you cannot process within the MTBF of your tools.
Here’s the story about making a backup of a 1.1 Terabyte filesystem with several million files.
A few years ago, Win and I built a set of home servers out of mini-ATX motherboards with Atom processors and dual 1.5 Terabyte drives. We built three, one for Win’s house, that serves as the compound IMAP server and such like, one for my house, which mostly has data and a duplicate DHCP server and such like, and one, called sector9, which has the master copy of the various open source SiCortex archives.
These machines are so dusty that it is no longer possible to run apt-get update, and so we’re planning to just reinstall more modern releases. In order to do that, it is only prudent to have a couple of backups.
In the case of sector9, it has a pair of 1.5 T drives set up as RAID 1 (mirrored). We also have a 1.5T drive in an external USB case as a backup device. The original data is still on a 1T external drive, but with the addition of this and that, the size of sector9’s data had grown to 1.1T.
I decided to make a new backup. We have a new Drobo5N NAS device, with 3 3T drives, set up for single redundancy, giving it 6T of storage. Using 1.1T for this would be just fine.
There have been any number of problems.
Idea 1 – mount the Drobo on sector9 and use cp -a or rsync to copy the data
The Drobo supports only AFP (Apple Filesharing Protocol) and CIFS (Windows file sharing). I could mount the Drobo on sector9 using Samba, except that sector9 doesn’t already have Samba, and apt-get won’t work due to the age of the thing.
Idea 2 – mount the Drobo on my Macbook using AFP, and mount sector9 on the Macbook using NFS.
Weirdly, I had never installed the necessary packages on sector9 to export filesystems using NFS.
Idea 3 – mount the Drobo on my Macbook using AFP and use rsync to copy files from sector9.
This works, for a while. The first attempt ran at about 3 MB/second, and copied about 700,000 files before hanging, for some unknown reason. I got it unwedged somehow, but not trusting the state of everything, rebooted the Macbook before trying again.
The second time, rsync took a couple of hours to figure out where it was, and resumed copying, but only survived a little while longer before hanging again. The Drobo became completely unresponsive. Turning it off and on did not fix it.
I called Drobo tech support, and they were knowledgeable and helpful. After a long sequence of steps, involving unplugging the drives, and restarting the Drobo without the mSata SSD plugged in, we were able to telnet to its management port, but the Drobo Desktop management application still didn’t work. That was in turn resolved by uninstalling and reinstalling Drobo Desktop (on a Mac! Isn’t this disease limited to PCs?)
At this point, Drobo tech support asked me to use the Drobo Desktop feature to download the Drobo diagnostic logs and send them in….but the diagnostic log download hung. Since the Drobo was otherwise operational, we didn’t pursue it at the time. (A week later, I got a followup email asking me if I was still having trouble, and this time the diagnostic download worked, but the logs didn’t show any reason for the original hang.)
By the way, while talking to Drobo tech support, I discovered a wealth of websites that offer extra plugins for Drobos (which run some variant of linux or bsd). They include an nfs server, but using it kind of voids your tech support, so I didn’t.
A third attempt to use rsync ran for a while before mysteriously failing as well. It was clear to me that while rsync will synchronize two filesystems, it might never finish if it has to check its work from the beginning and doesn’t last long enough to finish.
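This is the MTBF definition of Big Data in action. As a back-of-the-envelope sketch, assume failures arrive at random with a constant rate (an exponential model, which is my assumption and not something I measured) and that every failure sends the job back to the start. The numbers in the example are made up for illustration.

```python
import math

def completion_probability(job_hours, mtbf_hours):
    """Chance that one attempt runs to the end without a failure,
    assuming exponentially distributed time between failures."""
    return math.exp(-job_hours / mtbf_hours)

def expected_hours_with_restarts(job_hours, mtbf_hours):
    """Expected total wall-clock time when every failure restarts the job
    from zero, under the same exponential assumption."""
    return mtbf_hours * math.expm1(job_hours / mtbf_hours)

# Hypothetical numbers: a 100 hour copy on tooling that hangs every 20 hours or so.
job, mtbf = 100.0, 20.0
print(f"chance a single attempt finishes: {completion_probability(job, mtbf):.3%}")
print(f"expected total time with restarts: {expected_hours_with_restarts(job, mtbf):.0f} hours")
```

Once the job is several times longer than the MTBF, the expected time blows up exponentially, which is why a tool that cannot resume cheaply may effectively never finish.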
I was also growing nervous about the second problem with the Drobo, that it uses NTFS, not a linux filesystem. As such, it was not setting directory dates, and was spitting warnings about symbolic links. Symbolic links are supposed to work on the Drobo. In fact, I could use ln -s in a Macbook shell just fine, but what shows up in a directory listing is subtly different from what shows up in a small rsync of linux symbolic links.
Idea 4: Mount the drobo on qadgop (my other server, which does happen to have Samba installed) and use rsync.
This again failed to work for symbolic links, and a variety of attempts to change the linux smb.conf file in ways suggested by the Internet didn’t fix it. There were suggestions to root the Drobo and edit its configuration files, but again, that made me nervous.
At this point, my problems are twofold:
How to move the bits to the Drobo
How to convince myself that any eventual backup was actually correct.
I decided to create some end-to-end check data, by using find and md5sum to create a file of file checksums.
First, I got to wondering how healthy the disk drives on sector9 actually were, so I decided to try SMART. Naturally, the SMART tools for linux were not installed on sector9, but I was able to download the tarball and compile them from sources. Alarmingly, SMART told me that for various reasons I didn’t understand, both drives were likely to fail within 24 hours. It told me the external USB drive was fine. Did it really hold a current backup? The date on the masking tape on the drive said 5/2012 or something, about a year old.
I started find jobs running on both the internal drives and the external:
These jobs actually ran to completion in about 24 hours each. I now had two files, like this:
root@sector9:~# ls -l *.md5
-rw-r--r-- 1 root root 457871770 2013-07-08 01:24 s9backup.md5
-rw-r--r-- 1 root root 457871770 2013-07-07 21:39 s9.md5
root@sector9:~# wc s9.md5
3405297 6811036 457871770 s9.md5
This was encouraging: the files were the same length. But diffing 450 MB files is not for the faint of heart, especially since find doesn’t enumerate them in the same order. I had to sort each file, then diff the sorted files. This took a while, but in fact the sector9 filesystem and its backup were identical. I resolved to use this technique to check any eventual Drobo backup. It also relieved my worries that the internal drives might fail at any moment. I also learned that the sector9 filesystem had about 3.4 million files on it.
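The sort-then-diff step can be sketched roughly like this. It assumes manifests in md5sum's "hash, then path" format; the function name is made up, and sorting by path rather than by whole line is a choice of this sketch, made so that a changed file shows up as an adjacent pair of lines in the diff:

```shell
#!/bin/sh
# compare_manifests A B
# Sorts both manifests by path (field 2 to end of line) and diffs them.
# Returns diff's status: 0 if identical, 1 if they differ.
compare_manifests() {
    a=$1
    b=$2
    tmp=$(mktemp -d) || return 2
    # LC_ALL=C forces plain byte ordering, so both files sort the same way
    # regardless of the locale settings on the machine.
    LC_ALL=C sort -k 2 "$a" > "$tmp/a.sorted"
    LC_ALL=C sort -k 2 "$b" > "$tmp/b.sorted"
    diff "$tmp/a.sorted" "$tmp/b.sorted"
    status=$?
    rm -rf "$tmp"
    return $status
}

# Hypothetical usage:
#   compare_manifests s9.md5 s9backup.md5 && echo "copies match"
```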
Idea 5: Create a container file on the Drobo, with an ext2 filesystem inside, and use that to hold the files.
This would solve the problem of putting symbolic links on the Drobo filesystem (even though it is supposed to work!) It would also fix the problem of NTFS not supporting directory timestamps or linux special files. I was pretty sure there would be root filesystem images in the sector9 data for the SiCortex machine and for its embedded processors, and I would need special files.
But how to create the container file? I wanted a 1.2 Terabyte filesystem, slightly bigger than the actual data used on sector9.
According to the InterWebs, you use dd(1), like this:
dd if=/dev/zero of=container.file bs=1M seek=1153433 count=0
I tried it:
dd if=/dev/zero of=container.file bs=1M seek=1153433
It seemed to take a long time, so I thought probably it was creating a real file, instead of a sparse file, and went to bed.
The next morning it was still running.
That afternoon, I began getting emails from the Drobo that I should add more drives, as it was nearly full, then actually full. Oops. I had left off the count=0.
Luckily, deleting a 5 Terabyte file is much faster than creating one! I tried again, and the dd command with count=0 ran very quickly.
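For reference, a small sketch of the sparse-file trick, assuming GNU dd on linux (the 1M suffix is GNU syntax) and a filesystem that supports sparse files. The function name and the small size in the usage are placeholders for experimenting safely:

```shell
#!/bin/sh
# make_sparse FILE SIZE_MIB
# count=0 means no data is copied at all. seek=N positions the output N blocks
# in, and dd then sets the file length to that offset, so the file has an
# apparent size of N MiB while using almost no disk space.
make_sparse() {
    dd if=/dev/zero of="$1" bs=1M seek="$2" count=0 2>/dev/null
}

# Hypothetical usage: compare the apparent size with the space actually used.
#   make_sparse test.img 64
#   ls -l test.img     # apparent size: 67108864 bytes
#   du -k test.img     # blocks actually allocated: close to zero
```

Leaving off count=0 turns the same command into an open-ended copy from /dev/zero that only stops when the disk is full.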
I thought that MacOS could create the filesystem, but I couldn’t figure out how. I am not sure that MacOS even has something like the linux loop device, and I couldn’t figure out how to get DiskUtility to create a unix filesystem in an image file.
I mounted the Drobo on qadgop, using Samba, and then used the linux loop device to give device level access to the container file, and I was able to mkfs an ext2 filesystem on it.
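A sketch of the loop-device sequence, using the classic losetup, mkfs.ext2, and mount commands. The file name, loop device, and mount point in the usage are placeholders, and the run helper is a convenience of this sketch: it only prints the commands unless RUN=1 is set, since the real thing needs root.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set RUN=1 (as root) to actually run them.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# make_container_fs FILE LOOPDEV MOUNTPOINT
make_container_fs() {
    file=$1
    loopdev=$2
    mnt=$3
    run losetup "$loopdev" "$file" &&   # attach the file to a loop device
        run mkfs.ext2 "$loopdev" &&     # build an ext2 filesystem inside it
        run mkdir -p "$mnt" &&
        run mount "$loopdev" "$mnt"     # mount it like any block device
}

# Hypothetical usage:
#   make_container_fs /mnt/drobo/container.file /dev/loop0 /mnt/container
```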
Idea 6: Mount the container file on MacOS and use rsync to write files into it.
I couldn’t figure out how to mount it! Again, MacOS seems to lack the loop device. I tried using DiskUtility to pretend my container file was a DVD image, but it seems to have hardwired the notion that DVDs must have ISO filesystems.
Idea 7: Mount the Drobo on linux, loop mount the container, USB mount the sector9 backup drive.
This worked, sort of. I was able to use rsync to copy a million files or so before rsync died. Restarting it got substantially further, and a third run appeared to finish.
The series of rsyncs took several days to run. Sometimes they would run at about 3 MB/s, and sometimes at about 7 MB/s. No idea why. The Drobo will accept data at 11 MB/s using AFP, so perhaps this was due to slow performance of the USB drive. The whole copy took roughly 100 hours, as calculated by 1.1 T at 3 MB/s.
Unfortunately, df said the container filesystem was 100% full and the final rsync had errors “previously reported” but scrolled off the screen. I am pretty sure the 100% is a red herring, because linux likes to reserve 10% of space for root, and the container file was sized to be more than 90% full.
I reran the rsync, under a script(1) to get a log file, and found many errors of the form “can’t mkdir <something or other>”.
Next, I tried mkdir by hand, and it hung. Oops. Ps said it was stalled in state D, which I know to be disk wait. In other words, the ext2 filesystem was damaged. By use of kill -9 and waiting, I was able to unmount the loop device and Drobo, and remount the Drobo.
Next, I tried using fsck to check the container filesystem image.
fsck takes hours to check a 1.2T filesystem. Eventually, it started asking me about random problems and could I authorize it to fix them? After typing “y” a few hundred times, I gave up, killed the fsck, and restarted it as fsck -p to automatically fix problems. Recall that I don’t actually care if it is perfect, because I can rerun rsync and check the final results using my md5 checksum data.
The second attempt to run fsck didn’t work either:
root@qadgop:~# fsck -a /dev/loop0
fsck 1.41.4 (27-Jan-2009)
/dev/loop0 contains a file system with errors, check forced.
/dev/loop0: Directory inode 54583727, block 0, offset 0: directory corrupted
/dev/loop0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
Hoping that the fsck -a had fixed most of the problems, I ran it a third time, again without -a, but I wound up typing ‘y’ a few hundred more times. fsck took about 300 minutes of CPU time on the Atom to do this work and left 37 MB worth of files and directories in /lost+found.
With the container filesystem repaired, I started a fourth rsync, which actually finished, transferring another 93 MB.
Next step – are the files really all there and all the same? I’ll run the find -exec md5sum to find out.
Um. Well. What does this mean?
root@qadgop:~# wc drobos9.md5 s9.md5
3526171 7052801 503407914 drobos9.md5
3405297 6811036 457871770 s9.md5
The target has 3.5 million files, while the source has 3.4 million files! That doesn’t seem right. An hour of running “du” and comparing the top few levels of directories shows that while rerunning rsync to finish interrupted copies works, you really have to use the same command lines. I had what appeared to be a complete copy one level below a partial copy. After deleting the extra directories, and using fgrep and sed to rewrite the path names in the file of checksums, I was finally able to do a diff of the sorted md5sum files:
Out of 3.4 million files, there were 8 items like this:
< 51664d59ab77b53254b0f22fb8fdb3a8 ./sicortex-archive/stash/97/sha1_97e18c8e2261b09e21b0febd75f61635d7631662_64088060.bin
> 127cc574dcb262f4e9e13f9e1363944e ./sicortex-archive/stash/97/sha1_97e18c8e2261b09e21b0febd75f61635d7631662_64088060.bin
and one like this:
> 8d9364556a7891de1c9a9352e8306476 ./downloads.sicortex.com/dreamhost/ftp.downloads.sicortex.com/ISO/V3.1/.SiCortex_Linux_V3.1_S_Disk1_of_2.iso.eNLXKu
The second one is easier to explain: it is a partially completed rsync, so I deleted it. The other 8 appear to be files that were copied incorrectly! I should have checked the lengths, because these could be copies that failed due to running out of space, but I just reran rsync on those 8 files in --checksum mode.
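One way to turn the diff output into a list of files to recopy is sketched below. It assumes the diff was made between md5sum manifests with ./ relative paths, as in the excerpts above; the function name, list file, and mount points are placeholders.

```shell
#!/bin/sh
# mismatched_paths < DIFF_OUTPUT
# Prints each path found on a "<" or ">" line of the diff, once,
# with the leading "./" removed.
mismatched_paths() {
    sed -n 's|^[<>] [0-9a-f]\{32\} \{1,\}\./||p' | LC_ALL=C sort -u
}

# Hypothetical usage:
#   diff s9.sorted drobos9.sorted | mismatched_paths > recopy.list
#   rsync -a --checksum --files-from=recopy.list /mnt/usb/ /mnt/container/
```

With --files-from, rsync reads the paths relative to the source directory, and --checksum makes it compare file contents instead of trusting size and modification time.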
Final result: 1.1 Terabytes and 3.4 million files copied. Elapsed time, about a month.
What did I learn?
Drobo seems like a good idea, but systems that ever need tech support intervention make me nervous. My remaining worry about it is proprietary hardware. I don’t have the PC ecosystem to supply spare parts. Perhaps the right idea is to get two.
Use linux filesystems to hold linux files. It isn’t just Linus and his files that vary only in capitalization; it is also the need to hold special files and symlinks. Container files and loop mounting work fine.
Keep machines updated. We let these get so far behind that we could no longer install new packages.
A meta-rsync would be nice, that could use auxiliary data to manage restarts.
Filesystems really should have end-to-end checksums. ZFS and BTRFS seem like good ideas.
SMB, or CIFS, or the Drobo, or AFP are not good at metadata operations. It was a fail to try writing large numbers of individual files on the Drobo, no matter how I tried it. SMB read-write access to a single big file seems to be perfectly reliable.
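Short of a real meta-rsync, a small retry wrapper at least guarantees that every restart uses exactly the same command line. This is only a sketch: the function name, retry count, delay, and paths are all made up for illustration.

```shell
#!/bin/sh
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs the same command line again and again until it exits 0,
# or gives up after MAX_TRIES attempts.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Hypothetical usage: source and destination are typed once, so a restart
# cannot land one directory level off from the previous run.
#   retry 20 60 rsync -aH /mnt/usb/ /mnt/container/
```

It does nothing about rsync rescanning from the beginning on each restart; it only removes the human from the restart loop.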
I am struggling here to decide whether the Bradley Manning prosecutors are disingenuous or just stupid.
I am reacting here to Cory Doctorow’s report that the government’s lawyers accuse Manning of using that criminal spy tool wget. Notes from the Ducking Stool
I am hoping for stupid, because if they are suggesting to the jury facts they know not to be true, then that is a violation of ethics, their oaths of office, and any concept of justice.
Oh, and wget is exactly what I used, the last time I downloaded files from the NSA.
A while back, the back issues of the NSA internal newsletter Cryptolog were declassified so I downloaded the complete set. I think the kids are puzzled about why I never mind having to wait in the car for them to finish something or other, but it is because I am never without a large collection of fascinating stuff.
Here’s how I got them, after scraping the URLs out of the agency’s HTML:
wget http://www.nsa.gov/public_info/_files/cryptologs/cryptolog_01.pdf
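The scraping step might look something like the sketch below. It assumes the listing page was saved locally and that the links are plain href attributes ending in .pdf; the function name and file names are placeholders, and only absolute and site-rooted links are handled.

```shell
#!/bin/sh
# pdf_urls BASE_URL < page.html
# Prints one absolute URL for every href that ends in .pdf.
# Links relative to the current directory (no leading /) are skipped.
pdf_urls() {
    base=$1
    grep -o 'href="[^"]*\.pdf"' |
        sed -e 's/^href="//' -e 's/"$//' |
        while IFS= read -r link; do
            case $link in
                http://*|https://*) printf '%s\n' "$link" ;;
                /*)                 printf '%s\n' "$base$link" ;;
            esac
        done
}

# Hypothetical usage:
#   pdf_urls http://www.nsa.gov < index.html > urls.txt
#   wget -i urls.txt      # fetch every URL listed in the file
```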
Cathy is off to China for a few weeks. She wanted email access, but not with her usual laptop.
She uses Windows Vista on a plasticy HP laptop from, well, the Vista era. It is quite heavy, and these days quite flaky. It has a tendency to shut down, although not for any obvious reason, other maybe than age, and being Vista running on a plasticy HP laptop.
I set up the iPad, but Cathy wanted a more familiar experience, and needed IE in order to talk to a secure webmail site, so we dusted off an Asus EEE netbook running Windows XP.
I spent a few hours trying to clear off several years of accumulated crapware such as three different search toolbars attached to Internet Explorer, then gave up and re-installed XP from the recovery partition. 123 Windows Updates later, it seemed fine, but still wouldn’t talk to the webmail site. It turns out that Asus thoughtfully installed the open source local proxy server Privoxy, with no way to uninstall it. If you run the Privoxy uninstall, it leaves you with no web access at all. I finally found Interwebs advice to also uninstall the Asus parental controls software, and that fixed it.
Next, I installed Thunderbird, and set it up to work with Cathy’s account on the family compound IMAP server. I wanted it to work offline, in case of spotty WiFi access in China, so after setting that up, I “unsubscribed” from most of the IMAP folders and let it download. Now Cathy’s inbox has 34,000 messages in it, and I got to thinking “what about privacy?” After all, governments, especially the United States, claim the right to search all electronic devices at the border, and it is also commonly understood that any electronic device you bring to China can be pwned before you come back.
Then I found a setting that tells Thunderbird to download only the last so many days for offline use. Great! But it had already downloaded all 6 years of back traffic. Adjacent, there is a setting for “delete mail more than 20 days (or whatever) old.”
You know what happens next! I turned that on, and Thunderbird started deleting all Cathy’s mail, both locally and on the server. Now, there is fine print farther down the page that explains this will happen, but I didn’t read it.
Parenthetically, this is an awful design. It really looks like a control associated with how much mail to keep for offline use, but it is not. It is a dangerous, unguarded, unconfirmed command that does irreversible damage.
I thought this was taking too long, but by the time I figured it out, it was way too late.
So, how to recover?
I have been keeping parallel feeds from Cathy’s email, but only since March or so, since I’ve been experimenting with various spam suppression schemes.
I had made a copy of Cathy’s .maildir on the server, but it was from 2011.
But wait! Cathy’s laptop was configured for offline use, and had been turned off. Yes! I opened the lid and turned off WiFi as quickly as possible, before it had a chance to sync. (Actually, the HP has a mechanical switch to turn off WiFi, but I didn’t know that.) I then changed the username/password on her laptop Thunderbird to stop further syncing.
Next, since the horse was well out of the barn, I made a snapshot of the server .maildir, and of the HP laptop’s Thunderbird profile directories. Now, whatever I did, right or wrong, I could do again.
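On the server side, a snapshot can be as simple as a timestamped tarball. A minimal sketch, with a made-up function name and placeholder paths:

```shell
#!/bin/sh
# snapshot_dir DIR DEST_DIR
# Packs DIR into DEST_DIR/snapshot-YYYYMMDD-HHMMSS.tar.gz and prints the
# name of the archive it wrote.
snapshot_dir() {
    src=$1
    dest=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/snapshot-$stamp.tar.gz"
    mkdir -p "$dest" &&
        # -C changes into the parent directory first, so the archive holds
        # ".maildir/..." rather than the full absolute path.
        tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        printf '%s\n' "$archive"
}

# Hypothetical usage:
#   snapshot_dir /home/cathy/.maildir /backup/snapshots
```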
Time for research!
What I wanted to do seemed clear: restore the off-line saved copies of the mail from the HP laptop to the IMAP server. This is not a well-travelled path, but there is some online advice: http://www.fahim-kawsar.net/blog/2011/01/09/gmail-disaster-recovery-syncing-mail-app-to-gmail-imap/ https://support.mozillamessaging.com/en-US/kb/profiles
The general idea is:
Disconnect from the network
Make copies of everything
While running in offline mode, copy messages from the cached IMAP folders to “Local” folders
Reconnect to the network and sync with the server. This will destroy the cached IMAP folders, but not the new Local copies
Copy from the Local folders back to the IMAP server folders
Seems simple, but in my case, there were any number of issues:
Not all server folders were “subscribed” by Thunderbird, and I didn’t know which ones were
The deletion was interrupted at some point
I didn’t want duplicated messages after recovery
INBOX was 10.3 GB (!)
The Thunderbird profile altogether was 23 GB (!)
The HP laptop was flakey
Cathy’s about to leave town, and needs last minute access to working email
One thing at a time.
Tools
I found out about “MozBackup” and used it to create a backup copy of the HP laptop’s profile directory.
MozBackup creates a zip file of the contents of a Thunderbird profile directory, and can restore them to a different profile on a different computer, making configuration changes as appropriate. This is much better than hand editing the various Thunderbird configuration files.
As I mentioned, the HP laptop is sort of flakey. I succeeded in copying the Thunderbird profile directory, but 23 GB worth of copying takes a long time on a single 5400 rpm laptop disk. I tried copying to a Mybook NAS device, but it was even slower. What eventually worked, not well, but adequately, was copying to a 250GB USB drive.
I decided to leave the HP out of it, and to do the recovery on the netbook, the only other Windows box available. I was able to create a second profile on the netbook, and restore the saved profile to it, slowly, but I realized Cathy would leave town before I finished all the steps, taking the netbook with her. Back to the HP.
First I tried just copying the IMAPMail subfolder files of mbox files and msf files to LocalFolders. This seemed to work, but Thunderbird got very confused about it. It said there were 114000 messages in Inbox, rather than 34000. This shortcut is a dead end.
I created a new profile on the HP, and restored the backup using MozBackup (which took 2 hours), and started it in offline mode. I then tried to “select-all” in Inbox to copy them to a local folder. Um. No. I couldn’t even get control back. Thunderbird really cannot select 34000 messages and do anything.
Because I was uncertain about the state of the data, I restored the backup again (another 2 hours).
This time, I decided to break up Inbox into year folders, each holding about 7000 messages. The first one worked, but then the HP did an unexpected shutdown during the second, and when it came back, Inbox was empty! The Inbox mbox file had been deleted.
I did another restore, and managed to create backup files for 2012 and 2011 messages, before it crashed again. (And Inbox was gone AGAIN)
The technique seemed likely to eventually work, but it would drive me crazy. Or crazier.
I was now accumulating saved Local Folder files representing 3 specific years of Inbox. I still had to finish the rest, deal with Sent Mail, and audit about 50 other subfolders to see if they needed to be recovered.
I wasn’t too worried about all the archived subfolders, since they hadn’t changed in ages and were well represented by my 2011 copy of Cathy’s server .maildir
What about server backups? Embarrassing story here! Back in 2009, Win and I built some nice mini-ATX Atom-based servers with dual 1.5T disks run in mirrored mode for home servers. Win’s machine runs the IMAP, and mine mostly has data storage. Each machine has the mirrored disks for reliability and a 1.5T USB drive for backup. The backups are irregularly kept up to date, and in the IMAP machine’s case, not recently.
About 6 months ago, I got a family pack of CrashPlan for cloud backup, and I use it for my Macbook and for my (non IMAP) server, but we had never gotten around to setting up CrashPlan for either Cathy’s laptop or the IMAP server.
A few months ago, we got a Drobo 5N, and set it up with 3 3T disks, for 6T usable storage, but we haven’t gotten it working for backup either. (I am writing another post about that.)
So, no useful server backups for Cathy’s mail.
Well now what?
I have a nice Macbook Pro. Unfortunately, the 500 GB SSD has 470 GB of data, which does not leave enough room for one copy of Cathy’s cached mail, let alone two. I thought about freeing up space, and copied a 160 GB Aperture photo library to two other systems, but it made me nervous to delete it from the Macbook.
I then tried using Mac Thunderbird to set up a profile on that 250 GB external USB drive, but it didn’t work because the FAT filesystem couldn’t handle Mac Thunderbird’s need for fancy filesystem features like ACLs. But this triggered an idea!
First, I was nervous about using Mac Thunderbird to work on backup data from a PC. I know that Thunderbird profile directories are supposed to be cross-platform, but the config files like profile.ini and prefs.js are littered with PC pathnames.
Second, the USB drive is slow, even if it worked.
Up until recently, I’ve been using a 500 GB external Firewire drive for TimeMachine backups of the Macbook. It still was full of Time Machine data, but I’ve switched to using a 1T partition on the Drobo for TimeMachine. I also have the CrashPlan backup. So I reformatted the Firewire Drive to HFS, and plugged it in as extra storage.
Also on the Macbook, is VMWare Fusion, and one of my VMs is a 25 GB instance running XP Pro.
I realized I should be able to move the VM to the Firewire drive, and expand its storage by another 50 GB or so to have room to work on the 23 GB Thunderbird data.
To the Bat Cave!
It turns out to be straightforward to copy a VMWare image to another place, and then run the copy. Rather than expand the 25GB primary disk, I just added a second virtual drive and used XP Disk management to format it as drive E. I also used VMWare sharing to share access to the underlying Mac filesystem on the Firewire drive.
Copy VMWare image of XP to the Firewire drive
Copy MozBackup save file of the cached IMAP data and the various Local Files folders to the drive
Create second disk image for XP
Run XP under VMWare Fusion on the Macbook, using the Firewire drive for backing store
Install Thunderbird and MozBackup
Use Mozbackup to restore Cathy’s cached local copies of her mail from the flakey HP laptop
Copy the Local Files mailbox files for 2013, 2012, and 2011 into place.
Use XP Thunderbird running under VMWare to copy the rest of the cached IMAP data into Local Folders.
By hand, compare message counts of all 50 or so other IMAP folders in the cached copy with those still on the server, and determine they were still correct.
Go online, letting Thunderbird sync with the server, deleting all the locally cached IMAP data.
Create IMAP folders for 2007 through 2013, plus Sent Mail and copy the roughly 40000 emails back to the server.
During all of this, new mail continued to arrive into the IMAP server, and be accessible by the instance of Thunderbird on the netbook.
A copy of Cloudmark Desktop One was active running on the Macbook using Mac Thunderbird to do spam processing of arriving email in Cathy’s IMAP account.
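The folder-by-folder count comparison in the steps above was done by hand. For the server side, a sketch like the following could produce the counts, assuming the server keeps mail in Maildir++ layout (one file per message in cur/ and new/, subfolders as dot-prefixed directories). That layout is an assumption about the .maildir directory, and the function name is made up.

```shell
#!/bin/sh
# maildir_counts MAILDIR
# Prints "COUNT<TAB>FOLDER" for the inbox and for each dot-prefixed subfolder.
maildir_counts() {
    root=$1
    for folder in "$root" "$root"/.[!.]*; do
        # Anything without a cur/ directory is not a mail folder.
        [ -d "$folder/cur" ] || continue
        n=$(find "$folder/cur" "$folder/new" -type f 2>/dev/null | wc -l | tr -d ' ')
        if [ "$folder" = "$root" ]; then
            name=INBOX
        else
            name=${folder##*/.}     # strip the path and the leading dot
        fi
        printf '%s\t%s\n' "$n" "$name"
    done
}

# Hypothetical usage: run before and after the recovery and diff the output.
#   maildir_counts /home/cathy/.maildir > counts-before.txt
```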
My psyche is scarred, but I did manage to recover from a monstrous mistake.
RAID IS NOT BACKUP
The IMAP server was reliable, but it didn’t have backups that were useful for recovery.
Don’t think you understand what a complex email client is going to do
Don’t experiment with the only copy of something! I should have made a copy of the IMAP .maildir in a new account, and then futzed with the netbook thunderbird to get the offline use storage the way I wanted.
Quantity has a quality all its own.
This quote is usually about massive armies, but in this case, the very large email (23 GB) just made the simplest operations slow, and some things (like selecting ALL in a folder with 34000 messages) impossible. I had to go through a lot of extra work because various machines didn’t have enough free storage, and had other headaches because the MTBF of the HP laptop was less than the time to complete tasks.
I used Splasho’s “Up-Goer Five Text Editor” to write what I do, using only the most common 1000 words in English.
In my work I tell computers what to do. I write orders for computers that tell them first to do this, and then to do that, and then to do this again.
Sometimes the orders tell the computer to listen for other orders from people. Then the orders tell the computer how to do what the people want, and then the orders tell the computer to show the people what the answer is.
I used to build computers. I would take one part, and another part, and many more parts, and put them together in just the right way so the computer would work right. Computers are all the same, they listen for an order, then do what it says, then listen for another order. We use them because they do this thing very very very very fast.