AT&T Hell

Summary – AT&T customer service gives you bad information, tries to fix it and can’t, then lies about how it is “impossible”.
Update summary – Twitter works!  AT&T twitter team seems to have fixed the remaining problem.
“We don’t care, we don’t have to.” – Lily Tomlin
When I worked for IBM one summer, I wore a tie every day to see if I could do it.
When I drove an RX-7 in Palo Alto, I obeyed all the speed limits, to see if I could do it.
Last month I gave up my iPhone, to see if I could do it.
My daughter wanted an iPhone, but she’s in the middle of a two year contract on T-Mobile with a Palm Pixi.  My iPhone 4S is in the middle of a two year contract with AT&T that started October 2011.  It had the grandfathered unlimited data plan, and would be up for upgrade eligibility in May 2013.
On December 26, I called AT&T to see if I could port my number out and get a new number assigned to the iPhone, so I could let my daughter use it, while I would keep the T-Mobile phone, but with my number.  My number started out life a long time ago as a Verizon landline, with the number sequential to our home phone, so I am attached to it.  It is also on all my business cards and in countless contact lists.
AT&T said “sure”, when you port the number out, we’ll assign a new number to the iPhone and the contract will remain unchanged.
Life was good!  The daughter is happy, and I have a phone that is, um, interesting.  I also have an iPad, so don’t shed any tears about that!
A week or so later, we notice that the bill is $400ish.  There is an early termination charge on there!  You can’t actually figure out what the charge is from the online presentation.  You have to hunt up the pdf and look at the image of the printed bill.  This is a phone company, they know how to print phone bills, not how to build websites.
On the phone with customer service.  “When you ported out the number, that cancels your contract, and you get an early termination fee. Then you added a new line with new contract dates.”  I explained my call on the 26th, and the agent said, oh, well I can waive the early termination fee and make the contract be as it was. The only thing I can’t do is preserve the unlimited data plan.  So now the phone is on the 3GB plan.   I thought about balking, that unlimited plan made me feel like an old-time iPhone user, more privileged than the unwashed masses, but really, my usage is about 250 MB per month.  The iPad has a bigger screen.  So I let it slide.
A few days later, a website check showed the fees gone.  I noticed that the upgrade availability wording was different for this phone than for the other iPhone line, which also started October 2011, but decided to wait to see if other changes would catch up before calling.
A few days later, no change.  Called and learned that the second agent had waived the fees, but not fixed the contract dates.  I was assured that all would be fixed, and notes put in the account.
A few days later, no change to the upgrade language.  On calling, I was told that the contract would expire October 2013, as expected, but the upgrade eligibility date was July 2014.  What does that even mean?  After the contract is over, I can just create a new line, with a new contract and phone, and port the number!  It makes no sense to have an upgrade eligibility after the contract expiration.  Anyway, this is just stupid.  I explained that I had been told “the contract would be as it was” but the agent said there was just no way to change that in his system, the upgrade eligibility is tied to the phone number, not to the contract.
[By the way, this is also a lie, because, for example, if you are being stalked, you can request a new number and get it without any such collateral damage.]
I asked for a supervisor, who said:
This should never have been allowed in the first place.  You can’t port out a number and keep the contract. It is our number.  The agents who tried to “fix” it for you went way outside our policies and made it worse.  What they should have done to correct their original mistake was to port your number back in, not to try and fix the contract. It can’t be fixed, it is impossible to change an upgrade eligibility date. It is tied to the phone number.
The supervisor said there were no higher supervisors to talk to, and no physical mail address to send a complaint to.
Well.  This supervisor was certainly polite, but either was really unable to fix the problems that AT&T created, or unwilling to do so.
At the moment, I have a nice iPhone, with a pleased daughter, but I am not pleased.  I made a perfectly sensible request.  I was told “Yes, of course you can do that” and now the account is scrambled beyond belief.
Recapping

  •  iPhone 4S, 14 months into a 24 month contract.
  • I ask to port out my number, and get a new number assigned to the phone, without contract changes.  I’m not paying them any less, I am not getting a new phone, just changing a few bits in a database somewhere about what is the number!
  • AT&T says “yes”
  • AT&T charges an early termination fee, an activation fee, cancels my unlimited data plan, restarts the 2 year contract, and resets the upgrade eligibility date.  I am not even angry about the activation fee, they deserve some fee for the work.
  • I complain.  AT&T waives the early termination fee, promises to fix the contract, but doesn’t
  • I complain.  AT&T promises to fix the contract, but only fixes the contract termination dates, the upgrade date is now 9 months after the contract expires.
  • I complain.  AT&T says “impossible to fix”
  • AT&T supervisor says “impossible to fix, and there is no one higher than me to ask”
The only thing that an upgrade date after contract expiration might mean is that AT&T would refuse to unlock the phone until it is 2 3/4 years old.  That would piss me off, but I don’t even want to ask them right now.

And by the way, the iPhone battery doesn’t work as well as it used to, and that 18 month upgrade was starting to look pretty attractive!  Instead, I will likely have to pay Apple $79 to fix it.  At least that is cheaper than the $99 Applecare I forgot to get, if nothing else goes wrong with the phone.
Now I am not a phone company marketing person, but I think I understand the essential economics of subsidized phones.  AT&T gives a substantial discount on the phone in trade for a contract commitment.  In fact, this is still a worse deal for the customer than buying an unlocked phone on a carrier with cheaper plans, like Virgin or T-Mobile, but AT&T doesn’t discount the monthly charges if you bring your own device.  That is just another way to screw the consumer.  So with AT&T, you may as well get the subsidy if you don’t mind sticking around for two years.  And they really make their money back so quickly that they let you upgrade (and restart the two year clock) after 18 months.
This is a simple deal – AT&T discounts the phone, I promise to keep paying their (high) monthly bills for two years.  This has nothing to do with the phone number!  Changing the number has utterly no effect on the money flows.
What about that number?  AT&T says it is their number, they can attach whatever they want to it.  But that is not true.  I had the number with Verizon. I ported the number to AT&T, I ported it out.  The FCC has “local number portability”.  The numbers are managed by CLECs (I think that is the term of art for phone companies) but they really can’t be taken away from users except for some arcane technical reasons.
What has happened here?  It cannot be “impossible” to fix these sorts of problems.  There may be software limitations, but those are fixable.  Or they could merely write a note to themselves saying “Yes, the system says this contract runs until July 2014, but when the customer asks, in May 2013, for an upgrade, just waive the fees.  And when the customer cancels the contract in October 2013, waive any cancellation fee.”
Instead, they’ve spent a lot of money on customer service phone calls, which are not cheap. They’ve enraged a long-standing customer who has alternatives. They’ve provided more information to the entire internet about just how bad their service and systems are.  There is no good result for AT&T here. They’ve not gained any income. They haven’t kept control of their precious number. They may well lose me as a customer come October.  (That Nexus 4 on T-Mobile is looking pretty good, or a nice unlocked iPhone 5S or whatever.) And they are defending positions and policies that make no sense competitively or economically.
I’m not sure of the next step for me.  Probably I will tweet the URL of this blog entry to @ATTCustomerCare.  At this point, AT&T can fix the problems, or they can provide me a source of continuing amusement.  There’s a rumor that sometimes people get results by writing the CEO.  At a minimum that will cost them even more money to deal with my letter.
UPDATE – I tweeted this URL to @ATTCustomerCare and they actually answered, got me on the telephone, and fixed this, well enough.  Which is to say they can’t fix it in the database, but they’ve added a special note telling other folks to honor an upgrade request on or after the correct date.  Works for me.  (1/16/2013)
You can sort of understand how enterprise software can become unwieldy, to the point where it seems easier to correct software problems and poor specifications by adding layer upon layer of special fixes and exceptions and end-runs, but it is not good for customers or efficiency to do it that way.

Buying a lemon

Last month we got a shiny new Stop and Shop grocery store here in Wayland.  They’ve been having various grand opening specials so we have been dropping by.  I went over there Sunday evening to buy blueberries (two pints for $3! in January!) but they were out of stock.  I managed to leave the shopping list at home, so I had to go by my wits, which is really not such a good idea.
I checked out using the ScanIt! gadget, and this time I remembered to wait for the coupon accepted tone before dropping my coupon in the slot.  Last time I had to have staff fish my should-have-worked coupon out of the guts of the machine and fix it, but I digress.
After finishing, I called Cathy to see what I had forgotten and she told me to remember to get a lemon and to get a rain check for the 10/$10 frozen vegetables they had run out of.  (I already had a rain check for the blueberries).
I didn’t get another ScanIt! machine for one lemon, so I went over to produce and picked out a nice lemon.  66 cents each!  Should be 50. I carefully put it on the scale, typed in the produce code, and entered my quantity.  The machine prints a scannable sticker, which I stuck on the lemon.
At the self-checkout I scanned the lemon and touched “pay”.  While the machine thought about it, I got exact change from my wallet and began to feed in coins.  Around about 55 cents, I noticed the amount due was $4.03.  There was no cancel button.  At that point I looked at the lemon, and the sticker said “7” rather than “1”.   I think the produce machine must have a calculator style keypad, with 7 at the upper left, rather than a phone keypad with 1 at the upper left.
I think 1200 baud modem training is to blame.  In those days, you typed way ahead of the computer, and since you knew what it was going to do, there was no real need to actually look at the screen when it caught up.
At this point, there was nothing to do but press  the I Need Help button and look sheepish.
A nice girl with bright orange hair came over and I explained.  I think this was a new one.  She scanned her superuser card and after flipping through some screens said “I don’t think there is any way to change an order after you start paying…. But I can refund the money.”
[Side note: The machine refunded a different collection of coins that happened to add up to 55 cents, rather than returning my coins.  I suppose this lets you overload the change and the refund mechanism.]
After she left, I entered 1 lemon through the produce lookup screens, and again hit pay, and started putting my coins in.  This time, after a few coins, the machine said $5 something or other to go.  I had done it again!  Evidently the 7 virtual lemons were still on the tab, as well as the one real lemon.  I had to call for help again.
The same girl with the bright orange hair came over, and apologized to me, apparently for my being an idiot, and this time refunded the money, and deleted the 7 lemon line item, leaving only one lemon.  I successfully paid, and fled.
It is a mixed blessing that the store was essentially deserted.  No one was there to watch my performance, but neither was there any press of work to distract the staff from chuckling over the befuddled customer.
And I forgot to get the rain check for the frozen peas.

Aaron Swartz

Aaron Swartz, 26, committed suicide the other day, evidently hounded to his death by overzealous prosecutors.
I didn’t know Mr. Swartz, and I don’t condone his actions of a couple of years ago, where it is alleged that he attached equipment to the MIT computer network to steal academic articles from the JSTOR database in order to release them to the public.
However, the more I learn about the conduct of the government in prosecuting Mr. Swartz, the angrier I get.
For those lacking any context, go read what Larry Lessig had to say in

http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

or what Cory Doctorow had to say in

http://boingboing.net/2013/01/12/rip-aaron-swartz.html

Here is the letter I’ve sent to my Senator, Elizabeth Warren.  I’ve sent a similar letter to Sen. John Kerry.

I call to your attention the recent suicide of Aaron Swartz.  It looks
very much to me like the US Justice Department hounded him to his
death by overzealous prosecution of a victimless “crime” if it even was
a crime.

Larry Lessig writes on the case:
http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully
I would like to know what you are doing to hold the prosecutors and
their bosses at Justice to account for this affair.
I voted for you in part for your history of representing the issues
of ordinary people against big business.  Please also represent us
against the oppressive power of government.
-Larry Stewart
I’ve sent the following email to Rafael Reif, President of MIT

I understand that the Swartz affair started before you became president of MIT, but I think you should explain to the community what happened, why it happened, and exactly what principles MIT holds.

From what I’ve heard, MIT provided the pretext necessary for the US Attorney ****** to hound Aaron Swartz to his death.

 See, for example, Larry Lessig’s account at

http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

It may well be that Mr. Swartz was guilty of something, and it may be that MIT favored prosecution, but once MIT started such a ball rolling MIT became responsible in part for the damage it caused.  At the minimum, MIT had an obligation to track the case and to speak out loudly when it began to go off the rails of proportional justice in such a dramatic way.

-Larry Stewart ’76

(name removed because I am not sure I got it right)

I don’t know what the right answers are in this case, but I am beginning to think we should handle failures of justice in the same way we handle airplane crashes.  Do we need an equivalent of the National Transportation Safety Board to investigate?  Such a group could find out what happened, why it happened, and what legal, procedural, training, and technical measures are needed to keep it from happening again.  And their reports and proceedings should be open.
We now have so many laws and crimes, and so many are ill-defined, that likely everybody is “guilty” of something.  When the full oppressive power of government can be brought to bear on anyone at the discretion of individuals or groups with their own agenda, then no one is safe.

 UPDATE

About an hour after I wrote to MIT President Reif, he wrote to the community.  Obviously he’s well ahead of me on this one, since his message must have already been in progress.   Professor Hal Abelson will be leading a thorough analysis of MIT’s involvement.  I await the report with interest.
http://web.mit.edu/newsoffice/2013/letter-on-death-of-aaron-swartz.html

Another thing not to do

At the day job, I’ve been writing a new version of nbd-client.  Instead of handing an open tcp socket to the kernel, it hands the kernel one end of a unix domain socket and keeps the other end for itself.  This creates a block device where the data is managed by a user mode program on the same system.
In regular nbd-client, the last thing the program does is call ioctl(fd, NBD_DO_IT), which doesn’t return.  The thread is used by the device driver to read and write the socket without blocking other activity in the block layer.
Because I need the program around to do other work, I called pthread_create to make a thread to call the ioctl.
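The socket arrangement described above can be sketched roughly as follows. This is only an illustration and not the actual nbd-userland.c: the helper name make_nbd_socketpair is made up, and the real program would go on to hand kernel_end to the nbd driver (with the NBD_SET_SOCK ioctl from linux/nbd.h) before some thread calls NBD_DO_IT.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a connected pair of unix domain stream
 * sockets.  One end is meant for the nbd driver, the other is kept by
 * the user mode program that serves the block requests.
 * Returns 0 on success, -1 on failure.
 */
int make_nbd_socketpair(int *kernel_end, int *user_end)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    *kernel_end = sv[0];  /* would be passed to ioctl(nbd_fd, NBD_SET_SOCK, ...) */
    *user_end = sv[1];    /* the program reads requests and writes replies here */
    return 0;
}
```

Anything written to one end shows up on the other, so the user mode program sees the driver's requests as an ordinary byte stream.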
Then I ran my program under gdb (as root!).
In another window, I typed dd if=/dev/nbd0 bs=4096 count=1
In the gdb window I saw
nbd-userland.c:525: server_read_fn: Assertion `0' failed.
and my dd hung, and the gdb hung, and neither could be killed by ^C
I was able to get control back by using the usual big hammer, kill -9 <gdb>
So what happened?  My user mode thread hit an assertion, and gave control to gdb, which tried to halt the other threads in the process, which didn’t work because the thread in the middle of the ioctl was in the middle of something uninterruptible, and the gdb thread trying to do this also became uninterruptible while waiting.
It is going to be hard to debug this program like this.
The fix, however, is fairly clear:  use fork(2) instead of pthread_create() to create a separate process to call ioctl. It will be isolated from the part of the program hitting the assertion.
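A minimal sketch of that idea, under some assumptions: the names spawn_blocking_call, reap_child, and fake_do_it are invented for illustration, and fake_do_it merely stands in for the real ioctl(fd, NBD_DO_IT), which needs root and an nbd device. The point is only that the blocking call lives in its own process, so stopping the parent does not require stopping the task stuck in the kernel.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the blocking call; the real one would be ioctl(fd, NBD_DO_IT). */
int fake_do_it(void *arg)
{
    return *(int *)arg;
}

/*
 * Hypothetical helper: run blocking_call(arg) in a forked child.
 * Returns the child's pid in the parent, or -1 if fork failed.
 */
pid_t spawn_blocking_call(int (*blocking_call)(void *), void *arg)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: it inherits open descriptors, so an fd passed via arg still works. */
        _exit(blocking_call(arg) == 0 ? 0 : 1);
    }
    return pid;
}

/* Wait for the child; returns its exit status, or -1 on error or abnormal exit. */
int reap_child(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With gdb's default settings (follow-fork-mode parent, detach-on-fork on), the debugger stays with the parent and lets the child run free, which is the isolation wanted here.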
Older and wiser,
Larry
By the way, when you are trying to figure out where processes are stuck, look at the “wchan” field of ps axl.  It will be a kernel symbol that will give you a clue about what the thread is waiting for.
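On Linux the same information is also exposed per process in /proc/<pid>/wchan, so a program can look it up directly. The sketch below assumes a Linux /proc filesystem; the helper name read_wchan is made up, and on some kernels the file may just contain "0" or be restricted to privileged users.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: copy the contents of /proc/<pid>/wchan into buf.
 * Returns 0 on success, -1 if the file cannot be opened or buf is empty.
 */
int read_wchan(pid_t pid, char *buf, size_t len)
{
    char path[64];
    FILE *f;
    size_t n;

    if (len == 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/wchan", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    n = fread(buf, 1, len - 1, f);   /* the file holds a short symbol name */
    fclose(f);
    buf[n] = '\0';
    return 0;
}
```

For individual threads, the corresponding file is /proc/<pid>/task/<tid>/wchan.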
UPDATE
Experience is what lets you recognize a mistake when you make it again.
The underlying bug was sending too much data on the wire.  Like this:
struct network_request_header {
uint64_t offset;
uint32_t size;
};
write(fd, net_request, sizeof(struct network_request_header));
Well, no.  sizeof(struct network_request_header) turns out to be 16, rather than, say, 12.  If you think about it, this makes perfect sense, because otherwise an array of these things would have unaligned uint64_t’s every other time.  You can’t do network I/O this way, especially if the program on the other end uses a different language or different compiler.
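One way to see the padding for yourself is to compare sizeof with the offsets of the fields. This is a small sketch; the helper name header_padding and the WIRE_BYTES constant are made up for illustration, and the exact amount of padding depends on the compiler and ABI (4 bytes on typical 64-bit systems, possibly 0 on some 32-bit ones).

```c
#include <stddef.h>
#include <stdint.h>

struct network_request_header {
    uint64_t offset;
    uint32_t size;
};

/* Bytes of data actually meant for the wire: 8 + 4 = 12. */
enum { WIRE_BYTES = sizeof(uint64_t) + sizeof(uint32_t) };

/* Hypothetical helper: number of padding bytes the compiler added after 'size'. */
size_t header_padding(void)
{
    return sizeof(struct network_request_header)
         - (offsetof(struct network_request_header, size) + sizeof(uint32_t));
}
```

A C11 _Static_assert comparing sizeof(struct network_request_header) with WIRE_BYTES would turn this surprise into a compile-time error instead of a runtime mystery.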
gcc, it turns out, has a feature:  __attribute__((packed)) that makes this work, but it is not portable to other compilers.
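A portable alternative is to skip the struct on the wire entirely and write the fields into a byte buffer one at a time. The sketch below is an assumption about how that might look, not the code from the actual program: the function names and WIRE_HEADER_LEN are invented, and big-endian (network) byte order is chosen here only because it is the usual convention.

```c
#include <stddef.h>
#include <stdint.h>

#define WIRE_HEADER_LEN 12   /* 8 bytes of offset + 4 bytes of size, no padding */

/* Hypothetical encoder: lay the fields out byte by byte, most significant first. */
void encode_request_header(uint8_t out[WIRE_HEADER_LEN],
                           uint64_t offset, uint32_t size)
{
    int i;

    for (i = 0; i < 8; i++)
        out[i] = (uint8_t)(offset >> (56 - 8 * i));
    for (i = 0; i < 4; i++)
        out[8 + i] = (uint8_t)(size >> (24 - 8 * i));
}

/* Hypothetical decoder: the inverse of encode_request_header. */
void decode_request_header(const uint8_t in[WIRE_HEADER_LEN],
                           uint64_t *offset, uint32_t *size)
{
    int i;

    *offset = 0;
    *size = 0;
    for (i = 0; i < 8; i++)
        *offset = (*offset << 8) | in[i];
    for (i = 0; i < 4; i++)
        *size = (*size << 8) | in[8 + i];
}
```

The sender would then call write(fd, wire, WIRE_HEADER_LEN) on the encoded buffer, and the result is the same 12 bytes regardless of compiler, alignment rules, or host byte order.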

Home Networking Troubleshooting

Sometimes a technological scramble is triggered by the most mundane events.  In this case, the season finale of “X Factor”.
Last night, there was a special church choir rehearsal for the Christmas Eve services, and all seven of Win’s and my kids went.  Since the rehearsal would overlap the broadcast finale of X Factor, Erica asked Win to record it.  Maybe the appearance of 1 Direction had something to do with it as well.
We used to have Replay TVs to solve things like this, and cable TV to deliver the bits, but the conversion to digital TV and the crazy anti-customer behavior of Comcast has changed all that.  We don’t get cable, and the TV is hooked up to an antenna.  We’ve also got a Silicon Dust HDHomeRun network tuner connected to the antenna on my front porch, so we can watch TV on any computer as well.  Win has the copy of EyeTV that came with the HDHomeRun, and he planned to record the show.
About an hour before air time, he called to ask me about video artifacts and bad audio.   I said I’d take a look.
I used hdhomerunner (a now lost Google Code project to develop an open source HDHomeRun control program) and directed the video to VLC running on my Macbook Pro.  Indeed, the video was blocky and the audio spotty.
I power cycled the HDHomeRun, replaced the ethernet cable, and plugged it into a different switch port on the 16-port gigE switch.  No change.  I looked for firmware upgrades, and found the device running 4-year old firmware.  The upgrade went smoothly, but there was no change in video quality.
After sitting and swiveling back and forth for a while, I went back downstairs and plugged the device into the 100 Mbps switch instead of the 1000 Mbps switch.  I had some vague memory that the negotiation doesn’t always work right.  This fixed the problem and I was able to watch good video and audio with VLC.
Win called back to report his video was still breaking up.  This suggested some other networking problem between the houses.
Background.  Win and I are neighbors, and we have a conduit between the houses with a couple of outdoor rated Cat 5 cables and a 6-fiber multimode fiber.  One pair of fibers is connected to 1000base-SX media converters at the two ends and plugged into the house gigE switches.
I remembered once setting up netperf on the home servers, and indeed it was still installed.  Win’s house to mine reported 918 Mbps, but mine to Win’s reported 16! At this point, there wasn’t much time to debug the networking, and X Factor was about to start.
I remembered that VLC can record an  input video stream, and set that up to record the program on my Macbook.  (I had 45 GB free on disk, and the program was running at 2 Megabytes/second, so it would take 14 GB for the two hours.  No doubt there is a way to transcode, but not enough time to learn how to do it!)
The VLC recording froze once, at about the one hour point, but I only missed a couple of minutes.  I copied the files to an external USB drive for sneakernet delivery.
This morning, Win and I started taking a look at the networking.  First, we got netperf running on our respective Macbook and iMacs, in order to figure out if the link was bad or one of the home servers.  I was able to talk both ways to my server at about 600 Mbps, and Win to his at about 95 Mbps.  Win’s results are explained by a fast Ethernet hop somewhere, but all these rates are way above the pitiful 16 Mbps across the fiber.
Next Win wiggled his connectors, dropping the path to about 6 Mbps.  We swapped the transmit and receive fibers at both ends, and the direction of the problem did not change.  It was looking more and more like a bad media converter.
I was staring at the wiring in my basement, wondering if we could use the copper link as backup while waiting for parts.  It never worked very well, but we did use it to cross connect our DMZs before the firewalls at one point.  I found the cable, and found it plugged into the ethernet switch on the back of my FIOS router – with LINK active!  Huh?  What was it plugged into at Win’s end?  He reported it plugged into a small switch, but that it wasn’t easy to tell what else was plugged in.
As an experiment, we unplugged the copper link and … Win lost Internet access.  Evidently (a) his routes were set to use the Serissa business FIOS rather than his home Comcast, and (b) the traffic was going over this moldy waterlogged Cat 5 instead of our supposedly shiny gigabit fiber.  Now the gears are turning.  If we did have a loop in the switch topology, then it was entirely possible that one direction between the houses would use the fiber while the other direction would use the copper.  I don’t know much about how these cheap switches figure out things like that.  We tried unplugging the fiber, forcing all traffic onto copper, but the netperf results were much worse.  ping seemed to work, and ping -s 1000 gave fairly good results, but ping -s 1500 had a lot of trouble.  That would explain why, generally, ping and ssh seemed to work but netperf gave bad results.
We unplugged the copper and plugged the fiber back in, and after a few seconds, the asymmetrical performance resumed.  I’ve placed an order for another media converter, and we’ll see if that fixes it.  At least they now cost half as much as when we got the first pair!
So, there was a lot going on here.
The hdhomerun was plugged into a gigabit switch, and working poorly.  Changing to fast Ethernet fixed that.
The topology loop was routing off-site traffic over a poor copper link, but it was working well enough that we didn’t notice.
The media converter is probably bad, working well in one direction but not the other, and probably that explains the poor video quality.
And Erica gets to watch 1 Direction.
How are just plain folks supposed to figure this stuff out?
UPDATE
The new media converter arrived… and didn’t fix the problem.  Well, we have a spare now!  The actual problem was a bad 8-port switch in Win’s basement, which we belatedly figured out once we had ruled out the fiber.  We could have tested the link standalone by plugging computers into both ends, but we didn’t think of it.  Does gigE need crossover cables to do that? Or does the magic echo cancellation make crossover cables unnecessary?
 

The Fault in our Stars

The Fault in our Stars is the new book by John Green.  You will likely find it in the children’s section of your local library, as it is usually filed under Youth.
I think you should go read it.  Buy a copy, or check it out of the library, or borrow one from a friend, or see if your middle school English teacher has a copy.
Actually I think you should go read all of John Green’s books.  Looking for Alaska, Paper Towns, An Abundance of Katherines, Will Grayson, Will Grayson, and The Fault in our Stars.  I like them all.
The Fault in our Stars is a love story, a search for meaning. It’s about the survivors, at least temporarily, of cancer, and about those whose survival is even more temporary. Is the purpose of our existence to do things? Or is the purpose of our consciousness to pay attention to what is around us?  All of our times are limited, but there are still an infinite number of moments in each day, to be used as best we can.
PS.  If you don’t believe me, (and who would?) go read some of the reviews on Amazon.
 

Comcast DigitalNow

Comcast has not achieved even my low expectations.
We get “Limited Basic” cable, which means we get the local broadcast channels plus a couple of things like New England Cable News. After the conversion to broadcast Digital, Comcast was sending these channels in analog (SD) format, but also sending QAM unencrypted digital versions of them.  The local broadcast HD signals were also being delivered.  Since we have a modern TV with NTSC, QAM, and ATSC tuners, life was fairly pleasant.  Our old Replay TV works with the analog channels, and we can watch the digital or HD digital versions live.
Now, Comcast is completing their “digital conversion” and taking away the analog signals.  This is fine.  They are providing “Digital Transport Adapters” so you can convert the digital versions (SD) to analog on channel 3, for analog sets or our old ReplayTV.
However, they are (evidently) also taking away, or encrypting, the local broadcast HD channels!  The only way to continue to receive them is to get a Digital Set Top Box, for which they will charge an extra $10/month.
Hello!  The HD signals are broadcast and free.  The real effect of this will be for us to drop Comcast altogether in favor of over the air HD.  Our ReplayTV will stop working, but I have a nice HDHomeRun digital tuner, and I can assemble a (free) MythTV box to record things.  I had this working already, but converted that box for the kids to do gaming.  I can make another one.  And it will work for the internet streaming TV as well: Netflix, crackle, hulu, youtube, etc.
Comcast’s only value to me was convenience, and they are making it rather less convenient and rather more expensive.
The right thing to do is just to transcode the local broadcast signals from ATSC to QAM, and to leave them alone. No DTA, no extra box, no extra remote.

Elegy for a boot

I had not been skiing since 1990 or so.  When I moved to New England in 1989, I thought that skiing here was just too darn icy and cold, compared to Lake Tahoe.  I put my ski boots and other gear in a box in the basement.  This year, my 12 year old son announced he wanted to try snowboarding, so during winter vacation week we went to Nashoba Valley here in Massachusetts.  It was what we used to call “spring skiing” conditions.  Alex had a good time and might want to go again, but here’s what happened when I hit a bump on my first intermediate run.
It looks to me like the plastic just got brittle and shattered. Now I am kind of bemused.  What happened really?  I know that plastics get brittle when exposed to ultraviolet light, but that is not the case here.  Perhaps this is an example of the plasticizers evaporating over many years.
Anyway, farewell boots!  You served me well, at Squaw Valley, Alpine Meadows, Kirkwood, Soda Springs, Heavenly Valley, Boreal, and farther afield at Mammoth Mountain, Snowbird, and Sun Valley.  Even Waterville Valley was no match for you, but little Nashoba was too tough.

Broken ski boots
My 1985 Ski Boots

SOPA and ProtectIP Followup

 
I wrote to both my senators, Kerry (D) and Brown (R) about SOPA and ProtectIP.  I sent substantially the same letter to both:

I urge you to vote against SOPA/ProtectIP.

This pernicious legislation would give the government the power to shut down websites and internet domains with no evidence, no due process, and no redress, essentially at the behest of private interests.

Even without this new legislation, the government is <already> seizing domains without due process. In a recent example, a domain was seized and not returned for a year, in violation of numerous “policies” without any opportunity for the people whose property was seized to confront their accusers or even learn the charges. In the end it turned out there was no evidence at all.

SOPA and ProtectIP will make the current intolerable overreach of the US Government with respect to the internet immeasurably worse.
-Lawrence Stewart, PhD
Software Engineer

I sent my senators an email.  Others sent cash.  According to http://sopatrack.com/state/massachusetts, Sen. Kerry received $358,270 from pro-PIPA groups and $403,422 from anti-PIPA groups (plus $4,485,003 from big media generally), and Sen. Brown received $473,745 from pro-PIPA groups and $152,173 from anti-PIPA groups.  It’s hard to draw any conclusion from the money flow except that Kerry is more senior.
I have now received answers from my senators.  Here they are:
From Senator John Kerry <senator@kerry.senate.gov>

Dear Dr. Stewart:

Thank you for your letter regarding the Preventing Real Online Threats to Economic Creativity and Theft of Intellectual Property Act (PROTECT IP Act).  I appreciate hearing from you on this important issue.

I have long championed the cause of innovation and an open Internet.  Firms operating on and off the Internet strongly rely on intellectual property laws to help protect their investments and ensure a just return for their goods and services.  Online piracy and copyright infringement hurts our economy and costs American businesses more than 200 billion dollars a year.  Many infringers operate from foreign countries in order to avoid US law enforcement.  As a result, under current law, American authorities are limited in what they can do to bring these rogue sites to justice.

As you know, the PROTECT IP Act was intended to protect American businesses from intellectual property theft on foreign websites.  Among other things, the bill would provide the Attorney General with the authority to seek a court injunction against a foreign website that engages in copyright infringement.  The court could also require U.S. websites to block access to websites found to be dedicated to infringing activities.  For example, search engines could be required to disable links to the website that is found to be violating copyright of a US company.

However, there are a number of serious and legitimate concerns regarding the scope of the legislation, as well as the potential for abuse, censorship, or other unintended consequences.   The authors recognize the legislation still needs work and I will oppose any proposal that would fundamentally undermine or impede the ability of people to communicate, compete, and innovate using the Internet.

I am pleased that Majority Leader Reid has indefinitely postponed Senate consideration of the PROTECT IP Act, and I will continue to review and work to improve legislation to both protect the intellectual property of American businesses and to ensure the web remains free and open.  As I consider proposals to address these issues, I will keep your views in mind.

Thank you again for contacting me on this topic.  Please don’t hesitate to reach me again on this or any other issue in the future.

From Senator Scott P. Brown <sbrown@scottbrown.senate.gov>

Dear Dr. Stewart,

     Thank you for contacting me regarding the Preventing Real Online Threats to Economic Creativity and Theft of Intellectual Property (PROTECT IP) Act (S. 968).  I am strongly opposed to this legislation.

     As you know, Senator Patrick Leahy (D-VT) introduced S. 968 on May 12, 2011.  The PROTECT IP Act aims to provide law enforcement with tools to stop websites dedicated to online piracy and the sale of counterfeit goods.  However, many Americans feared that S. 968 would stifle freedom of expression and harm the Internet.

     The Internet has been a source of dynamic growth in our economy and is responsible for employing many people in Massachusetts.  I have very serious concerns about increased government interference in this area and the effect of the PROTECT IP Act and the Stop Online Piracy Act (H.R. 3261, House companion legislation) on the Internet.  On January 18, 2012, I announced my opposition to the PROTECT IP Act.  You will be pleased to know that with opposition to the bill mounting, on January 20, 2012, the Senate Majority Leader announced that the scheduled vote on the PROTECT IP Act has been indefinitely postponed.

     Again, thank you for sharing your views with me.  As always, I value your input and appreciate hearing from you.  Should you have any additional questions or comments, please feel free to contact me or visit my website at www.scottbrown.senate.gov.

 
Well.  The letter from Sen. Brown is completely straightforward.  Internet Good, PIPA Bad.  The letter from Sen. Kerry is quite a piece of mealy-mouth apology for the entertainment industry. However, Sen. Kerry is willing to admit that PIPA “needs work”.
I kind of think the right thing for Massachusetts might be Elizabeth Warren and Scott Brown.  Too bad Sen. Kerry is not up for reelection.
 

A Debugging Story

I’ve been working on fos at MIT CSAIL in recent months. fos is a factored operating system, in which the parts of the OS communicate by sending messages to each other, rather than by communicating by shared memory with locks and traps and so forth.  The idea of fos is to make an OS for manycore chips that is more scalable than existing systems.  It also permits system services to be elastic – to grow and shrink with demand, and it permits the OS to span more than one box, if you want.
The fos messaging system has several implementations.  When you haven’t sent a message to a particular remote mailbox, you send it to the microkernel, which delivers it.  If you keep on sending messages to the same place, then the system creates a shared page between the source and destination address spaces and messages can flow in user mode, which is faster.  Messages that cross machine boundaries are handled by TCP/IP between proxy servers on each end.
I’ve been making the messaging system a bit more object oriented, so that in particular you can have multiple implementations of the user space shared page message transport, with different properties.  After I got this to pass the regression tests, I checked it in and went on to other stuff.
Charles Gruenwald, one of the grad students, started using my code in the network stack, as part of a project to eliminate multiple copies of messages.  (I added iovec support, which makes it easier to prepend headers to messages), and his tests were hanging.  Charles was kind enough to give me a repeatable test case, so I was able to find two bugs.  (And yes, I need to fix the regression tests so that they would have found these!)
Fine.
Next, Chris Johnson, another one of the grad students, picked up changes from Charles (and me) and his test program for memcached started to break.
All the above is just the setup.  Chris and I spent about two days tracking this down…
Memcached is a multithreaded application that listens for IP connections, stores data, and gives it back later.  It is used by some large scale websites like facebook.com to cache results that would be expensive to recompute.
When a client sends a data object to memcached for storage, memcached replies on the TCP connection with “STORED\r\n”.  On occasion, this 8 character message would get back to Chris’s client as “”, namely all binary 0’s.  Since the git commits between working and not working were associated with my messaging code and the new iovec support, it seemed pretty likely that the problem was there.  However, the problem occurred with <both> the new implementations of shared page messaging, so it couldn’t really be anything unique to one or the other. That left changes in the common code or in the iovec machinery.
Now fos is a research OS, and is somewhat lacking in modern conveniences, such as a debugger, even for library code in user mode.  However, we have printf, and all the sources.
First, we added…  When I say “we” I really mean Chris, because he is a vi/cscope user, and I am emacs/etags.  I think he types faster too.
First we added a strncmp(“STORED”…) inside the message libraries to locate the case. When the string matched, we set a new global variable to indicate a case of interest. We couldn’t add printf to all the messaging code because it is used all over the place, by many system services. There would be too much output and general flakiness.  Now, with the new global, we could effectively trace down into the messaging libraries, watching the “STORED” go by and printing if it disappeared…. which it did.
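As a rough illustration of this trick, here is a minimal sketch in C. The names trace_stored, trace_arm, and trace_lost are made up for the example and are not the fos code. The idea is simply a global flag armed by a string match, so that deeper layers stay quiet unless the message of interest is in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical global: true once a "STORED" payload has been seen. */
bool trace_stored = false;

/* Call on entry to the messaging library: arm tracing if the payload
 * starts with "STORED". */
void trace_arm(const char *buf, size_t len)
{
    assert(buf != NULL);
    if (len >= 6 && strncmp(buf, "STORED", 6) == 0)
        trace_stored = true;
}

/* Call deeper in the library: returns true only if tracing is armed
 * and the payload no longer starts with "STORED" (it "disappeared"). */
bool trace_lost(const char *buf, size_t len)
{
    assert(buf != NULL);
    if (!trace_stored)
        return false;
    return len < 6 || strncmp(buf, "STORED", 6) != 0;
}
```

A caller would wrap its printf in if (trace_lost(buf, len)), which keeps unrelated system services from flooding the console.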
However, we got lots of disappearance messages, many due to other messages being sent. Since we also suspected the iovec machinery, we added printfs to print the number and sizes of the iovecs, and their contents.  One of the places we came across was in the fos dispatch library, which is an rpc mechanism that prepends a header on an existing message. The iovec form of this does something like
struct iovec new_iovec[in_iovcnt + 1];
to allocate a variable length array on the stack. Now this is a feature added to the C language as part of ISO C99, and supported in GCC in C90 or C99 mode, but it makes me nervous.  Just in case, we changed the declaration to
struct iovec new_iovec[10];
but it made no difference.
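For readers who have not used iovecs, the header-prepend step can be sketched roughly as follows. This is an illustration, not the fos dispatch library: prepend_header and MAX_IOV are hypothetical names, and the fixed cap stands in for the variable length array discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical fixed cap, used instead of a C99 variable length array. */
#define MAX_IOV 10

/* Build out[] = { header, in[0], ..., in[in_cnt - 1] } without copying
 * any payload bytes. Returns the new element count, or -1 if the
 * result would not fit in MAX_IOV entries. */
int prepend_header(struct iovec *out, void *hdr, size_t hdr_len,
                   const struct iovec *in, int in_cnt)
{
    assert(out != NULL && in != NULL);
    if (in_cnt < 0 || in_cnt + 1 > MAX_IOV)
        return -1;
    out[0].iov_base = hdr;
    out[0].iov_len = hdr_len;
    for (int i = 0; i < in_cnt; i++)
        out[i + 1] = in[i];   /* copies the descriptor, not the data */
    return in_cnt + 1;
}
```

Only the small array of (pointer, length) descriptors is rebuilt; the message body stays where it is, which is the point of using iovecs to avoid extra copies.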
Eventually we found that the “STORED” was there on entry to a function called “sendData”, but had vanished before the sending.  And there were no references to the buffer in the interim.  This suggests that someone is using a pointer after freeing it, and the space has been reallocated to our data buffer, but then clobbered by someone else.  All there was separating the “STORED” from the “”  was a check of the fos name cache to see if the destination mailbox was still valid. More printfs established that the data vanished in exactly the case that the name cache entry had expired, requiring a fos message send to the name server to get a refreshed copy.
A search of the name server library revealed no obvious problem, but there was storage allocation in there, which might be relevant, if in fact the heap had gotten scrambled.
Overnight, I looked at all uses of malloc and free in the messaging library and they all seemed OK, but I thought this was an unlikely idea anyway because the failure happened with both implementations of shared page messaging.
This morning Chris and I had the idea of printing the region around the “STORED” to try and figure out if only our data was changed or if the change covered some larger area. This was difficult to tell, because the local region of memory was mostly 0’s already. There was an ASCII string, “suffix”, a little before our data that was also clobbered. We didn’t know what that was, but cscoping and grepping through the entire source tree located it as a name attached to a memcached data structure.  It came to be near the “STORED” because memcached did a strdup of a procedure argument, which malloc’d space for the string out of the same general area of the heap.  This clue meant that a larger region of the heap was being clobbered, but we still didn’t know how much.
One aspect, incidentally, of this whole affair was that the problem always happened at the same virtual address: 0x709080.  No idea why, but having a stable address makes it much easier to track.
Next, Chris added code to fill the 1024 bytes centered on 0x709080 with 0xFF, and printed what it looked like after the disappearance.  Now this is just gutsy.  We had no idea what data was there, or used by who, and we just overwrote it with the 0xFF pattern, hoping the system would survive long enough to print the “after” pattern.  In fact it crashed immediately, but by changing the size of the 0xFF region, we learned that the clobber affected exactly 136 bytes, all 0’s except the first, which was 0x20.
Well 136 is an odd size.  We grepped the whole code base, to look at any 136s, but did not find any.
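The fill-and-inspect trick is easy to reproduce on an ordinary buffer. The sketch below is a user-space illustration only; poison and clobber_extent are hypothetical names, and it cannot see a clobber that happens to write 0xFF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill a region with a recognizable pattern before the suspect code runs. */
void poison(unsigned char *region, size_t len)
{
    assert(region != NULL);
    memset(region, 0xFF, len);
}

/* After the suspect code runs, find the span that no longer holds 0xFF.
 * On a hit, *first is the first changed offset and *end is one past the
 * last changed offset. Returns the span length, or 0 if nothing changed. */
size_t clobber_extent(const unsigned char *region, size_t len,
                      size_t *first, size_t *end)
{
    size_t lo = len, hi = 0;
    for (size_t i = 0; i < len; i++) {
        if (region[i] != 0xFF) {
            if (lo == len)
                lo = i;
            hi = i + 1;
        }
    }
    if (lo == len)
        return 0;
    *first = lo;
    *end = hi;
    return hi - lo;
}
```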
Next, we wondered if the clobber might be made by someone calling memcpy or memset. Since the address was stable, we were able to add code to the memcpy library routine something like this:
if ((uintptr_t)ptr <= 0x709080 && (uintptr_t)ptr + size > 0x709080) printf(arguments to memcpy)
But we didn’t get any printfs <at all>… including our own initialization of that space.  We realized that gcc includes an “intrinsic” implementation of memcpy, which it will use when the actual arguments make it convenient, such as knowledge that the pointers are 8 byte aligned and the length is a constant, or the like.  Now it is possible to turn off the compiler intrinsic by using the -fno-builtin flag to the compiler, so we dug into the fos Makefiles to add this to CFLAGS.
Now we got printfs from memcpy, and a nearly immediate page fault caused by running out of stack space.  It turns out that some variants of printf call memcpy internally, and we had managed a recursive loop.  We also got way too much printout, because we had added printing to the library copy of memcpy, used by all applications and services. We got out of that by having the memcpy test code check the magic global variable to see if we were inside the code region of interest as well as a second magic variable set only in the memcached application.  We also added a call to print the return address of the caller of memcpy so we could identify who was making the call.
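Putting those pieces together, an instrumented copy routine might look roughly like the sketch below. It is a wrapper around the real memcpy rather than a patched library routine, all the names (traced_memcpy, watch_addr, trace_enabled, trace_hits, in_trace) are invented for the example, and it assumes GCC or Clang for __builtin_return_address. The in_trace flag shows the kind of guard you need once printf itself can land back in the instrumented routine. It is not thread safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical globals standing in for the "magic" variables in the text. */
uintptr_t watch_addr = 0;       /* address under suspicion */
bool trace_enabled = false;     /* set only in the region of interest */
unsigned long trace_hits = 0;   /* number of copies that covered watch_addr */
static bool in_trace = false;   /* recursion guard around the printing */

void *traced_memcpy(void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;

    assert(dst != NULL && src != NULL);
    /* The write covers [d, d + n); report it if watch_addr is inside. */
    if (trace_enabled && !in_trace && d <= watch_addr && watch_addr - d < n) {
        in_trace = true;
        trace_hits++;
        fprintf(stderr, "memcpy(%p, %p, %zu) called from %p\n",
                dst, (void *)src, n, __builtin_return_address(0));
        in_trace = false;
    }
    return memcpy(dst, src, n);
}
```

Note that the range test uses <= on the start address, so a copy that begins exactly at the watched byte is also reported.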
We didn’t find any useful memcpy calls, so we added the same logic to memset.
Widening the test for addresses to cover the entire page containing 0x709080, we found two 8 byte memset calls to the region right before 0x709080 but not including 0x709080.  These calls came from inside the libevent library used by memcached to dispatch work. libevent was preparing a call to select(2). The nearby code was realloc’ing the file descriptor bit masks and then using memset to zero them out before calling select. This seemed unrelated to our bug, since the memsets didn’t overlap our “STORED” buffer.
Now what?  This could be a storage allocator usage problem, with someone using heap storage after calling free on it, or it could be a buffer overflow problem, with someone writing off the end of an array, but these things are difficult to find.  We thought about replacing malloc with one that carefully checked for some error cases, by putting sentinels around allocated storage.  Even worse, the problem could be that the page of memory had become shared with some other address space, at entirely different virtual addresses.  After all, the suspect messaging code does things like that.
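For what it is worth, the sentinel idea can be sketched in a few dozen lines. This is not something we built; guarded_malloc, guarded_check, and guarded_free are hypothetical names, the guard size and fill byte are arbitrary, and the sketch ignores size overflow and only notices writes that land within 16 bytes of either end of a block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_BYTES 16     /* holds the user size; keeps 16 byte alignment */
#define GUARD_BYTES  16
#define GUARD_FILL   0xA5

/* Layout: [header: size][front guard][user data][back guard] */
void *guarded_malloc(size_t n)
{
    unsigned char *raw = malloc(HEADER_BYTES + 2 * GUARD_BYTES + n);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &n, sizeof n);
    memset(raw + HEADER_BYTES, GUARD_FILL, GUARD_BYTES);
    memset(raw + HEADER_BYTES + GUARD_BYTES + n, GUARD_FILL, GUARD_BYTES);
    return raw + HEADER_BYTES + GUARD_BYTES;
}

static bool guard_intact(const unsigned char *g)
{
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (g[i] != GUARD_FILL)
            return false;
    return true;
}

/* Returns true if neither guard band around p has been overwritten. */
bool guarded_check(const void *p)
{
    const unsigned char *user = p;
    size_t n;
    memcpy(&n, user - GUARD_BYTES - HEADER_BYTES, sizeof n);
    return guard_intact(user - GUARD_BYTES) && guard_intact(user + n);
}

void guarded_free(void *p)
{
    if (p == NULL)
        return;
    assert(guarded_check(p));   /* trap on a damaged guard band */
    free((unsigned char *)p - GUARD_BYTES - HEADER_BYTES);
}
```

Calling guarded_check at interesting points, not just at free time, narrows down when the damage happened.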
Someone said, “If we had a debugger, we could just use a watchpoint”.  A watchpoint is a way of saying “let me know when this memory location is changed”.  But we had no debugger.  I thought, well, these x86 processors we are using have hardware to support watchpoints, how does it work?
Some work with google and the Linux kernel cross reference website showed that gdb implements watchpoints by using the linux ptrace system call, which in turn, through some elaborate machinery, eventually sets some debug registers deep in the x86 processor chip.  At that point, once any program touches the watched location, the chip generates a debug interrupt, at which point the OS returns control to gdb, letting it explain to the user what happened.
Now we didn’t have gdb, and fos doesn’t have ptrace, and we’re not even running directly on x86 hardware, we’re running inside of a Xen virtual machine hosted by a linux OS, but how hard could it be?
We decided to implement support for hardware watchpoints in fos.
We added a new system call “set debug register”, with no security whatever.  The user program just does this new syscall, passing raw bit values for the debug register.  The microkernel takes the argument, and calls HYPERVISOR_set_debugreg(), which Xen thoughtfully supplies to do the heavy lifting.  We added a second system call to read back the register.
A careful reading of the fos interrupt handlers seemed to say that the debug interrupt, while not expected to be used, did have a default handler in place that would print the machine registers and then crash.
Now, we called this new function to set a hardware watchpoint to 0x709080, and another to turn on the watchpoint control register.  Nothing happened.  We read back the registers, and they seemed to be set to the right bit values, according to wikipedia (and the Intel x86_64 processor reference manual). Now this could happen because we got the code wrong, because Xen didn’t in fact implement this functionality, or who knows. So we added another call to memcpy to overwrite the “STORED” ourselves, and we got an immediate crash dump.
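For reference, the “raw bit values” for the control register (DR7) can be computed as in the sketch below. The watched address itself goes in one of DR0 through DR3 and should be aligned to the watched length. The function name dr7_bits and the enum names are made up here, and the choice of the local enable bit rather than the global one is an assumption, not a record of what we actually passed.

```c
#include <assert.h>
#include <stdint.h>

/* Encodings of the DR7 R/W (break condition) and LEN (length) fields. */
enum dr_rw  { DR_RW_EXEC = 0, DR_RW_WRITE = 1, DR_RW_IO = 2, DR_RW_READWRITE = 3 };
enum dr_len { DR_LEN_1 = 0, DR_LEN_2 = 1, DR_LEN_8 = 2, DR_LEN_4 = 3 };

/* DR7 bits that enable breakpoint slot 0..3 (the address lives in
 * DR0..DR3) with the given break condition and length. */
uint64_t dr7_bits(unsigned slot, enum dr_rw rw, enum dr_len len)
{
    uint64_t v = 0;

    assert(slot < 4);
    v |= 1ULL << (slot * 2);                  /* L<slot>: local enable */
    v |= (uint64_t)rw  << (16 + slot * 4);    /* R/W<slot> field */
    v |= (uint64_t)len << (18 + slot * 4);    /* LEN<slot> field */
    return v;
}
```

Under these assumptions, an 8 byte write watchpoint in slot 0 comes out as 0x90001.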
This meant that the mechanism was working, but it wasn’t finding the clobber.  That probably meant that whoever was doing the clobber was running on a different processor core, each of which has its own debug registers.
Now the right way to handle this is for the set_debugreg system call to send messages to all the other cores on the machine to set their debug registers, using inter-processor interrupts.  fos doesn’t have any IPI, and in fact has no way to communicate to different cores in the microkernel.  The only place that needs to do this is the scheduler, which works by locking and then enqueuing processes onto the scheduler data structures of other cores.  No help to us.
But, all cores are running timer interrupts!  So inside our “set debug register” system call, we copied the arguments into microkernel global variables, and set up an array of flags, one per core.  The system call set all the flags to “true”.  Now in the timer interrupt, every core would check the flag for itself, and if set, copy the values in the global to the core’s local debug registers, then clear the flag.
The system call would spin on the flags until they were all clear again, then return to user mode.  This is a really hacky way of having all cores load the 0x709080 into their debug registers at the right moment.
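The flag-per-core handshake can be modeled in plain C, as in the single-threaded sketch below. It is a simulation, not the fos microkernel code: NCORES, the function names, and the core_debugreg array (standing in for the real per-core hardware registers) are all assumptions, and a real kernel would also need the appropriate memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCORES     4   /* hypothetical core count */
#define NDEBUGREGS 8   /* DR0..DR7 */

static int pending_reg;                 /* which register to load */
static uint64_t pending_value;          /* value to load into it */
static volatile bool pending[NCORES];   /* one "please update" flag per core */

/* Simulated per-core debug registers. */
uint64_t core_debugreg[NCORES][NDEBUGREGS];

/* System call side: publish the request and raise every core's flag. */
void request_debugreg(int reg, uint64_t value)
{
    assert(reg >= 0 && reg < NDEBUGREGS);
    pending_reg = reg;
    pending_value = value;
    for (int c = 0; c < NCORES; c++)
        pending[c] = true;
}

/* Timer interrupt side: each core loads its own register, then clears
 * its own flag. */
void timer_tick(int core)
{
    assert(core >= 0 && core < NCORES);
    if (pending[core]) {
        core_debugreg[core][pending_reg] = pending_value;
        pending[core] = false;
    }
}

/* The system call spins until this returns true. */
bool request_done(void)
{
    for (int c = 0; c < NCORES; c++)
        if (pending[c])
            return false;
    return true;
}
```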
Now this was a little bit of a hail mary. The x86 debug registers work by virtual addresses, so if the clobber were happening because the page was shared, and shared with a different VA, then we would not catch it.
But we did!  We ran the test, which waited until the “STORED” was there, then set the debug registers for 0x709080, and proceeded.  We got a crash dump, and the return address on the exception stack was…libevent’s implementation of support for select(2), running in memcached, but in a different thread, on a different processor core than the thread sending “STORED”.
Now all we had was the program counter. We could identify the function by using “nm” to print the symbol table for the memcached executable, but getting to the source line of code is harder.  We found useful switches in objdump, -d -S, which print a disassembled listing of the binary executable code, interspersed with the source code, provided the file was compiled with the -g flag.  That took another spin through the fos Makefiles, which were using -g3, which is evidently some slightly different version of -g that is not compatible with objdump.  Now we were able to see the offending source line as…
FD_ZERO(readset);
or similar.  This is code that is zeroing the file descriptor bit vector about to be used in a call to select.  This was not found by our instrumentation of memset because FD_ZERO was still apparently using a compiler intrinsic, just a straight line set of movq instructions to zero 128 bytes, in the middle of which was our “STORED” buffer. I’m not sure if -fno-builtin didn’t work for this, or it was controlled by a different makefile for CFLAGS or what.
… FD_ZERO was zeroing 128 bytes of a buffer that had recently been allocated with only 8 bytes of memory.
Now here is another bit of unix/linux history, I think.  When select was first defined, I think by the BSD folks at UC Berkeley, the sizes of the file descriptor bitmaps were variable, and needed to be only large enough to hold the maximum number of file descriptors under consideration.  At some point, linux, blessed later by the POSIX standards committee, made the size of select descriptor arrays fixed, with a system specific constant.  In our case, the version of libevent we had was BSD derived, with variable size descriptor arrays, but calling into a select client that was POSIX derived, and expecting a (larger) fixed size.
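The size mismatch is easy to see in numbers. The sketch below compares the bytes a variable size bitmap needs with what FD_ZERO will clear; varmask_bytes and fd_zero_fits are hypothetical helpers, and the 64-bit mask word is an assumption (the real fd_mask width is platform specific).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>

/* Bytes a variable size bitmap needs to cover descriptors 0..maxfd,
 * assuming 64-bit mask words. */
size_t varmask_bytes(int maxfd)
{
    assert(maxfd >= 0);
    size_t bits = (size_t)maxfd + 1;
    size_t words = (bits + 63) / 64;
    return words * sizeof(uint64_t);
}

/* FD_ZERO always clears sizeof(fd_set) bytes, so it is only safe on a
 * buffer at least that large. */
int fd_zero_fits(size_t buf_bytes)
{
    return buf_bytes >= sizeof(fd_set);
}
```

With FD_SETSIZE at 1024, sizeof(fd_set) is 128 bytes, while a bitmap covering a handful of descriptors needs only 8, which matches the sizes in this story.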
Incidentally, the 136 byte clobber was also now explained: the select code was FD_ZEROing both the readfd and the writefd arrays, which were 8 bytes apart in memory, leading to two overlapping 128 byte clobbers adding to 136 bytes.
The fix to this bug was updating the libevent select client to use fixed size descriptor arrays.  This bug had nothing at all to do with the iovec or messaging code. We just happened to run into it there because of the chance coincidence of our messaging buffer containing “STORED” being allocated right after the select descriptor arrays that were too short.
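The arithmetic is just the union of two overlapping ranges: the distance between the two start addresses plus the length of one write. A tiny check, with a made-up helper name and made-up addresses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total bytes covered by two writes of the same length starting at
 * addresses a and b, provided the two writes overlap or touch. */
size_t clobber_union(uintptr_t a, uintptr_t b, size_t len)
{
    uintptr_t lo = a < b ? a : b;
    uintptr_t hi = a < b ? b : a;

    assert(hi - lo <= len);   /* otherwise the ranges are disjoint */
    return (size_t)(hi - lo) + len;
}
```

For two 128 byte writes that start 8 bytes apart, this gives 8 + 128 = 136.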
-Larry
 
Followup:  My colleague Matteo Frigo reports:

FD_ZERO is written in assembly (the most misguided "optimization" ever?):
[from <bits/select.h> on glibc/amd64:]
# define __FD_ZERO(fdsp)                                                    \
 do {                                                                       \
   int __d0, __d1;                                                          \
   __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS                        \
                         : "=c" (__d0), "=D" (__d1)                         \
                         : "a" (0), "0" (sizeof (fd_set)                    \
                                         / sizeof (__fd_mask)),             \
                           "1" (&__FDS_BITS (fdsp)[0])                      \
                         : "memory");                                       \
 } while (0)