enjoying salad since 1978.

Sunday, May 15, 2005

Aaron Swartz on losing his webserver's disk drive

I know this pain. I lost a CVS repository at Pyra back in 2002 this way. We were too cheap back then to use RAID on anything that wasn't BlogSpot or to do backups on anything that wasn't our SQL database. It's a good thing only two or three of us were actually working in it, because if it had been 10 then life would have really sucked. (Actually, if it had been 10 people, I would have taken care of it.) As it stood, we simply bought a new drive, copied the latest version from my Powerbook, merged the others, and were back in business.

Aww crap, we lost that drive a month later. I said "F*** this, we're going to Google in two months, no more source control until then." Since I was working on the backend and Ev/Sutter worked on the frontend, it didn't really matter.

The industry is great at throwing around sloppy numbers, like a modern SATA drive having a mean time between failures (MTBF) of a million hours. Wow, that sounds so impressive. A freaking million!

Actually, those numbers are calculated assuming the drive runs from 9-5. So, the number is more like 140,000 hours MTBF. Well, some sloppy math of my own tells me that if a datacenter has 50,000 disk drives in it, one will fail every 3 hours.
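A rough sketch of that sloppy math in Python, assuming every drive fails independently at a constant rate, so the fleet as a whole sees one failure every MTBF / N hours on average. The function name is made up for illustration, and the model ignores burn-in, wear-out, and clustered failures.

```python
def hours_between_failures(mtbf_hours: float, drive_count: int) -> float:
    """Average hours between failures across a whole fleet of drives.

    Assumption (not from any vendor spec): drives fail independently
    at a constant rate, so the fleet failure rate is drive_count / mtbf_hours.
    """
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return mtbf_hours / drive_count


# The post's numbers: 140,000 hours MTBF across 50,000 drives.
print(hours_between_failures(140_000, 50_000))    # 2.8

# Even taking the advertised million hours at face value:
print(hours_between_failures(1_000_000, 50_000))  # 20.0
```

So "one every 3 hours" is about right for the discounted figure, and even the marketing number still means roughly one dead drive a day at that scale.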

Sucks.

I do, though, back up my Subversion repository at home now. And my web data is on DreamHost, which has its own backup schedule.

I think this is another good lesson for people wanting to work at a short-handed, underfunded startup (or, as they might start calling this in the future, a Paul Graham (or Grahamian)-style Startup). You're always making these kinds of tradeoffs: I made a decision that the time it would take me to build a RAID 10 system plus put together a solid and well-tested backup strategy for a system that 3 people used would be better spent on the service that a helluva lot more people used (read: Blogger). I made the right choice but I still feel stupid about the second outage. The parameters of the choice change a lot when you have a ton of infrastructure in place to help out (read: Google and their wonderful ops people).

4 Comments:

Greg Stein said...

RAID under Linux is way easy. I set up a RAID 1 system (mirrored drives) in about 30 minutes. Given the cost of drives nowadays, simple mirroring is just fine. No need to shoot for the fancy RAID 5 (or other levels).

I did ensure each drive was on a separate IDE controller for max performance, but that was it. Used mdadm to create my drive, and I was off and running.

1:14 PM

 
Anonymous said...

"Actually, those numbers are calculated assuming the drive runs from 9-5. So, the number is more like 140,000 hours MTBF."

Well, if they were measuring mean-days-between-failures, perhaps. They are supposed to be measuring the hours the drive is in use, after burn-in and before failure, and don't (shouldn't?) account for work's 9-5 hours. MTBF numbers are statistical, which does mean that your overall argument is entirely correct -- MTBF is a completely bullshit number. My experience with a large colo install where we had row after row of NetApp cabinets actually better proves your point. We probably had about 20,000 drives and we lost about 7-10 a week. When it comes to MTBF, caveat emptor. The only people that believe those numbers are the guys in khakis. -tc

7:32 PM

 
Steve Jenson said...

Greg, thanks for the advice! I haven't tried setting up RAID on Linux in almost 5 years. Sounds like things have really improved.

Anonymous: it's easy for people at smaller startups to forget that statistics is rarely on your side when it comes to data loss. ;-)


Wow, I'm really bad about responding to comments. Bad blogger!

10:27 PM

 
John Wiseman said...

“some sloppy math of my own tells me that if a datacenter has 50,000 disk drives in it, one will fail every 3 hours.”

Wouldn't it be fun if all 50,000 drives failed at the same time? While that's not too likely, I suppose there probably would be some clustering.

“I've just picked up a fault in the AE-35 unit. It's going to go a hundred percent failure within 280,000 hours.”

11:32 AM

 
