In the early days of our startup, bubblegum and duct tape seemed to be the order of the day as we struggled to keep things running on cheap-as-chips computers bought off eBay and a ragtag bunch of borrowed Dell Optiplexes.
Developer files sat on their individual machines, source code was scattered all over the place, and the concept of centralised document storage was a share called Common on one of the developer machines, into which everyone dumped their stuff.
A year into this rapidly escalating mess, I took matters into my own hands and pestered the boss for a £1500 budget to build a file server. A Supermicro SC-743 Cool & Quiet case coupled with a top-notch Xeon board, 8GB of RAM, an Intel quad-core CPU and a top-of-the-line 3ware 9690SA RAID card (with battery backup no less!) meant we were about to take our file server (the aforementioned developer’s machine) from a mewling kitten to a roaring tiger.
The whole thing was assembled beautifully and worked a treat, with a RAID 1 mirror for the Debian installation and 8x Seagate 7200.11 hard drives for the RAID 10 storage array.
In building this machine I made one and only one mistake. All of the drives were the same make and model and doubtless all manufactured at the same time.
Fast forward 12 months and, on coming into work one Monday morning, I saw a mail from the 3ware monitoring manager: ‘Drive 4 dropped out of array’. Not a problem, I thought; we had a monthly offsite backup in place. I hopped online and ordered a spare disk.
Later that afternoon I received another alert: ‘Drive 6 dropped out of array’.
‘Sh****t’ I (probably) exclaimed, realizing that if the second drive had dropped out of the same mirrored pair as the first, our array would have been toast. I quickly ordered two more drives.
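For the curious, here is the back-of-the-envelope maths behind that scare, as a minimal Python sketch. It assumes a plain RAID 10 layout built from two-disk mirrors, and that the second failure is equally likely to hit any surviving drive (an assumption a bad batch may well violate). The function name is my own invention for illustration, not anything from 3ware's tooling.

```python
from fractions import Fraction


def second_failure_fatal_chance(total_drives: int) -> Fraction:
    """Chance that a second random drive failure kills a RAID 10 array.

    Assumes two-disk mirrors and one drive already failed: the array
    only dies if the next failure is that drive's mirror partner.
    """
    if total_drives < 2 or total_drives % 2:
        raise ValueError("RAID 10 with two-disk mirrors needs an even drive count")
    # One specific partner out of the (total_drives - 1) surviving drives.
    return Fraction(1, total_drives - 1)


print(second_failure_fatal_chance(8))  # 1/7
```

So with 8 drives and one already gone, roughly one time in seven the next dropout would have taken the whole array with it, and that is before allowing for the fact that drives from the same batch tend to fail together.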
Making hasty backups and crossing my fingers, I awaited the new drives and, when they arrived the following day, stuck one in to replace the failed disk. A few hours after the array had rebuilt successfully, I saw another disk fail.
It was at this point that I got down on my knees and began to pray. (I’m just kidding; I did that the day before.)
On a hunch I removed and reinserted the failed drive. It initialized and rebuilt fine. A few hours later one of the new drives dropped out. Over the next few days I was barely playing catch-up, keeping the RAID array from failing entirely as drives dropped out once or twice a day and then initialized on reinsertion.
We were making daily backups by now, but this was our main file server and we were going through a pretty lean month, which meant we had zero budget to replace all the disks or get another box.
It was then that I exercised my Google-fu and hit the internet. Turned out Seagate had a bad batch of 7200.11 disks and had issued a firmware update.
The duty of taking the box offline after work and updating the firmware of all 11 drives fell on my shoulders. This ghastly process involved sticking all the disks, one at a time, into a desktop and running the firmware update on each one.
Since then the array has run like a champ. We kept it with the original 8 disks and 3 hot spares for good measure… it’s been 7 years and nary a complaint from 3ware’s management tool.
Fast forward to 2013 and our latest storage purchase was a lovely Synology 10-disk NAS. Quick and (very) quiet, it came populated by the manufacturer with ten 2TB Seagate disks (Enterprise models no less!). We loaded it up with our data and enjoyed the feel of the new shiny, flashing its pretty lights at us from the equipment rack.
Fast forward 12 months and you guessed it, a drive dropout. Then another, and another, followed by another. Over the course of 6 months we must have replaced more than half of those damned Seagate drives.
Moral of the story? Don’t buy Seagate.
Heheh, just kidding (maybe)… the moral of the story is not to buy the same brand and batch of hard disks when speccing your storage array. Since those early days of scraping by, we have gone on to build some pretty powerful RAID arrays for our customers, and we always try to use a 50/50 mix of different brands and batches.
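If you want to take the mixing one step further, you can also make sure the two halves of every mirror come from different batches. Here is a rough Python sketch of that idea. The `Drive` type, the batch labels and the serial numbers are all made up for illustration, and it assumes your controller lets you choose which drives are mirrored together.

```python
from typing import List, NamedTuple, Tuple


class Drive(NamedTuple):
    serial: str  # hypothetical serial number
    batch: str   # hypothetical brand/batch label


def pair_across_batches(drives: List[Drive]) -> List[Tuple[Drive, Drive]]:
    """Pair drives into two-disk mirrors, splitting batches across pairs.

    Sorting by batch and pairing drive i with drive i + n/2 gives every
    mirror two different batches, as long as no single batch makes up
    more than half of the drives.
    """
    if len(drives) % 2:
        raise ValueError("need an even number of drives to build mirrors")
    ordered = sorted(drives, key=lambda d: d.batch)
    half = len(ordered) // 2
    return list(zip(ordered[:half], ordered[half:]))


drives = [
    Drive("A1", "brand-a"), Drive("A2", "brand-a"),
    Drive("B1", "brand-b"), Drive("B2", "brand-b"),
]
for left, right in pair_across_batches(drives):
    print(left.serial, "<->", right.serial)
```

With a 50/50 mix, every mirror ends up with one drive from each batch, so a bad batch can take out half of each pair without taking out the array.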
(We also make a lot of backups!)