Gone and Back

The server that hosts this website and around 20 other sites had a faulty hard disk, which caused a kernel panic last Wednesday. It came back on-line on Sunday with a new 120GB disk, a fresh Gentoo build, and a faster Pentium III 450MHz box, and has been running happily so far.

Here is the whole saga.

24 May 2004

I first spotted a DMA error in my kernel logs. For the next few weeks, the same error occurred on Sunday mornings whenever my backup cron job ran, but it did not seem to be anything major. The kernel message looked like this:

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=414088, sector=414024
end_request: I/O error, dev 03:01 (hda), sector 414024
vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [128299 128650 0x0 SD]

The drive was a 60GB Seagate Barracuda IV, and it was only 18 months old. I was lazy, so I did not take any action.

22 June 2004 12:56 AM

A different disk error message started appearing every 30 minutes or so. The error message this time:

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
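
With two different error codes now in the log, a quick tally per code makes the change easier to spot. This is only a sketch of how I could have checked: tally_dma_errors is a made-up helper name, and the log file path is whatever your syslog setup writes kernel messages to.

```shell
#!/bin/sh
# Sketch: count dma_intr errors per error code in a kernel log file.
# tally_dma_errors is a hypothetical helper, not an existing tool.
tally_dma_errors() {
    # Pull the hex code out of every "dma_intr: error=0x.." line,
    # then group identical codes and count them.
    sed -n 's/.*dma_intr: error=\(0x[0-9a-fA-F]*\).*/\1/p' "$1" | sort | uniq -c
}
```

Calling it as tally_dma_errors /var/log/messages (path assumed) prints one line per error code with its count in front.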

When I noticed it, it was already late in the evening on the 22nd, and the kernel was spitting out error messages once every few minutes. So I went, bugger. I ordered a new Western Digital WD800JB drive from EYO straight away. Fingers crossed.

23 June 2004 01:13 PM

This is the time stamp of the very last message I have in syslog, before the kernel panicked and hung. D'oh. Well, not a complete kernel panic. The server would respond to an ICMP ping, but would not let me log in or do anything.

23 June 2004 02:25 PM

I posted the first message on the FOCUSer.net off-site status log, which I created just over a month ago to report availability in case something nasty happened. And yeah, just over a month later there was actually a need for it.

I changed the DNS entries to point yang.id.au, focuser.net and focus-unsw.org to another IP address, and then set up some temporary pages stating the sites were off-line. I also got PowerDNS to set up email re-direction to my Gmail account.

23 June 2004 06:00 PM

Came back home to check on the server. Dug out a monitor and a keyboard from the garage, attached them to the server, and discovered the kernel panic. Alright. Pressed Reset and waited for grub to restart my old Mandrake 9. While decompressing the bzipped kernel, it hit a CRC error. Can't boot. :(

The rest of the evening was spent rescuing the disk. I booted with Knoppix 3.3, which came with APC magazine 3 months ago, and tried to mount the drive, but mount refused because the superblocks needed to be rebuilt. reiserfsck --rebuild-tree did not like it either because of the bad sectors, and running badblocks seemed to take forever...

24 June 2004 12:00 AM

I needed to build a new reiserfsck binary because the one that came with Knoppix 3.3 was too old to support the badblocks parameter. In the end I had to ssh to my work computer to do it, as there was no other Linux box at home...

Anyway, the new reiserfsck seemed to do the job fine, and rebuilt all the superblocks on the corrupted file system while successfully avoiding the bad sectors. Most files were recovered, but some were still lost. By the time I went to sleep it was 2 AM, but I was happy :)
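
For the record, the rescue boiled down to two steps: collect a bad block list, then hand it to reiserfsck. The sketch below only prints the commands rather than running them, because --rebuild-tree rewrites the file system. print_rescue_plan is a made-up helper name, the 4096 block size is an assumption (it has to match the file system's block size), and it assumes a reiserfsck new enough to have the --badblocks option.

```shell
#!/bin/sh
# Sketch: print (not run) the two rescue commands for a damaged reiserfs
# partition. print_rescue_plan is a hypothetical helper name.
print_rescue_plan() {
    dev=$1    # partition to rescue, e.g. /dev/hda1 (assumed)
    list=$2   # file that will hold the bad block list
    # Step 1: read-only scan, writing bad block numbers to the list file.
    # The 4096 block size is an assumption; it must match the file system.
    echo "badblocks -b 4096 -o $list $dev"
    # Step 2: rebuild the tree, telling reiserfsck which blocks to avoid.
    echo "reiserfsck --rebuild-tree --badblocks $list $dev"
}
```

Running print_rescue_plan /dev/hda1 /tmp/badblocks.txt shows both command lines so they can be checked before being typed in by hand.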

24 June 2004 08:00 AM

Woke up. Checked email (yes, Knoppix lets me run all kinds of services plus an ADSL connection via PPPoE, all without installing anything onto my hard disk) and realised that the drive I ordered was actually on back order. I quickly fired off an email to EYO to change my order to another (i.e. bigger) drive, and also complained that the 60GB 'Cuda I got from them 18 months ago was faulty.

The nice guys at EYO replied at around 10 AM, changed the order for me and dispatched the drive. They also gave me an RA number so I could return the sick 'Cuda under warranty.

The new drive arrived on my desk at 4 PM. That was quick :)

24 June 2004 07:15 PM

Had to drive to Berowra (10 minutes north of Hornsby) to pick up a baby capsule that I bought on eBay. By the time I arrived home, it was already 9:30 PM. Time to back up all my data first. I tar-gzipped around 30GB worth of data and copied it to the hard disks of 3 other computers at home. It was quite slow because the sick drive was working in PIO mode the whole time.

25 June 2004 12:30 AM

All files backed up! Time to change the hard disk. It turned out to be quite a challenge, as the server was sitting behind my desk, with the ADSL modem, Ethernet hub and wireless AP sitting on top of it. It took a while to untangle all the cabling and move my desk before I could get the server box out.

However, it appeared that the power supply of my old dual celery was also dying, which you could tell straight away from the fan noise. Instead of finding a replacement power supply, I just plugged the new disk into another box that I recently inherited - a Pentium III 450MHz on a BX chipset.

I got this box from PP when she left Australia on the previous Saturday. I was excited when I heard that it was a Pentium III - hey, any P3 would be faster than my dual 400MHz Celeron. But to my disappointment, it was actually the slowest of the P3s, had only 64MB of RAM, and get this - 800MB of hard disk! The hard disk I bought for my 100MHz Pentium in the summer between '95 and '96 also happened to be 800MB. This is an ancient piece of hardware!

Anyway, I moved my old 6x speed CD burner, the new hard disk and 416MB of RAM onto this new box, booted up with Knoppix 3.3, and everything seemed to be working. I then got on-line using ADSL, partitioned and formatted the drive, downloaded stage 1 from Gentoo's mirror site, unpacked the tarball, and started bootstrapping!

By the time I headed to sleep, it was already 4 AM. Totally exhausted.

25 June 2004 06:00 PM

Came home from work, and stage 1 seemed to have finished building. Started stage 2. Installing Gentoo is a pain, but I am looking forward to the reward.

26 June 2004

The whole Saturday was spent building. Stage 2 completed when I got back home on Friday night, and I started building services: Apache, PHP, MySQL, Squid, Samba, Postfix, etc. I was trying to get most things built before booting the new kernel.

I also figured out that PP had set the clock of her computer 1 month in the future! And when I used ntpdate to sync the computer clock in the middle of building, the whole compilation process stuffed up. Yeah, stuffed up big time. All my glibc header files and kernel header files had been installed with August 2004 timestamps, and the dependency checks killed my emerge process. D'oh. Not wanting to start over, I was forced to do a:

# find /usr/include -name '*.h' -exec touch '{}' ';'

And a similar thing to fix up the Perl modules. Yeah. Next time I rebuild someone's old box: check the clock first!
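
A narrower variant of the same trick would be to touch only the files that are actually stamped in the future, instead of every header. This is a sketch rather than what I ran: fix_future_mtimes is a made-up helper name, and it assumes the system clock has already been corrected.

```shell
#!/bin/sh
# Sketch: reset the modification time of files stamped later than "now".
# fix_future_mtimes is a hypothetical helper, not a Gentoo or Portage tool.
fix_future_mtimes() {
    dir=$1
    # A freshly created reference file carries the current (corrected) time.
    ref=$(mktemp) || return 1
    # -newer matches files modified after the reference file;
    # touch then resets each match to the current time.
    find "$dir" -type f -newer "$ref" -exec touch '{}' ';'
    rm -f "$ref"
}
```

Something like fix_future_mtimes /usr/include would then leave correctly dated files alone and only reset the ones from "August".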

27 June 2004

The new server booted in the morning, and I started getting my mail and the FOCUS mailing list re-configured. After church in the afternoon, Apache was configured and most FOCUSer.net sites were back.

The 450MHz Pentium III is noticeably faster than my old dual Celeron. Hopefully I'll be able to get good uptime on this one. Data has all been restored. Some files are missing, but I can easily pull them from my old backup, so in the end all I have lost is time and lots of sleep.

And fortunately this all happened before the baby is due (less than 3 weeks!). Otherwise the server would still be dead by now.