Quick background: I'm a sysadmin for a large web hosting company. We have 2 data centers, one in the main office, one in an old word processing company's facility. I work by myself most shifts in the latter, as a caretaker for about 500 servers. We're in the middle of a LOT of upgrades, software and hardware. </background>
We had a meeting the other day for all lower level admins and level 3 techs with the CEO. There's a WTF involved with the CEO that I'll have to leave for another day, as I'm still somewhat miffed about the issue, and can't write well when miffed. To allow the day shift junior admin to attend the meeting, a senior admin I don't particularly like and have griped about before (the one who likes to make a public example of everyone who does anything remotely wrong) took his place.
After the meeting, I met up with the senior admin here at work. Right as I get in, the server the CEO's website is on crashes, and doesn't come back up after a reboot.
Senior admin takes a look, determines that /var is totally kaput - an fsck starts throwing all files in the partition into the lost+found directory. This means the partition is just gone, with no hope of saving it.
His answer to this problem is "well...have fun with this backup restore. I'm going home." >_<
I spent 9 hours that day JUST working with this box. The issues lasted for another 12 hours, which the grave shift got to finish handling.
Another box just up and died in the same manner. I'm in the process of trying to repair the damage now. Figures it goes tits up right as I'm walking out the door for a caffeine run. :-(
The cause for this? We just migrated /var**** and a lot of other partitions over to ext3* from reiserfs** (affectionately called murderfs in some circles). In the migration, the fstab*** was not updated with the options that should be there to reduce filesystem damage over time. This means we have about 800 boxes across our data centers that are ticking time bombs, any of which may or may not come back from a reboot.
Geek to english key:
*: ext3 is the default filesystem in Linux. It's a great filesystem, and handles files well. It's older, and is showing its age. There are newer and better ones out there, but ext3 is the most stable and reliable, and therefore the default.
**: reiser was developed by Hans Reiser, recently convicted of murdering his wife, thus murderfs. It has a lot of neat features, and is best suited for small filesystems, but in practice tends to go timebomb in a large environment and just eat all your data. For this reason, we're getting off reiser.
***: /etc/fstab is the file in Linux where all filesystems are recorded. You can mount a volume without it being in fstab, but for filesystems that must be mounted at boot, like in a server environment, this is where everything goes.
****: /var is a system partition, and is where log files, temporary files, and a few other generally dynamic items go. As a side note, mysql databases go to /var/lib/mysql. This is its own drive in most cases, fortunately, and would be unaffected by /var going kablooie.
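For the curious, here's roughly how you could go hunting for the ticking time bombs: read each box's fstab and flag every ext3 entry that is missing the options you expect. This is only a sketch, not what we actually run. The function names, the sample fstab, and especially the `errors=remount-ro` option in the example are placeholders I picked for illustration; swap in whatever options your migration was supposed to set.

```python
def parse_fstab(text):
    """Yield (device, mountpoint, fstype, options) for each fstab entry."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # A usable entry needs at least device, mountpoint, type, options.
        if len(fields) < 4:
            continue
        device, mountpoint, fstype, opts = fields[:4]
        yield device, mountpoint, fstype, set(opts.split(","))


def audit_fstab(text, fstype="ext3", required=frozenset()):
    """Return [(mountpoint, [missing options])] for entries of the given type."""
    problems = []
    for _device, mountpoint, fs, opts in parse_fstab(text):
        if fs != fstype:
            continue
        missing = set(required) - opts
        if missing:
            problems.append((mountpoint, sorted(missing)))
    return problems


# Hypothetical fstab contents, made up for this example.
SAMPLE = """\
# device   mount  type      options                     dump pass
/dev/sda1  /      ext3      defaults,errors=remount-ro  0 1
/dev/sda2  /var   ext3      defaults                    0 2
/dev/sda3  /home  reiserfs  defaults                    0 2
"""

if __name__ == "__main__":
    # The required option here is a placeholder, not the real fix list.
    for mountpoint, missing in audit_fstab(SAMPLE, required={"errors=remount-ro"}):
        print(f"{mountpoint}: missing {', '.join(missing)}")
```

On a real box you'd feed it the contents of /etc/fstab instead of the sample string, and loop over every server.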