As promised, the story of the server.
Our server is a Dell PowerEdge 2900 in a tower case. Whoever spec'd this machine out was serious about its availability: it has two power supplies, three external modems, six hard drives in a RAID-5 configuration, and so on. Everything is hot-swappable except the CPU itself, and there's room for a second one: with two installed, we could swap even that if it became necessary. This is probably overkill for a small, low-volume pharmacy like ours, but it's likely the standard setup from our vendor for all the pharmacies they service.
This machine happens to be running Windoze Server 2003. Unlike XP, most updates to this OS don't require reboots to take effect, but back in March 2012 we had to shut it down for some reason, so I rebooted it. Being the nosy person that I am, I hung around to read the POST messages as they went by, never having seen a Win2k3 boot before.
Waittaminute. What did that just say?
"Warning: RAID volume 4 of 6 degraded." Uh-oh, that doesn't sound good.
"Warning: Power supply 1 of 2 off-line." Neither does that.
The boot completed normally, so I decided to make sure our backups were current, just in case. There's a tape in the drive, but the datestamp on the files is from October 2011. That definitely doesn't sound good.
I check the other three backup tapes, and those are even older. Not good at all.
(For those of you who aren't familiar, RAID-5 takes n drives' worth of data and spreads it across n+1 drives. In our case, we have five drives' worth of data on six physical drives; if one of them goes bad, any five of the six can be used to reconstruct the missing one. One drive being degraded means we've had a drive crash, but because of the redundancy built into the system, we haven't lost any data . . . yet. If a second drive fails, though, we're screwed, which is why I wanted to make sure we had a current backup.)
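(If you want to see the trick behind that, here's a toy Python sketch of the parity math. This is my own illustration, not anything running on our server, and real RAID-5 works on stripes and rotates the parity across the drives, but the XOR arithmetic is the same idea.)

from functools import reduce

def xor_blocks(blocks):
    # XOR equal-length byte blocks together, column by column.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Five "data drives" and one parity "drive" -- toy blocks, not real data:
data = [b"AAAAA", b"BBBBB", b"CCCCC", b"DDDDD", b"EEEEE"]
parity = xor_blocks(data)

# Drive number 3 (index 2) dies; the five survivors rebuild it:
survivors = data[:2] + data[3:] + [parity]
assert xor_blocks(survivors) == data[2]
print("rebuilt drive 3 from the other five")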
I go looking for the backup software, but I can't find it. There's no desktop icon, nothing in the Start menu, and no executable on the C: drive that looks like it has anything to do with backups.
So I call the help desk. I tell the tech about these errors, and that we don't seem to have backed anything up for the past five months. I give him a remote login, and he starts looking around. Meanwhile I ask him where the backup program is, and he tells me it should be running by itself, without me having to tell it anything. Here's how the backup procedure works: in addition to the six drives in the RAID array, there's a seventh hard drive dedicated strictly to backups. The system backs everything up to this drive once an hour, and at some specified (but unknown to me) interval it dumps all of this to the tape. For some reason that hasn't been happening. Turns out the seventh drive went offline at some undetermined point, and nothing has gone to the tape ever since.
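(For the curious, the staging scheme amounts to something like the following. This is my own minimal sketch in modern Python, not the vendor's actual software, which I never did find; the paths and drive letter are made up. The one feature I'd insist on is the obvious one: complain loudly when the staging drive isn't there, which is exactly the failure that bit us.)

import os
import shutil
import sys

BACKUP_DRIVE = "E:\\"                        # hypothetical dedicated backup drive
SOURCE = "D:\\pharmacy_data"                 # hypothetical data directory
DEST = os.path.join(BACKUP_DRIVE, "hourly")

def hourly_backup():
    # Our actual failure mode: the backup drive dropped offline and
    # nothing complained. Check for it explicitly instead of silently
    # doing nothing for five months.
    if not os.path.isdir(BACKUP_DRIVE):
        sys.exit("backup drive offline -- somebody needs to know!")
    shutil.copytree(SOURCE, DEST, dirs_exist_ok=True)

if __name__ == "__main__":
    hourly_backup()    # run from the scheduler once an hour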
In any case, he says he's going to overnight me two new drives, one for the RAID stack and one for the backup. Next day, EdFex shows up with a couple of cartons, I pull the old drive and stick in the new one, the RAID rebuilds itself, life is good. Then I pull the seventh drive and try to put in the new one. System promptly hangs.
Not good.
The tech had been remoted in, but of course as soon as it hung he lost the connection. I pull the drive again and give the machine ye olde three-finger salute, and after a while it wakes back up. We repeat this dance a couple of times; eventually he decides that the new drive is also faulty and that he's going to overnight me another one. Here we go again. Long story short, we finally get the new new drive installed and initialized, and the backups start happening again.
What about the third error, the off-line power supply? Well, that was simple: one of the two power cords had fallen out of the back of the machine. Stick it back in its socket, the error clears, and we're back in business.
The point of all this rambling? We kind of dodged a bullet there. If I hadn't been nosy, nobody would have realized anything was wrong until a catastrophic failure let them know the hard way, because all that redundancy tends to cover up the symptoms of failure. I can flat guarantee you that I'm the first person to read the POST messages on that machine since it was put into service four years ago, and I'm only there part-time. I have to wonder: how many other computers are out there, in pharmacies, pizza shops, clothing stores, or what-have-you, with nobody ever checking "under the hood" until the thing goes down and leaves them stranded? Most people wouldn't treat their cars that way (modulo some of Argabarga's involuntary customers), so why their computers?
More to the point, this is a leased server from a company that also provides the pharmacy software and takes care of our prescription billing, so the computer is in communication with them all the time. That being the case, why don't they have a protocol in place so that when hardware errors start happening, the machine phones home and tells them, and a tech can pro-actively (G_d I hate that word, but it fits here) call us and let us know something needs fixing? Instead, the errors just get logged to a file nobody ever looks at, if they even know it exists.
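(It wouldn't take much, either. Here's a back-of-the-napkin Python sketch of the phone-home idea; the status dictionary, the site name, and the vendor URL are all invented for illustration, and a real version would pull the fault data from the RAID controller's management interface.)

import json
import urllib.request

VENDOR_ALERT_URL = "https://vendor.example.com/alerts"   # hypothetical endpoint

def phone_home(status):
    # status might look like {"raid_volume_4": "degraded", "psu_1": "offline"}.
    faults = {part: state for part, state in status.items() if state != "ok"}
    if not faults:
        return                       # all healthy, stay quiet
    body = json.dumps({"site": "pharmacy-001", "faults": faults}).encode()
    req = urllib.request.Request(
        VENDOR_ALERT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)      # vendor's side opens a ticket and calls the store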
So I guess the message is, look at your computers once in a while. See if they're trying to tell you something.
(As an aside, some of that redundancy seems . . . redundant. There are two power supplies, but they're both plugged into the same UPS, so I don't really see the point. (OK, it protects against failure of the supply itself, but if we really wanted to do it right, they'd be plugged into two different outlets on separate circuits.) Likewise the three modems: all of them are plugged into the same phone line, daisy-chained one to the next . . . at least that's how it is now. Before I fixed it, we had one plugged into the line and the second chained from the first; as for the third, when I traced its cord through the tangled rat's nest of wires under the counter, I found it connected to itself. No wonder that one never got a dial tone... Of course, if that one phone line goes down, none of those three modems are going to do us the least bit of good. We hardly ever use them anyway; they're only for backup in case the internet connection goes down, or for occasional fax-modem duty for refill requests. And yes, I did untangle the rat's nest and put rubber bands around the wires to keep it that way.)