  • Please, people, pay attention to your computers once in a while. (Long)

    As promised, the story of the server.

    Our server is a Dell PowerEdge 2900 in a tower case. Whoever spec'd this machine out was serious about its availability: it has two power supplies, three external modems, six hard drives in a RAID-5 configuration, etc. Everything is hot-swappable except the CPU itself, and it's got room for a second one of those: if we had two, we could hot-swap the CPU too if it became necessary. This is probably overkill for a small, low-volume pharmacy like ours, but it's likely the standard setup from our vendor for all the pharmacies they service.

    This machine happens to be running Windoze Server 2003. Now, most updates to this OS, unlike XP, don't require reboots to take effect, but back in March 2012 we had to shut it down for some reason, so I rebooted it. Being the nosy person that I am, I hung around to read the POST messages as they went by, never having seen a Win2k3 boot before.

    Waittaminute. What did that just say?

    "Warning: RAID volume 4 of 6 degraded." Uh-oh, that doesn't sound good.

    "Warning: Power supply 1 of 2 off-line." Neither does that.

    The boot completed normally, so I decided to make sure our backups were current, just in case. There's a tape in the drive, but the datestamp on the files is from October 2011. That definitely doesn't sound good.

    I check the other three backup tapes, and those are even older. Not good at all.

    (For those of you who aren't familiar, RAID-5 takes n drives' worth of data and spreads it, plus parity, across n+1 drives. In our case, we have 5 drives' worth of data on 6 physical drives; if one of them goes bad, any 5 of the 6 can be used to recreate the missing one. One drive being degraded means we've had a drive crash, but because of the redundancy built into the system, we haven't lost any data . . . yet. If a second drive fails, though, we're screwed, which is why I wanted to make sure we had a current backup.)
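
    If you're wondering how "any 5 of the 6" can rebuild the missing one, the trick is just XOR parity. Here's a minimal, purely illustrative Python sketch, with toy byte-strings standing in for whole disks (nothing to do with our actual controller):
    Code:
    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte-strings together, column by column."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    # Five "data drives" (toy contents) plus one parity drive, like a 6-disk RAID-5.
    # (A real controller rotates the parity across the drives, but the math is the same.)
    data = [bytes([i] * 8) for i in range(1, 6)]
    parity = xor_blocks(data)

    # Pretend data[2] just died: rebuild it from the four survivors plus the parity.
    survivors = data[:2] + data[3:]
    rebuilt = xor_blocks(survivors + [parity])
    assert rebuilt == data[2]
    print("missing drive rebuilt OK")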

    I go looking for the backup software, but I can't find it. There's no desktop icon, nothing in the start menu, no executable file in the C: drive that looks like it has anything to do with backups.

    So I call the help desk. I tell the tech about these errors, and that we don't seem to have been backing anything up for the past five months. I give him a remote login, and he starts looking around. Meanwhile I ask him where the backup program is, and he tells me it should be doing this by itself, without me having to tell it to. Here's how the backup procedure works: in addition to the six drives in the RAID array, there's a seventh hard drive that's dedicated strictly to backups. The system backs everything up to this drive once every hour, and at some specified (but unknown to me) interval it dumps all of this to the tape. For some reason that hasn't been happening: turns out the seventh drive had gone offline for some undetermined reason, and nothing's been going to the tape ever since.
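
    I have no idea what their backup software actually looks like on the inside, but the general shape of that two-stage scheme is simple enough to sketch in Python. Everything below -- paths, schedule, the "tape spool" directory -- is made up purely for illustration:
    Code:
    import shutil, tarfile, time
    from pathlib import Path

    # Hypothetical locations, not the real paths on our server.
    LIVE_DATA = Path(r"D:\pharmacy\data")
    STAGING = Path(r"E:\backup_staging")     # the dedicated seventh drive
    TAPE_SPOOL = Path(r"F:\tape_spool")      # stand-in for "hand it to the tape job"

    def hourly_snapshot():
        """Copy the live data onto the staging drive (meant to run every hour)."""
        dest = STAGING / time.strftime("%Y%m%d_%H00")
        shutil.copytree(LIVE_DATA, dest)

    def dump_staging_to_tape():
        """Bundle the staging area up for the tape (meant to run on some longer interval)."""
        if not STAGING.exists():
            # This is the failure mode we actually hit: the staging drive quietly
            # went away and nothing ever made it to tape. At least complain loudly.
            raise RuntimeError("staging drive offline -- nothing is going to tape!")
        archive = TAPE_SPOOL / ("full_" + time.strftime("%Y%m%d") + ".tar")
        with tarfile.open(str(archive), "w") as tar:
            tar.add(str(STAGING), arcname="staging")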

    In any case, he says he's going to overnight me two new drives, one for the RAID stack and one for the backup. Next day, EdFex shows up with a couple of cartons, I pull the old drive and stick in the new one, the RAID rebuilds itself, life is good. Then I pull the seventh drive and try to put in the new one. System promptly hangs.

    Not good.

    The tech had been remoted in, but of course as soon as it hung he lost the connection. I pull the drive again and give the machine ye olde three-finger salute, and after a while it wakes back up. We repeat this dance a couple of times; eventually he decides that the new drive is also faulty and that he's going to overnight me another one. Here we go again. Long story short, we finally get the new new drive installed and initialized, and the backups start happening again.

    What about the third error, the off-line power supply? Well, that was simple: one of the two power cords had fallen out of the back of the machine. Stick that back in its socket, that error clears, and we're back in business.

    The point of all this rambling? We kind of dodged a bullet there. If I hadn't been nosy, nobody would have realized anything was wrong until a catastrophic failure let them know the hard way, because all that redundancy tends to cover up the symptoms of failure. I can flat guarantee you that I'm the first person to read the POST messages on that machine since it was put into service four years ago, and I'm only there part-time. I have to wonder: how many other computers are out there, in pharmacies, pizza shops, clothing stores or what-have-you, with nobody ever checking "under the hood" until the thing goes down and leaves them stranded? Most people wouldn't treat their cars that way (modulo some of Argabarga's involuntary customers), so why their computers?

    More to the point, this is a leased server from a company that also provides the pharmacy software and takes care of our prescription billing, so the computer is in communication with them all the time. That being the case, why don't they have a protocol in place so that when hardware errors start happening, the machine phones home and tells them, and a tech can pro-actively (G_d I hate that word, but it fits here) call us and let us know something needs fixing, rather than simply logging the errors to a file nobody ever looks at, if they even know it exists?
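
    It wouldn't even take much. Something like the Python sketch below, run on a schedule, is all I'm picturing. To be clear, the vendor URL and the "hwstatus" command are pure invention on my part (a real Dell box would presumably be queried through something like OpenManage or SNMP instead):
    Code:
    import json, subprocess, urllib.request

    VENDOR_ALERT_URL = "https://vendor.example/api/alerts"    # made-up endpoint

    def collect_faults():
        """Gather hardware complaints. "hwstatus" is a fictitious placeholder CLI."""
        out = subprocess.run(["hwstatus", "--summary"],
                             capture_output=True, text=True).stdout
        return [line.strip() for line in out.splitlines()
                if "degraded" in line.lower() or "off-line" in line.lower()]

    def phone_home(store_id):
        faults = collect_faults()
        if not faults:
            return                           # nothing to report today
        body = json.dumps({"store": store_id, "faults": faults}).encode()
        req = urllib.request.Request(VENDOR_ALERT_URL, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)          # the vendor's ticket system takes it from here

    phone_home("store-001")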

    So I guess the message is, look at your computers once in a while. See if they're trying to tell you something.

    (As an aside, some of that redundancy seems . . . redundant. There are two power supplies, but they're both plugged into the same UPS, so I don't really see the point. (OK, it protects against failure of the supply itself, but if we really wanted to do it right, they'd be plugged into two different outlets on separate circuits.) Likewise the three modems: all of them are plugged into the same phone line, daisy-chained one to the next . . . at least that's how it is now. Before I fixed it, we had one plugged into the line, the second chained from the first, and the third one, when I traced its cord through the tangled rat's nest of wires under the counter, turned out to be connected to itself. No wonder that one never got a dial tone... Of course, if that one phone line goes down, none of those three modems are going to do us the least bit of good. We hardly ever use them anyway; they're only for backup in case the internet connection goes down, or for occasional fax-modem duty for refill requests. And yes, I did untangle the rat's nest and put rubber bands around the wires to keep it that way.)
    Last edited by Shalom; 06-05-2012, 04:55 AM.

  • #2
    Most of our RAIDs have lights on the drive that change to orange if a non-fatal fault is detected, or RED (yes, all-caps red is a color) if a full-on failure happens. All you have to do is glance at the line of drives to check the hardware, which makes things really nice. I go into the server room at least every other day for backup tape changes, so I look through all the lights while the tapes rotate out.

    I still watch the POST when we boot a server. Because you never know. Good catch Shalom.
    The Rich keep getting richer because they keep doing what it was that made them rich. Ditto the Poor.
    "Hy kan tell dey is schmot qvestions, dey is makink my head hurt."
    Hoc spatio locantur.



    • #3
      Wow, did you guys get lucky. I'd say dodging a bullet was an understatement here. That server really needs to have some sort of notification system in place for failures. Don't know why it doesn't.
      "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



      • #4
        Have you considered using cloud services like Amazon's or Google's to back up your data? My employer is already doing that with our data.
        cindybubbles (👧 ❤️ 🎂 )

        Enter Cindyland here!



        • #5
          Pharmacy data being backed up onto a system where unknown people would be able to access it. Those same people are not pharmacy trained, have no legal access to that data, and it would be impossible to ever know if they had accessed it improperly.

          That would be a *huge* legal issue, and that's if everything went well.



          • #6
            This is why I loved the RAIDs I had on my old equipment.

            When a drive went bad or degraded, the RAID would beep SOS until the fault was cleared. Rebuilding was easy: no need to log in, just use the RAID front panel to start the process and wait 3 hours. Sometimes that fixed degraded drives too. And if it didn't, then we just replaced the drive and started the rebuild.


            The only real issue was that some officers got a bug up their ass about backing it up cos they didn't understand just how fucking stable the raids were. I mean um... NINE drives per raid, 2 raids in one system, 3 raids in another. (approx 1TB & 1.5TB). No way any backup would be more stable than that.
            Last edited by PepperElf; 06-05-2012, 05:08 PM.



            • #7
              Quoth PepperElf View Post
              The only real issue was that some officers got a bug up their ass about backing it up cos they didn't understand just how fucking stable the raids were. I mean um... NINE drives per raid, 2 raids in one system, 3 raids in another. (approx 1TB & 1.5TB). No way any backup would be more stable than that.
              Doesn't matter how stable the RAID is, it still won't survive this:



              Off-site (or even out-of-the-room) backups are your only hope at this point. The only thing more important than performing regular backups is *testing* the regular backups.

              (And while this wasn't my company, I have seen it happen. Not pretty.)
              "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



              • #8
                Quoth PepperElf View Post
                The only real issue was that some officers got a bug up their ass about backing it up cos they didn't understand just how fucking stable the raids were. I mean um... NINE drives per raid, 2 raids in one system, 3 raids in another. (approx 1TB & 1.5TB). No way any backup would be more stable than that.
                Actually, that is probably less stable than a smaller RAID configuration. With an odd number of drives like that, it was probably RAID5, which means each piece of data is stored on only one disk (plus parity). With RAID1, it's also written to at least one duplicate disk.

                Chances are also good that the drives were installed from the same manufacturing lot, so had the same MTBF and the same starting point. End result is that they were more likely to fail in groups than individually. And, with RAID5, if multiple drives go, the data on the RAID is gone.
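
                To put rough numbers on why that second failure matters, here's a quick back-of-the-envelope sketch in Python; the failure rate and rebuild time are invented purely to show the shape of the problem:
                Code:
                # Once one drive in a 9-drive RAID5 dies, just ONE more failure among
                # the 8 survivors before the rebuild finishes loses the whole array.
                annual_failure_rate = 0.03      # 3% per drive-year -- a made-up ballpark
                rebuild_hours = 24              # big array, slow rebuild -- also invented
                survivors = 8

                p_per_drive = annual_failure_rate * rebuild_hours / (365 * 24)
                p_array_lost = 1 - (1 - p_per_drive) ** survivors
                print(f"chance of losing the array during one rebuild: {p_array_lost:.4%}")
                # ...and that assumes independent failures; same-lot drives make it worse.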

                Backups aren't just for show. Neither is testing them. They're necessary, and a very good thing.



                • #9
                  Backups are worthless. Restores are priceless

                  Quoth PepperElf View Post
                  The only real issue was that some officers got a bug up their ass about backing it up cos they didn't understand just how fucking stable the raids were.
                  RAID does not protect against file deletion or data corruption due to user error.
                  Life is too short to not eat popcorn.
                  Save the Ales!
                  Toys for Tots at Rooster's Cafe



                  • #10
                    According to Shalom, the hardware/service is the responsibility of the 3rd party tech company. Personally I'd write a note to DM and say, dude, they're not pulling their weight. I had to CALL for help, and that's to get the updates that THEY didn't get.
                    In my heart, in my soul, I'm a woman for rock & roll.
                    She's as fast as slugs on barbiturates.



                    • #11
                      Quoth csquared View Post
                      Backups are worthless. Restores are priceless
                      Amen, brother. WAY too many IT "professionals" forget that.
                      "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



                      • #12
                        Quoth csquared View Post
                        Backups are worthless. Restores are priceless

                        That is SO true.

                        When I was doing the disaster recovery plan/procedure for the last big company I worked for (a company that built and stored trade show exhibits), I planned for the worst possible scenario, i.e. either the facility being completely destroyed by fire OR being totally destroyed by an airplane (we were very near the landing and takeoff paths of a MAJOR, very busy airport).

                        We ran through the scenario at least once before I left. The most data we ever had onsite was one week's worth; everything else was offsite, as well as at the commercial backup facility.

                        As for internal stuff, a couple of the machines I was responsible for had internal RAID configurations.
                        I'm lost without a paddle and headed up SH*T creek.
                        -- Life Sucks Then You Die.


                        "I'll believe corp. are people when Texas executes one."



                        • #13
                          Quoth Geek King View Post
                          Most of our RAIDs have lights on the drive that change to orange if a non-fatal fault is detected, or RED (yes, all-caps red is a color) if a full-on failure happens. All you have to do is glance at the line of drives to check the hardware, which makes things really nice.
                          Our server looks like this: from top to bottom, there are the DVD-ROM drive, the tape drive (some weird Iomega thing), and the backup hard drive in its hot-swap slide-out carrier. The other six drives are hidden under the bezel. Once I removed the bezel, it was obvious which drive had failed; it was the only one whose I/O light wasn't flashing. Now that you mention it, the power light might have been orange as well; I don't recall exactly. Problem is, unless you take off the bezel, you can't see the drives; out of sight, out of mind.

                          (The factory bezel normally covers the whole front; someone seems to have taken off the top half of ours, presumably so you can access the tape drive without removing the whole thing.)

                          That's a new UPS, by the way. I came in last week and found a red light and an intermittent beeping noise. It had been doing that for a few days and nobody thought to do anything about it; they probably figured, "Well, Shulem will be here Tuesday, he'll take care of it then." I called the vendor and reported it, and they said they'd send me a new battery and instructions on how to install it. Then this huge box shows up. Turned out they'd sent me a whole new UPS. I actually replaced it without shutting down the computer (two power supplies, remember). That was the weirdest experience, pulling the plug on a running server and having it keep merrily rolling along. I also spent a good part of the afternoon under that counter, rolling up and tying wires and getting my white coat filthy. It had resembled an epileptic octopus down there; now it's neat. Well, neater.

                          Quoth Geek King View Post
                          I still watch the POST when we boot a server. Because you never know. Good catch Shalom.
                          Thanks.

                          Quoth Crossbow View Post
                          Wow, did you guys get lucky. I'd say dodging a bullet was an understatement here. That server really needs to have some sort of notification system in place for failures. Don't know why it doesn't.
                          I think it actually might . . . except that if it does, it's notifying an internal email account that nobody reads, or even knows exists in the first place. When this started happening, which was around April 4 by the time we'd got done horsing around with the backup drive, the tech called my attention to an Outlook icon on the desktop, and told me to click on it. It promptly began downloading internally-generated emails from mid-2010, which was apparently the last time it had been clicked on. There may have been an error report buried in the blizzard of SQL Server Reports (all of which had a file attachment that was about 7 lines long, telling me exactly nothing useful), records of drug database updates, and $DEITY knows what-all else, but if so I couldn't find any.

                          Quoth Crossbow View Post
                          Off-site (or even out-of-the-room) backups are your only hope at this point.
                          I think that's how it's supposed to work here. We have four tapes, labeled Daily-1, Daily-2, Weekly and Monthly. All four of these are supposed to have the same stuff on them, more or less; it's not an incremental backup, if I understand it correctly, but an independent full backup on each tape. We're supposed to change the Daily tapes daily, once a week put the Weekly one in and then stick it in the safe, and once a month put the Monthly one in and then take it home and keep it at your house, which I guess is to protect it in the event of what you showed above. Unfortunately our safe isn't very, so all the tapes not currently in use are stored in a consumer-grade fireproof box on the shelf next to the computer. Including the Monthly one, until I explained to my boss what it's for. I think he's got it in the business office, which is in a different building on the other side of the village; I just have to remind him to bring it back to the store once in a while...
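
                          For what it's worth, the rotation itself is simple enough to sketch; the little Python below is just my reading of how the four tapes are meant to be used, not anything out of the vendor's documentation:
                          Code:
                          import datetime

                          def which_tape(day):
                              """My interpretation: Monthly on the 1st, Weekly on Fridays,
                              otherwise alternate the two Daily tapes."""
                              if day.day == 1:
                                  return "Monthly"        # this one leaves the building
                              if day.weekday() == 4:      # Friday
                                  return "Weekly"         # this one goes in the safe
                              return "Daily-1" if day.toordinal() % 2 else "Daily-2"

                          # Example: what should be in the drive each day this week?
                          start = datetime.date(2012, 6, 4)
                          for offset in range(7):
                              d = start + datetime.timedelta(days=offset)
                              print(d, which_tape(d))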

                          Quoth Pedersen View Post
                          Backups aren't just for show. Neither is testing them. They're necessary, and a very good thing.
                          I hope the system is set up to read back what it writes to the tape as soon as it writes it. There doesn't seem to be any procedure for doing test restores; perhaps there's an automated one, just as there is for the backup process itself.
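
                          Even without the vendor's tooling, a poor man's verify pass wouldn't be hard: restore a tape to a scratch folder and checksum it against the live data. A rough Python sketch, with made-up paths:
                          Code:
                          import hashlib
                          from pathlib import Path

                          def checksum_tree(root):
                              """Map each file's relative path to its SHA-256 digest."""
                              return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
                                      for p in sorted(root.rglob("*")) if p.is_file()}

                          def restore_matches(original, restored):
                              """True only if the restored tree matches the original, file for file."""
                              return checksum_tree(original) == checksum_tree(restored)

                          # Hypothetical paths: the live data vs. a test restore of last night's tape.
                          if restore_matches(Path(r"D:\pharmacy\data"), Path(r"E:\test_restore\data")):
                              print("restore verified -- that tape would actually save us")
                          else:
                              print("MISMATCH -- that tape would NOT have saved us")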

                          Quoth Der Cute View Post
                          According to Shalom, the hardware/service is the responsibility of the 3rd party tech company. Personally I'd write a note to DM and say, dude, they're not pulling their weight. I had to CALL for help, and that's to get the updates that THEY didn't get.
                          I don't think we actually have a DM. Above me is the manager/head tech; above him is the owner, and that's as far as the chain of command goes in this place. Mαrt Of Heαlth is a franchise (or maybe just a buyer's group, not sure) and each branch is independently owned.

                          But I am going to call them and ask them if they have such a program. They're not just tech support, they're also our main drug wholesaler, so if anything happens to us, they lose money too.

                          The funniest part of all this is, this isn't what they pay me for. I'm a pharmacist, not an IT guy; I'm there to fill prescriptions and counsel patients, not fix the damn computers, but somehow wherever I work I wind up doing this stuff . . .
                          Last edited by Shalom; 06-06-2012, 03:32 PM.



                          • #14
                            Quoth Shalom View Post
                            The funniest part of all this is, this isn't what they pay me for. I'm a pharmacist, not an IT guy; I'm there to fill prescriptions and counsel patients, not fix the damn computers, but somehow wherever I work I wind up doing this stuff . . .

                            You know how to do it, and apparently have a very strong work ethic. That would be why.



                            • #15
                              Quoth Pedersen View Post
                               Pharmacy data being backed up onto a system where unknown people would be able to access it. Those same people are not pharmacy trained, have no legal access to that data, and it would be impossible to ever know if they had accessed it improperly.

                              That would be a *huge* legal issue, and that's if everything went well.
                              This problem can be solved using the right encryption set-up... Needs to be double-checked, to make sure nothing's leaking, but it's doable.

