
Unstable server is unst... oh crap.

This topic is closed.

  • Unstable server is unst... oh crap.

    It's been a fun week so far.

    The server that runs our most important software had a drive crap out on Monday. Which is fine, since the array can run while short a drive. It may not be happy, but it works.

    We got a drive delivered same day and, as our contractor was getting set to install it, a second drive failed, which tanked the server.

    The OS and data files got hosed at the same time.

    We've been down now for about 40 hours. The server has been rebuilt from the ground up, we got the software reinstalled, grabbed the latest data backup - from half an hour before the crash - only to find out that the backup is corrupt. That's okay, we've got a secondary backup from not too long before that. Which is also corrupt...

    We're now looking at our third and final backup to see if that one is viable. If not, then we have to rebuild the database from scratch. *sigh*
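    Corrupt backups that nobody notices until restore time are exactly what routine verification catches. A minimal sketch of the idea, assuming the backups are ordinary files and a checksum manifest gets written alongside each backup set (the paths, the `.bak` extension, and the manifest name here are all made up for illustration):

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Stream the file through SHA-256 so large backups don't blow out memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(backup_dir: Path) -> Path:
    """Record a checksum for every file in the backup set."""
    manifest = backup_dir / "MANIFEST.sha256"
    lines = [f"{sha256sum(p)}  {p.name}" for p in sorted(backup_dir.glob("*.bak"))]
    manifest.write_text("\n".join(lines) + "\n")
    return manifest

def verify_manifest(backup_dir: Path) -> list[str]:
    """Return the names of files that are missing or no longer match their checksum."""
    bad = []
    manifest = backup_dir / "MANIFEST.sha256"
    for line in manifest.read_text().splitlines():
        digest, name = line.split("  ", 1)
        path = backup_dir / name
        if not path.exists() or sha256sum(path) != digest:
            bad.append(name)
    return bad
```

    Run the verify step right after every backup job and a corrupt backup shows up in the next morning's log, not 40 hours into an outage.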

    This is what I get for taking Friday the 13th off. It waited a weekend and was extra angry.

  • #2
    Mini-background: Lead <accountingsoftware> support dude and general office geek - Winstalls, printers, viruses, hardware upgrades. I don't do networks, as (a) I'm not trained for them, (b) He doesn't pay me enough to be trained for that, and (c) I'd rather leave that to the guy who has been doing nothing BUT server/network stuff for the past decade or so. If the server or n/w malfunctions, we call our outside n/w dude. Also, if I don't mess with the server, he can't blame me when it goes *pop*crunch*blam*.

    We've had something similar to this happen...Boss has always been of the "does it still power up? If yes, then keep using it, there's no need for a new one" mentality when it comes to hardware. Sadly, this applies to the Server, as well...For reference, our old server was repurposed as a Workstation when we finally replaced it...and it wound up as the backup backup backup WS because it was then the most pathetic system in the office.

    Anyhoo...Fast forward to Katrina aftermath. New(ish) server that had been put in by an installer who promptly disappeared after the storm. N/W setup was...odd...to say the least -- If the server is down, or if it loses internet, we don't have ANY network connectivity, internet, nor (VOIP) phone access. We're freaking DEAD as an office til we get it back up again. Don't ask me why, as that might compel me to investigate it. Quite frankly, I'd rather not know. Note that, four years and two network dudes later, it's still that way. See above "if it ain't broke don't fix it" statement. The end result of this was that the outside n/w guy had to spend something like a week deciphering the arcane server setup that the prior dudes had done and getting it all working -- explicitly because the boss didn't want to just start over and do it fresh. Naturally, the labor for this far exceeded what the cost of doing it from scratch would have been (new guy kept finding tons of "WTF Why did they do THAT?! " problems, which he then had to fix).

    Anyhoo, on to the meat of the story. We have a backup service that (in theory) backs up our important files, including our accounting data files. Naturally, both that and our internal (symantec) backup software were not configured properly for something like three years or so...(I was only recently trained on these at all). We got occasional backups, but not a ton. On the online end, it wasn't always backing up because the boss -- who did this part -- didn't update the "folders to backup" properly and just assumed it was working. On the local end...Well, the C drive for the server was about 100gb, and it often had less than ONE gb of free space...>_> Not good. Our most recent n/w guy has been able to cobble together the best upgrades we could get for the aging hardware we have -- 2-drive mirrored SCSI RAID for the OS, another set of same for the Data Drive, backups configured properly, etc...

    Thing is, even with this setup, due to the age and other factors, the server MUST be shut down - let sit for a minute - Power cycle T1 and all connected hardware - restart server every week or so. Ideally, over the weekend when nobody but the boss is there and nobody needs to get in via TerminalServer. If we fail to do this regularly, then the server will just randomly lose all connectivity at some point in the following week...Say, at lunchtime on a Wednesday >_< Yes, this has happened. More than once. Could you tell? Fixing this requires a full reboot, about half an hour total. The boss eventually had me write out the detailed instructions on how to do this, for times when I wasn't there -- it is now FRAMED in the server room...
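    The framed checklist is really just a fixed, ordered sequence with a mandatory wait in the middle. A sketch of that runbook as a script, with a `dry_run` switch so it can be rehearsed safely; the step names come from the description above, and the actual shutdown/power-cycle commands are left out since they depend entirely on the real environment:

```python
import time

# Ordered weekly power-cycle runbook, per the framed instructions.
STEPS = [
    ("shut down server", 0),
    ("wait with everything off", 60),  # let it sit for a minute
    ("power cycle T1 and all connected hardware", 0),
    ("restart server", 0),
]

def run_weekly_cycle(dry_run: bool = True) -> list[str]:
    """Walk the runbook in order; in dry_run mode, skip the waits and just log."""
    log = []
    for step, wait_seconds in STEPS:
        log.append(step)  # in real use, this is where the actual command would run
        if wait_seconds and not dry_run:
            time.sleep(wait_seconds)
    return log
```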

    Long story short: A couple of months ago, one of the drives in the C array gave out. I knew this only because the boss wasn't there that day and I had to do the weekly reboot. I noticed an error during the init, ran the diags, and called our n/w guy, who told me that a drive was failing. He was to come out in a couple days anyway for maintenance, so he said to just close the diags and let it reboot. I did so. The error message then appeared for BOTH drives in the C array and one of the data drives...Note that the boss had been seeing the error message for some time now ... It basically said something like "drive needs to be checked" or "potential drive problems" -- along with a "press CTRL-whatever to open diags"....and he never did. He just let it continue booting. For months.

    I called N/W guy back...he was there an hour later ^_^ He got to spend that entire weekend totally restoring much of the server, replacing 3 bad HD's and reconstructing as much as possible, instead of coming in for a 2-3 hour maintenance sweep.

    He has told the boss in no uncertain terms that he WILL have to get a brand new server within a year the way things are going, at which point n/w guy will build him a proper custom rig and set it all up from scratch (from his own office), bring it to our office and plug it in so that we will have minimal downtime. I strongly suspect that that ain't gonna happen until the server dies an ignoble death in the middle of tax season or something and totally screws us all...>_> *knocks on wood*
    Last edited by EricKei; 11-21-2009, 12:49 PM. Reason: forgot a detail; part deux: I apparently cannot spell
    "For a musician, the SNES sound engine is like using Crayola Crayons. Nobuo Uematsu used Crayola Crayons to paint the Sistine Chapel." - Jeremy Jahns (re: "Dancing Mad")
    "The difference between an amateur and a master is that the master has failed way more times." - JoCat
    "Thinking is difficult, therefore let the herd pronounce judgment!" ~ Carl Jung
    "There's burning bridges, and then there's the lake just to fill it with gasoline." - Wiccy, reddit
    "Retail is a cruel master, and could very well be the most educational time of many people's lives, in its own twisted way." - me
    "Love keeps her in the air when she oughta fall down...tell you she's hurtin' 'fore she keens...makes her a home." - Capt. Malcolm Reynolds, "Serenity" (2005)
    Acts of Gord – Read it, Learn it, Love it!
    "Our psychic powers only work if the customer has a mind to read." - me



    • #3
      Yeah, upper management has the same thought process towards hardware - it works now, so it will keep working for as long as we want it to.

      We managed to get the data from the third backup. However, we ran into some SNAFUs and FUBARs with the software installs, which didn't get the server working (sort of) until Wednesday. For two and a half days, starting Wednesday morning, we had the software support people trying to 'fix' an issue that prevented us from running credit cards properly for a number of room reservations. Which, as anyone in hospitality will tell you, is a Big F'ing Deal.

      Finally, on Friday, I got a bit caught up with my other duties (accounting) which had fallen by the wayside, and decided to take a look at the issues myself - having previously been told to 'leave it to the professionals' who still didn't have working results for us.

      Long story short, after about 2 hours, I figured out at 4:30 PM that it was the ini file. As of 4:55 PM I had finished rewriting the ini file to the way it should have been from the start. Pushed my new ini file out to the server, rebooted the workstations, and BAM! Everything bloody well works fine.
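      For what it's worth, a diff against a known-good copy is exactly the kind of check that finds a bad ini file in minutes rather than days. A minimal sketch using Python's stock `configparser`; the section and key names used when calling it would come from the real software's config, so any shown in testing are hypothetical:

```python
import configparser

def diff_ini(expected_text: str, actual_text: str) -> list[str]:
    """Report keys that are missing from, or differ in, the actual ini
    compared to a known-good reference copy."""
    expected = configparser.ConfigParser()
    actual = configparser.ConfigParser()
    expected.read_string(expected_text)
    actual.read_string(actual_text)
    problems = []
    for section in expected.sections():
        if not actual.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key, value in expected.items(section):
            got = actual.get(section, key, fallback=None)
            if got != value:
                problems.append(f"[{section}] {key}: expected {value!r}, got {got!r}")
    return problems
```

      An empty list means the live file matches the reference; anything else is a line-by-line list of what to fix.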

      I guess I should have looked at the damned thing on freakin' Wednesday when we called them in.

      On the bright side, my boss was impressed that I did in 2 1/2 hours what the support people couldn't do in almost 3 full days. And now the system works properly again.



      • #4
        and this is where you research:
        How much business/money did we lose due to being down, system wise?

        How much money would it cost to have a better, VERIFIED backup system AND BETTER HARDWARE?

        betcha a dozen donuts the hardware & backup would cost less.
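        The donut bet is easy to settle with back-of-the-envelope numbers. A sketch of the comparison; every figure below is a made-up assumption, so plug in your own:

```python
def downtime_cost(hours_down: float, revenue_per_hour: float,
                  staff_count: int, loaded_wage: float) -> float:
    """Lost revenue plus wages paid to staff who can't work."""
    return hours_down * (revenue_per_hour + staff_count * loaded_wage)

# Hypothetical figures for a 40-hour outage like the one above.
outage = downtime_cost(hours_down=40, revenue_per_hour=500,
                       staff_count=10, loaded_wage=25)

# Hypothetical up-front cost of doing it right:
# decent server hardware plus a year of a verified backup service.
prevention = 6000 + 1200
```

        With numbers anywhere in this neighborhood, one outage costs several times the prevention; the donuts are safe.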

        PissedNoob
        In my heart, in my soul, I'm a woman for rock & roll.
        She's as fast as slugs on barbiturates.



        • #5
          Quoth Cutenoob
          and this is where you research:
          How much business/money did we lose due to being down, system wise?

          How much money would it cost to have a better, VERIFIED backup system AND BETTER HARDWARE?

          betcha a dozen donuts the hardware & backup would cost less.

          PissedNoob
          That's a sucker's bet. I'm keeping the dozen donuts for myself to help eat away some of last week's stress.

          Yeah, we're trying to point out how much cheaper it would have been even if we just had a spare server that we could have thrown into place. Not to mention if we were running data replication on such a business critical dbase. *sigh*

          The stupid, it pains me. Right in the frontal lobe.



          • #6
            I have an old workstation from '03 that was repurposed as a mission-critical server. It has one HD, backed up nightly. It works, so we don't need to get an actual server with RAID and whatnot, says management. I feel your pain.
