So this week has sucked in a huge way (Long)


  • So this week has sucked in a huge way (Long)

    Monday was my first day back from a week off, so I expected to be busy and I was but not a big deal. I spent the day handling all the little crap that normally goes on and at the end of the day uttered the infamous words of doom, "Wow, that wasn't bad, I expected much worse". Oops.

    I get home around 7 p.m. and start seeing emails that all of our servers are dropping offline. This isn't good as it means we either lost power for a long time and all of them shut down or we lost internet. I was hoping and praying for the internet but I was able to ping the router so I knew we were screwed. There wasn't much I could do at that point so I just hung out and waited for notification of power restoration.

    Around 10 p.m. I got word that it was back on so I headed in to simply (I thought) boot everything back up and go back home. I get there and start booting everything back up. I start with both domain controllers and while they start to boot I move on to the others. Right off I notice our content filter won't boot, which is bad but easy to bypass if it comes down to it. Then the file server comes up with 2 drives bad. I quickly realize this wasn't a simple power outage.

    I go back to the DC's and our primary comes up with several errors. Oh shit. I log in and move to the backup. It won't come up at all. OH SHIT. Go back to the primary and start looking around: AD is still there but wiped completely out. DNS is wiped, same with DHCP. Now I realize just how screwed I am, but want to know what on earth happened. After looking through the event viewer it looks like power was restored and everything started to power on by itself (they are supposed to, but we still go up after an outage just to make sure nothing screwed up booting, thus my trip). Halfway through booting we took a direct lightning strike to the building, causing the UPS's to shut completely off rather than risk something getting through. Normally that would be good, but not in the middle of a boot. So, it turns out that I have lost both DC's, the content filter, one of 2 terminal servers, and a day's worth of edits on our file server.

    By now it is 1 a.m. and I have to have a DC back online by 7 a.m. so I get to work. I try a system state restore from backup but the management server and tape library are farked as well. I happened to have a spare virtual server already built without anything installed so I decide to go with that as a temporary fix. I add the roles and promote it. Seems fine, except everything is taking longer than it should. Start replication and start building DHCP. It won't activate.

    I spend an hour trying everything I know and everything my google-fu can turn up. Nothing. I decided to forget it for a minute and check out replication. 5 errors, and the service has stopped. At this point I began deciding whether this job was really worth the amount of work I had ahead of me, and decided it was so I kept on. I finally get enough replication that it will sort of work and DHCP activates but for some reason keeps deciding to deactivate. 7 a.m. rolls around and I don't have a true domain controller because it won't replicate, DHCP doesn't work right, and all clients have only the primary and backup DC's for DNS so they have no network access. They can get to our DR site because it is in the router so they can work, but no one has used it in three years.

    CIO shows up and I am expecting an 'attaboy for working all farking night but rather get a "WTF, WHY ISN'T THIS UP! THIS IS A NIGHTMARE! WE ARE F*&%ED! WHY DIDN'T YOU GET THIS WORKING!" The thoughts that ran through my head are not appropriate, so I will not share them, but needless to say I wasn't happy. I worked for two more hours with everyone and their dog stopping by to tell me that the internet was down. Sorry you can't get to facebook right now, f*&% off.

    Boss tells me to go home and sleep and come back asap. I go home and sleep for two hours then come back. I start by showing every person how to log in to the DR site, and then how to work from there (click the friggin icon like you do on your own computer moron). Around 5 p.m. I am able to actually start working on our shitstorm of data corruption.

    Over the next three days I had to rebuild our terminal server, restore the file server, restore the management server for backups (really fun when you are trying to restore from backup the server that manages the backups) and build a secondary DC that we turned into primary because the other refused to work. By Friday I had everything online, held together with duct tape and hopeful thinking. This week has blown, to say the least and I am not done with it all yet. Ugh. I need several drinks.

  • #2
    OH wow... that's some really shitty luck. Taking a direct hit, unless you have a Faraday cage built around the server room, you are screwed.

    • #3
      I understood about 33% of what you said. Either way, sounds like a shitty situation!!!

      Drink your name and relax!

      • #4
        Ouch! A hard shutdown during boot up is among the worst things that can happen, plus the EM pulse from the lightning itself. Small wonder the important stuff was wiped.
        May I suggest a session with NTDSUTIL, if you haven't done that already, to check for skeletons in the AD closet, like the old DCs. With a corruption like this, they most likely didn't get removed properly, even if you demoted them properly. The FSMOs are fucked up too, most likely.

        Hmmm, having a virtual machine ready is a good idea, I'll suggest cloning our main DC into Vmware to our IT team. Plus a daily systemstate and we should have at least a DC up and running so people can get some work done (or surf the internet).

        After that week you really deserve a drink or three!
        No trees were killed in the posting of this message.

        However, a large number of electrons were terribly inconvenienced.

        • #5
          Quoth BeeMused View Post
          May I suggest a session with NTDSUTIL, if you haven't done that already, to check for skeletons in the AD closet, like the old DCs. With a corruption like this, they most likely didn't get removed properly, even if you demoted them properly. The FSMOs are fucked up too, most likely.

          Hmmm, having a virtual machine ready is a good idea, I'll suggest cloning our main DC into Vmware to our IT team. Plus a daily systemstate and we should have at least a DC up and running so people can get some work done (or surf the internet).
          Oh yeah... ntdsutil files info showed all files wiped out, 0.0 Kb. Ntds folder under systemroot was gone... not wrong or corrupt, friggin gone. SAM wouldn't initialize for obvious reasons as well. I have the (old) main DC completely unplugged from the network and the only way to get a boot at all is to boot into directory restore mode, which won't allow me to run dcpromo to kill it or remove the roles. FSMO was jacked, but we used another region's DC to seize all 5 FSMO's to at least get things going. Once I rebuilt the former backup DC, we seized them back and all is well. I would just wipe and rebuild the stupid primary but we had our system for managing door access on there and I have been ordered to recover it rather than wipe and rebuild. Systemstate would have saved my arse had the tape library cooperated.

          Thank goodness for the spare blank VMWare machine, as it allowed us to get people working faster, but it doesn't seem VM's can handle the job of a DC. Who knows, something on that spare machine might have corrupted during all hell breaking loose too. If I hadn't been a drinker before, this week would have driven me to it. Bud light and football are easing my pain atm.

          • #6
            *Passes over a Guinness*

            There, have a real beer.

            • #7
              Hey! I love good beer, but it is damned expensive to drink good beer all the time.


              *takes Guinness and drinks merrily*

              • #8
                Quoth Mmmm_Beer View Post
                Oh yeah... ntdsutil files info showed all files wiped out, 0.0 Kb. Ntds folder under systemroot was gone... not wrong or corrupt, friggin gone. SAM wouldn't initialize for obvious reasons as well. I have the (old) main DC completely unplugged from the network and the only way to get a boot at all is to boot into directory restore mode, which won't allow me to run dcpromo to kill it or remove the roles. FSMO was jacked, but we used another region's DC to seize all 5 FSMO's to at least get things going. Once I rebuilt the former backup DC, we seized them back and all is well. I would just wipe and rebuild the stupid primary but we had our system for managing door access on there and I have been ordered to recover it rather than wipe and rebuild. Systemstate would have saved my arse had the tape library cooperated.
                We do our backups to disks, and a systemstate of both DCs gets backed up to other systems, so we have them even if the main backup is f'ed up. We can still do a full backup of all servers overnight, which is nice.
                Good luck with recovering the old DC, lemme guess it had all the FSMOs. Now Microsoft would tell you 'don't bring it back' because of that, yeah right. And you can't extract the door access system data, I guess?
                Your next week should be fun filled yet again. At least the network is running and people can work, so it shouldn't be as stressful as last week.


                Quoth Mmmm_Beer View Post
                Thank goodness for the spare blank VMWare machine, as it allowed us to get people working faster, but it doesn't seem VM's can handle the job of a DC. Who knows, something on that spare machine might have corrupted during all hell breaking loose too.
                Luckily we have not even 100 users, so a VM should be able to handle that, at least for a few hours until a real DC is back online. That's worth a test for sure, we have a machine dedicated to network and programming tests with 8GB ram, that should be enough to handle our DC (and DHCP and DNS and WINS (meh, but one app still needs it)) load.
                On the other hand, I'd be more worried about our databases, with those offline, we'd be in real trouble. I'm really paranoid about backing up those.


                Quoth Mmmm_Beer View Post
                If I hadn't been a drinker before, this week would have driven me to it. Bud light and football are easing my pain atm.
                Ewww Bud light... get a real beer!!!

                • #9
                  Would a cold spare have helped at all?

                  Strong argument for making the UPS and the surge protector separate units there -- the SPs would have friggin' exploded, but at least the machines would have gotten a chance to shut down.

                  -E- I just noticed, power was out for THREE HOURS. No UPS can hold out that long. When the power came back and lightning hit the building, there was no charge in the batteries, so boom, disaster. I really feel sorry for you.
                  Last edited by roothorick; 08-30-2010, 12:14 AM.
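
As a rough illustration of the runtime point above, here is a minimal Python sketch of the usual back-of-envelope estimate: usable battery energy divided by load. The battery capacity, load, and inverter efficiency are hypothetical figures picked for the example, not numbers from this incident, and `ups_runtime_hours` is an illustrative helper rather than any vendor's formula.

```python
def ups_runtime_hours(battery_wh: float, load_w: float, efficiency: float = 0.9) -> float:
    """Rough runtime estimate: usable battery energy (Wh) divided by load (W).

    This ignores battery age, temperature, and discharge-rate effects,
    all of which tend to shorten real-world runtime further.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * efficiency / load_w


# Hypothetical figures for illustration only (not from the thread).
battery_wh = 1500.0   # assumed total battery capacity in watt-hours
load_w = 2000.0       # assumed combined draw of the attached servers in watts
outage_hours = 3.0    # the thread describes roughly a three-hour outage

runtime = ups_runtime_hours(battery_wh, load_w)
print(f"Estimated runtime: {runtime * 60:.1f} minutes")
print("Covers the outage:", runtime >= outage_hours)
```

With figures anywhere in this range, the batteries run flat long before a multi-hour outage ends, so a UPS sized this way only buys time for a clean shutdown.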

                  • #10
                    A cold spare would have helped immensely but we didn't have the hardware. We also didn't have our backups in order (I have now been given the task of fixing them, yippee!) and we weren't as prepared as we thought we were for something like this.

                    Luckily, we are now far more fail-resistant and have backups and spares in line in the event it happens again. Still a crappy couple of weeks.

                    • #11
                      So, OUCH. More like FUCKFUCKFUCKFUCKSHITGODDAMMIT.

                      Big hugs for you and some more beer and home made nachos.

                      Can you write out a report on what all happened and why? And show the boss that IF WE DO X, WE GET THIS MESS. IF WE DO X+500, WE DON'T GET THIS MESS.

                      Wonder why your backup stuff got toasted.
                      In my heart, in my soul, I'm a woman for rock & roll.
                      She's as fast as slugs on barbituates.

                      • #12
                        Quoth Cutenoob View Post
                        So, OUCH. More like FUCKFUCKFUCKFUCKSHITGODDAMMIT.
                        QFT.

                        I love nachos! My boss is actually willing to spend whatever it takes to have this never happen again, which is nice. Backup tapes were actually OK in the end but the server we use to run the software was toast for the same reason all of the others were. I now have a cold spare DC and blank server just in case. We are investing in an extra layer of surge protection and bigger UPS's.

                        Lesson learned; when you think you have enough protection, you don't.

                        • #13
                          What about generators? Or is the business OK to be offline for a few hours/half a work day? I know hospitals CANNOT go down, or banks (data centers for them at least) or ISP's. But depending on the business, you could do what you're doing (better hardware, spares) or even have a totally set up cold room.

                          Was this f-up all due to the tornado shit that hit TX?

                          Has the building maint been contacted, check your breaker boxes?? Hmm? What about seeing if the building could be grounded better than it is now? Wouldn't a better trip/breaker thing be able to handle a large surge coming from outside, trip and stop the lightning/magic from getting any further?

                          And how many times did you cuss, just curious

                          • #14
                            Quoth Cutenoob View Post
                            What about generators? Or is the business OK to be offline for a few hours/half a work day?

                            Was this f-up all due to the tornado shit that hit TX?

                            Has the building maint been contacted, check your breaker boxes?? Hmm? What about seeing if the building could be grounded better than it is now? Wouldn't a better trip/breaker thing be able to handle a large surge coming from outside, trip and stop the lightning/magic from getting any further?

                            And how many times did you cuss, just curious
                            Our building wouldn't allow us to install generators, which is why we are leaving at the end of our lease. There is nothing in our building that is life or death so we can't "demand" it.

                            This was actually caused by a small tornado directly hitting our block, but it wasn't part of the string of storms. Only people in the area really knew about it.

                            Maintenance is about as useful as tata's on a boar. They can't even keep their own major systems up, let alone worry about our specific ones. The main issue here was the shitty, shitty, shitty timing of it all. Power outage was no big deal because it happens all the time. Batteries drained and everything shut down peacefully. Power restored and everything begins to come back up nicely like it should. BOOM. Lightning wipes out all input power. The lightning only blew a few batteries in the end; after analyzing a little further it doesn't look like it made it past the UPS's. Not too big of a deal unless your servers are halfway through a boot. Then comes the "FUCKFUCKFUCKMOTHERFUCKINGCOCKSUCKINGPIECEOFMOTHERFUCKINGSHIT"

                            2,345,724 so far from this incident. I occasionally find more hidden away stuff that got fucked up so the number is still dynamic.

                            • #15
                              Quoth Cactus Jack View Post
                              That's why you drink a couple of the good beers and then switch to the cheap beer after your standards are lowered.
                              I start low so I don't have to downgrade.
