I heard an interesting tale from my boss this morning.
A little over a month ago, the landlord for the building that our data center is in contacted us about some electrical repairs that they would like to do. They wanted to replace the building's main switch (Note: I don't know the exact details of this repair. I just know the end results of this action). Not a problem, as long as it is done in X minutes, since that is how long our UPS will last.
So, three weeks ago, on a Saturday, they replace the switch. They find that they have a “ground fault” (not sure if that is what actually happened. Just what was relayed to me). They pull the switch and put the old one back. They spend some time looking for the problem. Can't find it.
They tell us that they are going to try again the next Saturday. Same thing. They put the old one back.
Last Saturday, they try again. This time, they decide that they are not going to be able to find the problem unless they leave the switch in. So they cut the power to the building, and didn't tell us.
We found out around 10:00 AM when the data center went dead. Eight hours later, they fixed the problem (I heard that they never found the problem, it just “went away”) and restored the power.
We were lucky. Our 240TB of disk arrays all came back online. Lost a number of drives, but none of the RAID sets had multiple failures. We did lose one network switch, two load balancers (servicing our most important application), a blade in one of our SAN switches, 4 or 5 internal hard drives on servers, and the main logic board in our tape library. You have to remember that some of this equipment had not been powered off in over six years. I have only been there 5.5 years.
I had all my servers up by 10:00 PM Saturday. Spent Sunday deploying bandages to applications.
We had all of the customer-facing applications running again by 6:00 PM Sunday. Spent most of today “bracing up” the bandages we put in place, getting Development and QA systems back online, and restarting services that failed because the target servers were not online yet.
So...
How was your weekend?