The Case of the Overheating Datacenter

In the very early morning of Friday, March 15th, various sensors around the IT datacenter began sending out emergency notifications to IT and Facilities staff about temperature limits being exceeded. By 12:15 AM, IT's Systems & Networking Technician Dan Leich was in his car on his way to campus, and the other Systems & Networking staff were communicating via phone calls and text messages. Around 12:30 AM, Dan walked into Emerson's central datacenter and got a nice face-full of 110-degree air. The two HVAC (Heating, Ventilation, and Air Conditioning) units, built to redundantly maintain a set temperature and humidity in the room, had both failed. The Systems & Networking staff gathered in an IRC chatroom to discuss the best way to proceed, and were soon joined by several other IT staff who helped coordinate outside communication.

The datacenter's HVAC system is built to be redundant. The HVACs are even smart enough to use outside air when it's cold out, saving electricity. There are two of pretty much everything, so that if one component fails, there's an immediate backup. The system also periodically switches between the two units to make sure they're both working. During one such switchover on Thursday night, the unit being switched to did not function properly, and because of a failed pump the system did not "fail over" back to the unit that had been running. Instead, the whole system shut down, leaving the datacenter with no ventilation.
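For the curious, here's a minimal sketch of the switchover-and-failover logic described above, written in Python with invented names; the real controller is the HVAC vendor's firmware, not anything we wrote, so treat this purely as an illustration of the failure mode:

```python
# Hypothetical sketch of the controller's switchover logic -- the real thing
# is vendor HVAC firmware, and every name here is invented for illustration.

class HVACUnit:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy      # False models something like a dead pump
        self.running = False

    def start(self):
        self.running = self.healthy
        return self.running

    def stop(self):
        self.running = False


def periodic_switchover(active, standby):
    """Swap which unit carries the load, falling back if the standby won't start."""
    active.stop()
    if standby.start():
        return standby              # normal case: the roles simply swap
    if active.start():
        return active               # failover: go back to the unit that worked
    raise RuntimeError("both HVAC units down -- no ventilation")
```

On Thursday night, the real-world equivalent of that second start() never happened, which is how we ended up with the final RuntimeError in physical form.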

TURN IT ALL OFF!

From home, IT staff began turning off every system they could to help reduce the amount of heat being generated in the room. By 2:30 AM, every non-essential system was offline and Facilities had set up a portable AC unit, but the temperature dipped only to 105 degrees Fahrenheit. At that point it was clear to Facilities staff that something major had failed on the roof, where the HVAC system takes in air. Our third-party HVAC technicians were at least an hour away, so IT staff began shutting down every essential system except for DHCP, DNS, ECWireless, and the internet connection.

By the end of the shutdown process, IT had turned off every storage array, every physical system, and three of the four Virtual Machine host servers. The only equipment left powered on was one VM host for DNS and DHCP, the ECWireless controllers, and the physical routing gear for the internet connection. Dan, being the only IT staff member on site, had to physically unplug several systems that could not be turned off remotely. Despite these efforts, the temperature spiked again to 110 degrees Fahrenheit.
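To give a sense of the ordering, here is a rough sketch of the shutdown sequence. It is not our actual runbook: the hostnames are made up, and plenty of gear needed more than an SSH command.

```python
# Rough illustration only: hostnames are invented, and the real process was a
# mix of remote shutdowns, management consoles, and Dan pulling power cords.

import subprocess

# The handful of things that stayed on the whole time.
KEEP = {"dns-dhcp-vmhost", "wireless-controller-1", "wireless-controller-2",
        "border-router"}

# Everything else, roughly in the order it was taken down.
SHUTDOWN_ORDER = [
    "labserver1", "labserver2",          # non-essential systems first
    "storage1", "storage2",              # then the storage arrays
    "physhost1", "physhost2",            # physical servers
    "vmhost2", "vmhost3", "vmhost4",     # three of the four VM hosts
]

def shut_down(host):
    """Ask a host to power itself off over SSH."""
    subprocess.run(["ssh", host, "sudo", "shutdown", "-h", "now"], check=False)

for host in SHUTDOWN_ORDER:
    if host not in KEEP:
        shut_down(host)
```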

Luckily, our outside vendor technicians arrived on site around 2:30 AM and diagnosed the problem. They manually re-engaged the backup HVAC unit, which began cooling the datacenter. By 4:00 AM, the temperature was 85 degrees, an acceptable level to begin turning things back on. Turning things off in the right sequence is difficult, but turning them back on can be even more dangerous.

[Photo caption: We kept this on, thankfully.]

Fortunately, there were no major failures while turning equipment back on. A few failed hard drives here and there, and a couple of grumpy pieces of hardware, but nothing catastrophic. Nearly every system was back up and reinitialized in the proper order by 5:30 AM. Best of all, at no time was "the internet" disrupted, so there were no reports from students in the dorms about any problems. The off-and-on-again procedure followed by the IT staff was efficient and exemplified great teamwork, especially given that everyone (except Dan) was working from home, coordinating their efforts through an IRC chatroom. At the time of this writing, our HVAC redundancy has been restored and everything is back to normal.
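The power-on side is roughly the same sketch in reverse, with the important difference that each stage gets verified before the next one starts. Again, the hostnames and the liveness check are invented for illustration, not taken from our actual tooling.

```python
# Same caveats as above: an illustration of the ordering, not our runbook.

import subprocess
import time

POWER_ON_ORDER = [
    ["storage1", "storage2"],            # storage first, so VMs have their disks
    ["vmhost2", "vmhost3", "vmhost4"],   # then the remaining VM hosts
    ["physhost1", "physhost2"],          # physical servers
    ["labserver1", "labserver2"],        # non-essential systems last
]

def power_on(host):
    """Placeholder: real gear comes up via IPMI/iLO or someone pressing a button."""
    print(f"powering on {host}")

def healthy(host):
    """Crude liveness check: does the host answer a single ping?"""
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True).returncode == 0

for stage in POWER_ON_ORDER:
    for host in stage:
        power_on(host)
    # Don't move on until everything in this stage responds.
    while not all(healthy(h) for h in stage):
        time.sleep(30)
```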

We are currently exploring moving some, if not all, of our datacenter to a more secure, managed location, so that this type of catastrophe never happens again. We are also working to leverage external resources such as our new campus in LA and Amazon Web Services to further protect our systems and add redundancy.

Just another day in IT.
