Inside AT&T’s Hurricane Sandy disaster recovery operation

AT&T, a world leader in telecommunications services, chose New Jersey as the base for its global network operations centre (GNOC) after an analysis found that the area was easily accessible to key cities and the global markets, and was also considered a ‘low-risk area’ for earthquakes and hurricanes.

However, no location is 100 percent protected from natural disasters, which was proven true when Hurricane Sandy hit the east coast of the US the week beginning the 28th of October. Sandy is one of the largest hurricanes on record and caused mass devastation throughout the states of New Jersey and New York, not to mention across the Caribbean and Mid-Atlantic, which cost hundreds of people their lives and homes.

For AT&T this could have resulted in disaster for its network. It had to protect its mobile services and its core network, help its customers protect their infrastructure and support government rescue missions, and do it all from the GNOC in New Jersey, which was directly in the path of the storm.

Computerworld UK spoke to Chris Costello, AT&T’s VP of cloud strategy, who explained the scale of what was at risk during Hurricane Sandy, and what is at risk during any other natural disaster that occurs globally.

“Business continuity and disaster recovery is a way of life for us. We transmit 33 petabytes of data across our network on an average day, we have over 9,000 network buildings, 211,000 satellite locations, 3,000 MPLS nodes, over 900,000 miles of network fibre – that’s a lot to protect, and so constant monitoring and testing is critical for us,” said Costello.

“AT&T’s network disaster recovery programme is in its twentieth year and we have put 125,000 work hours into it and invested over $600 million (£372 million). We have completed over 61 recovery exercises in the field, and we do testing on three or four different sites every year. We test and simulate as if an actual disaster is occurring.”

Weathering the storm

Costello explained that because the business constantly treats disaster recovery as key to keeping the network up and running, AT&T was prepared to deal with the fallout when it learned that Sandy was going to hit the east coast of the US.

“The network disaster team was activated to provide emergency response communication to areas hit by Sandy on October 28th. We have over 2,00 responders providing support 24 hours a day, 7 days a week during this period. We dispatched over 3,000 generators, where we had 14,000 cell sites on generator at the peak of the storm,” she said.

“We activated our emergency management processes, our regional operation centres and two local response centres. We had cells on light trucks and terrestrial cells on wheels rolled into the area, which are essentially mobile cellular services that we deploy into the region so that the people who didn’t have any way to communicate, who had no power, could do so using the cellular communications on wheels.”

She added: “This helped not only the residents with their communication, but also the government emergency response teams during the disaster.”

Costello said that some of the cells on wheels, light trucks and emergency communications vehicles are still deployed in the hardest hit areas of New York and New Jersey, but during the entire storm AT&T didn’t suffer an outage on its core network. “We were fully prepared,” she said.

However, AT&T didn’t just have to worry about its network infrastructure, but also about its own datacentres and some of its customers’ datacentres. It protected these by deploying additional equipment to potential disaster areas and by using its cloud-based services for burst capacity.

“We had vendors that were located near to our data centres in case we needed to call on them on an emergency basis. Throughout the peak of the storm they were all staying in their hotels in the north east with their equipment. The key data centres we looked at during this timeframe were New York, New Jersey, Virginia and Massachusetts,” said Costello.

“They all performed according to plan. We did have to go on to generators at times, but everything ran smoothly.”

She added: “In terms of our customers in the north-east, some of them were able to call on their account team on an emergency basis, because they weren’t configured within their own data centres to properly fail over to another centre. So we worked with them to provide capacity and resources within our datacentres.”

Unlike Costello, who is involved with disaster recovery from a strategic viewpoint, Robert Desiato, AT&T’s director of disaster recovery, is involved with the front-line operations of handling any major events, including Hurricane Sandy.

Desiato was in Los Angeles when he heard the storm was going to hit the New York area and was unable to get a flight out of LAX to the region. He instead had to fly to Atlanta and then drive up with the disaster recovery trucks, arriving just in time for Sandy to hit.

His team has 30 full-time members, but has also called on up to 100 volunteers from within AT&T at any one time.

“Our trailers have all the equipment that’s in a central office on them and we can pull them around anywhere. Whichever parking lot we end up in becomes the central office. We don’t have to worry about a building, we have power in there, we have all the connections coming in, and we can recreate that central office,” he said.

However, Desiato did admit that with every major event lessons are learnt, and this was no different for Sandy.

He said: “Every time we respond to a disaster we learn something. We have been doing this for twenty years, but we learned a couple of lessons with Sandy – mainly that there were some holes in our documentation, which needs to be improved.

“We have got a standards team, which takes these issues and assigns them to somebody who tracks them to resolution.”

Evacuating the GNOC

Finally, Computerworld UK spoke to Steve Moser, AT&T’s GNOC visitor program manager. The GNOC is the centre of AT&T’s network management programme. It is where the company monitors all of the traffic travelling across its network, identifies any problems that occur and actions any resolutions. The GNOC is where AT&T was monitoring the impact of Sandy, until the power went out.

“There is a disaster recovery centre, which isn’t too far from the GNOC, which we did use during Hurricane Sandy, because commercial power was lost. We were out of power for a little over a week,” explained Moser.

“The building has its own generator, so we were able to continue our work on that for a while, but after a few days the generator failed and we had to operate on battery – which we can do for about an hour. That’s all the time we had to move out of here.”

He added: “There’s a procedure in place for getting everyone out of here and working from that disaster recovery site, which we managed to successfully do within that time frame.”

Copyright © 2012 IDG Communications, Inc.
