BA's May bank holiday IT outage: what went wrong?

British Airways was forced to ground thousands of flights over the May bank holiday weekend after a "power surge" damaged one of its UK data centres at 9:30am on Saturday. This led to major systems for baggage, ticketing and check-in being taken offline.

Customers were left in the dark as contact centres were similarly affected.

The IT outage caused chaos at airports as BA cancelled all flights out of London Heathrow and Gatwick airports, affecting an estimated 75,000 passengers. The failure is expected to cost the airline around £100 million in compensation, along with untold reputational damage.

In an official statement BA said the outage was caused by "a power supply issue at one of our UK data centres. An exceptional power surge caused physical damage to our infrastructure and as a result many of our hugely complex operational IT systems failed".

BA categorically ruled out any chance of a cyber attack being responsible for the issue.

The main question in the aftermath is how an organisation like BA, on one of the busiest travel weekends of the year, could lack the redundancy needed to protect against an issue like this.

In his first interview since the systems failure, BA CEO Alex Cruz said: "There was a power surge and there was a back-up system, which did not work at that particular point in time."

Angela Eager of TechMarketView wrote: "A power surge could well have taken out power supply units across the infrastructure and if there was no software enabled provision to switch to a replica server or if it failed for some reason, then chaos would be the expected result."
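To illustrate the kind of "software enabled provision" Eager describes, here is a minimal Python sketch of failover from a primary server to a replica. It is not based on anything known about BA's systems: the `Backend` class, the `call_with_failover` function and the two stand-in handlers are hypothetical names invented for this example, and a real deployment would also need health checks, data replication and regular testing.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class AllBackendsFailed(Exception):
    """Raised when neither the primary nor any replica answers."""


@dataclass
class Backend:
    name: str                      # hypothetical label, e.g. "primary"
    handler: Callable[[str], str]  # stand-in for a real network call


def call_with_failover(backends: Sequence[Backend], request: str) -> Tuple[str, str]:
    """Try each backend in order and return (backend name, response)
    from the first one that answers."""
    errors: List[str] = []
    for backend in backends:
        try:
            return backend.name, backend.handler(request)
        except ConnectionError as exc:
            # Record the failure and fall through to the next backend.
            errors.append(f"{backend.name}: {exc}")
    # Every backend failed: the "chaos" case Eager describes.
    raise AllBackendsFailed("; ".join(errors))


def primary_down(request: str) -> str:
    """Hypothetical primary that has lost power."""
    raise ConnectionError("power lost")


def replica_ok(request: str) -> str:
    """Hypothetical healthy replica."""
    return f"handled {request}"


if __name__ == "__main__":
    pool = [Backend("primary", primary_down), Backend("replica", replica_ok)]
    print(call_with_failover(pool, "check-in"))
```

If the replica handler also fails, or no replica is configured at all, the call raises `AllBackendsFailed`, which is the situation Eager warns about.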

Although it is early days, and the whole picture of what occurred is incomplete, Andy Lawrence, research VP for datacenters and critical infrastructure at 451 Research, wrote that "datacenters are designed to deal with problems of this nature, including BA’s. Some systems in the power chain clearly failed to perform as expected".

"It is clear that BA has been grappling not with one problem, but several," he continued. "Starting with the power supplies, but extending to the network/messaging systems, and to the database/application design. Recovering from all these issues, when they extend across multiple teams, and involve multiple contractors, is challenging and requires well-oiled processes."

BA runs a complex set of key systems, including a customised version of the Amadeus check-in system called FLY. Some systems are managed by BA and others by its IT outsourcing partner Tata Consultancy Services in India.

The airline has also been using UK vendor Sunbird’s dcTrack Data Center Infrastructure Management (DCIM) solution since 2013. Keith Bott, service manager at British Airways, said at the time: "The new DCIM software allows us to quickly allocate space for new servers, manage power and network connectivity, issue work orders and provide capacity planning across all British Airways data centers."

The GMB union has blamed the problem on technical staff being outsourced to Tata Consultancy Services in India. However, Cruz refused to point the finger at the airline's outsourcing partners, saying there had been "locally hired" staff attending to the maintenance and running of the infrastructure, which was in the UK.

The airline says that it is now "undertaking an exhaustive investigation to find out the exact circumstances and most importantly ensure that this can never happen again".

Copyright © 2017 IDG Communications, Inc.
