I should have picked up on the warning signs when I walked into the Indianapolis airport at 6:30 a.m. on Monday. A news crew was filming in front of Delta’s ticket counter, and a Delta representative was guiding everyone to check in at the counter because the check-in kiosks weren’t working. But I didn’t figure out what was going on until I got through security and saw the headlines on the news: Delta computer outage grounds flights worldwide.
It’s a sign that I work in the IT channel that my first thought wasn’t about how long I’d be stuck at the airport or whether I’d be able to catch my connecting flight. Instead, I wondered what could have caused such a massive outage, and then I felt bad for whoever was trying to fix it. Twitter seemed to agree.
This morning, I bet a lot of people who work in IT are saying – with an empathetic cringe – “I’m glad I’m not that #Delta guy”.
— Jenn (@buggazing) August 8, 2016
Somewhere, there’s a software or systems engineer saying “I told you so” #DeltaDown #Delta
— Bryan Robbins (@bryantrobbins) August 8, 2016
As I sat at the gate reading news stories about the system meltdown, I wondered what had happened to Delta’s backup systems and disaster recovery plans. According to a video apology from Delta Air Lines’ CEO, the problems were caused by a power outage at the company’s data center in Atlanta.
An update from Delta CEO Ed Bastian: pic.twitter.com/udNN0kzbKs
— Delta (@Delta) August 8, 2016
I wondered, though: shouldn’t a company, especially one that large, have redundant backup at an offsite data center? How could a single power outage cripple operations worldwide? Again, Twitter seemed to agree.
Just did a report for client on importance of testing critical backup generators under load. Now stuck at Logan #Delta
— Edward Davis (@EdDavis3) August 8, 2016
@annekcampbell @Intronis No question. I bet #Delta will close the barn door after this!
— Edward Davis (@EdDavis3) August 8, 2016
Equipment failure
Delta flights started back up around 9 a.m., and I ended up being delayed by only an hour. Other travelers weren’t so lucky. By the end of the day on Monday, about 1,000 flights were cancelled as a result of the Delta outage, and thousands more were delayed. Complications continued into Tuesday as Delta cancelled an additional 530 flights while it attempted to resume normal operations.
The cause of the massive outage was initially unclear. Delta pointed to a power outage, but according to reports, Georgia Power suggested that the problem was an equipment failure at Delta instead. And that turned out to be the case.
“Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power,” said Delta COO Gil West in a statement released Tuesday afternoon. “The universal power was stabilized, and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.”
Delta isn’t the first airline to experience a major disruption like this due to a technology failure. The Wall Street Journal pointed to two recent examples. In July, Southwest Airlines cancelled 2,300 flights over four days after a router malfunction at its data center in Texas forced the airline to reboot its entire system, a process that takes 12 hours. Last year, United Airlines grounded several flights after router issues caused network problems.
Placing blame
While the problems have caused some to question why Delta hasn’t moved to the cloud yet, others pointed to consolidation in the airline industry leading to companies that are too large and too dependent on dated legacy IT systems and equipment. And some suggested that a reliance on IT offshoring and a poorly tested disaster recovery plan had a role in how the crisis played out. As Robert Cringely of BetaNews put it: “Anything less than a 100-percent service backup isn’t disaster recovery, it is disaster coping.”
No matter what the underlying causes are, the system outage and prolonged recovery are going to be costly for Delta, both in terms of lost revenue and damage to its reputation.
MSPs should see this as a large-scale illustration of the importance of proper backup and disaster recovery and the dangers of relying on legacy systems. If one of your SMB customers’ systems goes down, it might not strand thousands of travelers or make national news, but the Delta outage is an example you can use to help SMBs understand how vital these precautions are.
Photo Credit: Bulent Kavakkoru via Flickr.com. Used under CC 2.0 License.