FAA: Sun box disk failure caused NOTAM database crash
System issues preflight notices to pilots on airports, airspace and security issues
The NOTAM (Notice to Airmen) system provides pilots with notices regarding airports, equipment and security issues. The system went down late on May 22 and was back up at around 7 p.m. on May 23.
Because of the disk failure, information had to be delivered to pilots through local air traffic controllers and alternate systems, including a Web site set up to disseminate the most up-to-date information, said Barry Davis, manager of aeronautical information management for the FAA. However, flight safety was never a problem, the FAA said.
"What happened was the drive in an end-of-life Sun box failed in the middle of updating the information on the hard drive, so it screwed up the database," Davis said.
Davis said that was the beginning of the complications. Davis' team replaced the hardware and the drive on May 22, which got the system running again.
"We already had the equipment to replace [the box], we just hadn't done it yet, and that's why the hardware recovery was quite simple -- we just put the boxes in," Davis said.
But even then, the system was running slowly, in a degraded mode, and performance got so bad, Davis said, that his team decided to reopen the problem to see what was going on.
While working to fix the database, the technicians decided to switch to the backup system. In doing so, they soon realized they had written the error over to the backup system and corrupted it as well, Davis said.
"So because we had already replaced the hardware and the drives, we just had to pull the latest information and extract it out of the [corrupted] database, then re-import it into the [new] database," Davis said. "Then we resynchronized all of the subsystems so everyone had the same database copy, and then we opened the gates up at 4:40 p.m. on Friday so that all of the information would come into the system."
Davis and his team spent the rest of that night monitoring the situation to make sure there were no other errors.
While the automated system was out, pilots and other affected organizations were able to get the latest information from a Web site set up for that purpose. Although everything was updated by 7 p.m. on Friday, Davis said the decision was made to keep the Web site up until midnight as a precaution.
Read more about Disaster Recovery in Computerworld's Disaster Recovery Topic Center.