Emergency Network Maintenance - Atlanta
Incident Report for Linode

Atlanta partial network outage on February 21st, 2017

Summary

On the evening of Tuesday, February 21st, 2017, at 8:40pm EST (local time), a subset of hosts in our Atlanta datacenter lost all network connectivity for approximately two hours due to the failure of a redundant pair of distribution-layer switches.

Date: 2017-02-21
Outage Start: 8:40pm EST
Outage End: 10:30pm EST
Total duration: 1h 50m
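
The times above are given in EST, while the status updates further down are stamped in UTC. As a minimal sketch for lining the two up, the snippet below recomputes the duration and converts the outage window to UTC. It assumes EST is a fixed UTC-5 offset (no daylight saving in February); the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST modeled as a fixed UTC-5 offset (valid in February).
EST = timezone(timedelta(hours=-5), name="EST")

# Outage window as stated in the report (local Atlanta time).
outage_start = datetime(2017, 2, 21, 20, 40, tzinfo=EST)  # 8:40pm EST
outage_end = datetime(2017, 2, 21, 22, 30, tzinfo=EST)    # 10:30pm EST

# Total duration of the outage.
duration = outage_end - outage_start

# The same instants expressed in UTC, for comparison with the status posts.
start_utc = outage_start.astimezone(timezone.utc)
end_utc = outage_end.astimezone(timezone.utc)

print(duration)                                   # 1:50:00
print(start_utc.strftime("%Y-%m-%d %H:%M UTC"))   # 2017-02-22 01:40 UTC
print(end_utc.strftime("%Y-%m-%d %H:%M UTC"))     # 2017-02-22 03:30 UTC
```

In UTC the outage window falls on February 22nd, which is why the status updates below carry that date.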

Timeline

At 2pm on February 21st, 2017, our network monitoring alerted us to a partial control plane failure on a single distribution-layer switch. Each of these switches runs in an active/active configuration with a peer switch. We determined at the time that this partial control plane failure was not causing any data plane forwarding issues, meaning that we had time to work with the switch vendor and schedule an appropriate maintenance window before taking action.

Later that day at around 7pm, our systems administrators were made aware of several incidents of unreachable Linodes within the Atlanta datacenter. These Linodes were all determined to be under the affected switch pair identified earlier, and the symptoms pointed toward intermittent, widespread switching failure.

Several members of the Network Operations team and Systems team conferred for approximately one hour. As we discovered the scope of the issue, a status page was posted at 8:15pm indicating that there was a major network hardware failure in progress.

After several attempted fixes proved unsuccessful, it was agreed that the simplest and most straightforward solution would be to take the degraded switch offline. Under nominal operating conditions, the switch's peer was designed to handle this kind of failure seamlessly, without traffic interruption. Another status update was posted at 8:30pm noting that this action was imminent, and power was cut to the degraded switch at 8:40pm.

Unfortunately, due to circumstances that we still do not fully understand, the peer switch did not handle the failure gracefully, causing an extended network outage for hosts that were homed under this switch pair.

Some time was spent unsuccessfully attempting to coerce the remaining switch into a working state, but we were eventually forced to take both of the switches offline entirely, then bring them back online and rejoin them into an active/active pair. As of 10:30pm, the switches had both been brought online and were confirmed to be operating normally.

I would like to sincerely apologize to the customers who were affected by this extended network outage. We will be asking our switching vendor for a technical investigation into the cascading failures we saw during this incident, and we will apply any remedial fixes that are necessary in the near future.

Posted Feb 22, 2017 - 17:06 UTC

Resolved
Because we have not experienced additional connectivity issues affecting our Atlanta data center, this matter is now resolved.
If you are still experiencing connectivity issues, please reach out to our Customer Support Team for assistance.
Posted Feb 22, 2017 - 04:52 UTC
Monitoring
Normal connectivity in Atlanta has been restored at this time; however, we will continue to monitor in case any additional issues arise.
Posted Feb 22, 2017 - 03:25 UTC
Investigating
We are aware of connectivity issues affecting Linodes in our Atlanta data center and are currently investigating. We will update this post with any additional information as it becomes available.
Posted Feb 22, 2017 - 02:02 UTC
Update
We will be taking the affected piece of network equipment offline shortly. Because of our redundant switching infrastructure, we do not expect there to be any significant impact to customer traffic. However, customers may observe brief periods of increased latency or packet loss.
Posted Feb 22, 2017 - 01:31 UTC
Identified
We have identified a traffic routing issue with a distribution layer switch that serves a subset of physical hosts in Atlanta.
Posted Feb 22, 2017 - 01:16 UTC
This incident affected: Atlanta.