Connectivity Issues - Newark
Incident Report for Linode
Postmortem

Postmortem Summary

At approximately 12:00:00 EST on February 21st, 2019, our monitoring systems detected widespread networking issues in our Newark, NJ data center. We have determined that a power feed to one of the redundant data center routers (Router1) was interrupted. Router1 has full chassis power redundancy and runs in a redundant configuration with Router2. When Router1 came back online and re-formed its adjacency with Router2, instability in some traffic flows was detected. Engineers immediately began troubleshooting the impacted router and isolated the problem to a corrupted neighbor (adjacency) table on Router1. The table was flushed and service was restored.

Timeline of Events

12:00:00 EST - A-side power on Router1 interrupted

12:05:00 EST - Linode Network Operations alerted to widespread network-related data center outage

12:06:00 EST - Incident response plan activated

12:15:00 EST - Router1 back online, reachability issues in the DC still apparent

12:25:00 EST - vPC consistency verified on router pair

12:35:00 EST - FIB consistency verified on router pair

12:50:00 EST - Router1 isolated from WAN routing, no change to impacted connectivity

13:00:00 EST - Router2 isolated from WAN routing, no change to impacted connectivity

13:25:00 EST - Router2 adjacency table flushed, no change to impacted connectivity

13:30:00 EST - Router1 adjacency table flushed

13:35:00 EST - Service restored
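
To cross-check the timeline against the UTC timestamps on the status updates, the following is a minimal sketch that converts a few of the entries above to UTC and computes the minutes elapsed since the power interruption. It is an illustration only, not part of Linode's tooling: the names `TIMELINE`, `parse`, and `elapsed_minutes` are hypothetical, and it assumes EST is a fixed UTC-5 offset on February 21st, 2019.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is a fixed UTC-5 offset (no daylight saving in February).
EST = timezone(timedelta(hours=-5), "EST")

# A subset of (time, event) pairs copied from the timeline above.
TIMELINE = [
    ("12:00:00", "A-side power on Router1 interrupted"),
    ("12:05:00", "Network Operations alerted"),
    ("12:15:00", "Router1 back online"),
    ("13:30:00", "Router1 adjacency table flushed"),
    ("13:35:00", "Service restored"),
]


def parse(hms: str) -> datetime:
    """Turn an HH:MM:SS string into an EST-aware datetime on 2019-02-21."""
    t = datetime.strptime(hms, "%H:%M:%S")
    return datetime(2019, 2, 21, t.hour, t.minute, t.second, tzinfo=EST)


def elapsed_minutes(start: str, end: str) -> int:
    """Whole minutes between two timeline entries."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


if __name__ == "__main__":
    first = TIMELINE[0][0]
    for hms, event in TIMELINE:
        utc = parse(hms).astimezone(timezone.utc)
        print(f"{hms} EST = {utc:%H:%M} UTC  +{elapsed_minutes(first, hms):3d} min  {event}")
```

Under these assumptions, service restoration at 13:35:00 EST corresponds to 18:35 UTC, 95 minutes after the power interruption and 80 minutes after Router1 came back online.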

Further Follow Up Still Needed

It is still not clear why we experienced a prolonged outage when a router was removed from the redundant pair. These routers have sustained many reboots during upgrades and are designed to maintain functionality when one is dropped from the pair. It is also not clear why it was necessary to flush Router1's adjacency table to restore connectivity when it came back up. Linode Network Operations plans to replicate the Newark environment in our lab and work with Cisco to find the root cause of these multiple failures.

Posted 6 months ago. Feb 22, 2019 - 16:09 UTC

Resolved
We have been able to correct the issues affecting our Newark data center. We will be closely monitoring connectivity in the Newark data center to ensure our services remain stable. A full post-mortem of the event will be available at a later date.
Posted 6 months ago. Feb 21, 2019 - 20:31 UTC
Monitoring
We have been able to correct the issues affecting our Newark data center. We will be monitoring this issue to ensure our services remain stable.
Posted 6 months ago. Feb 21, 2019 - 19:05 UTC
Update
Our team is still investigating this issue. We will continue to provide additional updates as the issue develops.
Posted 6 months ago. Feb 21, 2019 - 18:36 UTC
Update
We're continuing to work to restore normal connectivity in our Newark data center, and we'll continue to provide updates here.
Posted 6 months ago. Feb 21, 2019 - 17:56 UTC
Investigating
We are aware of connectivity issues affecting Linodes in our Newark data center and are currently investigating. We will continue to provide additional updates as this incident develops.
Posted 6 months ago. Feb 21, 2019 - 17:18 UTC
This incident affected: Regions (US-East (Newark)).