On 2021-07-13 at 20:40 UTC, Linode's Network Operations team responded to alerts of networking issues in our Newark data center. Upon investigation, the issue presented as latency and packet loss between Linodes in the data center.
The Network Operations team identified a problem with the primary core switch in a redundant switch pair that connects two pods. The team isolated the primary switch and continued to troubleshoot.
Around this time, the Network Operations team engaged our switch vendor to assist in troubleshooting. An emergency maintenance notification was posted to our Linode Statuspage at 21:35 UTC. After the notification was posted, the Network Operations team prepared the affected switch for a reboot. Post-reboot, the switch still exhibited the same behavior. During this time, the Network Operations team continued to work with the vendor to identify the cause of these issues.
Shortly before 03:00 UTC, the secondary switch began exhibiting the same behavior as the primary switch. Based on the vendor's recommendation, the Network Operations team decided to revert to a different version of the switch operating system and prepared the primary switch to be downgraded and reprovisioned. At 03:00 UTC, while the primary switch was being reprovisioned, the secondary switch entered a completely failed state, stopping all connectivity between the two pods.
Connectivity was restored around 03:40 UTC, following completion of the downgrade and reprovisioning of the primary switch. The Network Operations team then isolated the secondary switch and performed the same downgrade and reprovisioning. We continued to monitor for latency or packet loss in Newark, and services were fully restored at 04:42 UTC.
We’re continuing to work with the vendor to determine the root cause, which is most likely a software bug; we had been running this code for many months without issue.