Connectivity Issue - Dallas, Cloud Manager, API
Incident Report for Linode
Postmortem

On December 13, 2022, at approximately 00:05 UTC, monitoring systems alerted on reachability issues in the Dallas data center. At 00:10 UTC an incident was raised, the incident response team was deployed, and investigation began. A status page was published at 00:27 UTC once the scope of the issue was understood. By 00:55 UTC it was identified that a large amount of broadcast traffic was impacting hypervisors and causing intermittent latency and reachability problems for a subset of Linodes, approximately 80-90% of the Dallas data center. Continued investigation revealed that most of the broadcast traffic came from unanswered ARP requests, resulting in a cascade and amplification of those broadcasts. Having identified the broadcast storms as the source of the network instability, our teams still needed to identify the underlying cause and a way to mitigate it.

After extensive investigation, at approximately 04:20 UTC, a filter was deployed to the fleet to stem these broadcasts, allowing hosts to regain CPU cycles and answer ARP requests consistently. The network then recovered and stabilized, and alerts began to resolve en masse by 04:32 UTC. After additional monitoring of this fix, the incident was marked as resolved at 06:45 UTC.

The immediate fix was initially deployed as a temporary measure. After further testing, it was implemented permanently. This work was finalized on December 13, 2022, at 18:35 UTC.

Further investigation is ongoing to determine the root cause of this issue, taking into account the other recent incidents in our Dallas data center. As infrastructure upgrades are applied to our active network, we are putting processes in place to ensure its continued stability. We anticipate that these efforts will prevent incidents like this one in the future.

Posted Dec 21, 2022 - 19:03 UTC

Resolved
This incident has been resolved.
Posted Dec 13, 2022 - 06:45 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 13, 2022 - 05:17 UTC
Update
We are continuing to investigate the connectivity issue affecting our Dallas data center. We will provide additional updates once we have them available.
Posted Dec 13, 2022 - 04:04 UTC
Update
We are continuing to investigate the connectivity issue affecting our Dallas data center. We will provide additional updates as soon as possible.
Posted Dec 13, 2022 - 03:04 UTC
Update
We are continuing to investigate the connectivity issue affecting our Dallas data center. We will provide additional updates as soon as possible.
Posted Dec 13, 2022 - 01:46 UTC
Investigating
Our team is investigating a connectivity issue in our Dallas data center. During this time, users may experience connection timeouts and errors for all services deployed in this data center. Customers may also experience connectivity issues reaching Cloud Manager and the Linode API. We will share additional updates as we have more information.
Posted Dec 13, 2022 - 00:27 UTC
This incident affected: Regions (US-Central (Dallas)) and Linode Manager and API.