Network Connectivity - London
Incident Report for Linode
Postmortem

On the morning of January 20th, 2017, our London datacenter experienced three incidents of partially degraded internet connectivity:

  • 7:41 to 7:55 UTC (14m)
  • 8:26 to 8:36 UTC (10m)
  • 9:09 to 9:18 UTC (9m)

Our London datacenter is served by two leased, geographically diverse dark fiber spans that connect our colocation facility to nearby points of presence. All of our transit and peering connections are backhauled from these two points of presence.

Each of these three incidents was caused by multiple brief losses of light on our dark fiber spans:

[otn2] 7:41:18 to 7:46:58 UTC (5m 40s)
[otn2] 7:48:29 to 7:51:30 UTC (3m 1s)
[otn1] 8:26:34 to 8:27:58 UTC (1m 24s)
[otn1] 8:28:19 to 8:28:34 UTC (15s)
[otn1] 8:28:59 to 8:29:13 UTC (14s)
[otn1] 8:29:19 to 8:29:26 UTC (7s)
[otn1] 8:30:13 to 8:30:24 UTC (11s)
[otn2] 9:09:53 to 9:13:09 UTC (3m 16s)
[otn2] 9:13:57 to 9:14:36 UTC (39s)
[otn2] 9:15:02 to 9:17:27 UTC (2m 25s)

As the list above shows, the two spans never lost light at the same time, so at least one span was lit throughout. However, the rapidly changing link states triggered a pathological failure case for BGP, which exacerbated the service impact.
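The non-overlap of the loss-of-light windows listed above can be checked mechanically. The following is a minimal sketch, not tooling we actually used: the timestamps are copied from the list, while the names `OUTAGES`, `parse`, and `overlaps` are hypothetical and exist only for this illustration.

```python
from datetime import datetime

# Loss-of-light windows copied from the list above: (span, start, end), all UTC.
OUTAGES = [
    ("otn2", "7:41:18", "7:46:58"),
    ("otn2", "7:48:29", "7:51:30"),
    ("otn1", "8:26:34", "8:27:58"),
    ("otn1", "8:28:19", "8:28:34"),
    ("otn1", "8:28:59", "8:29:13"),
    ("otn1", "8:29:19", "8:29:26"),
    ("otn1", "8:30:13", "8:30:24"),
    ("otn2", "9:09:53", "9:13:09"),
    ("otn2", "9:13:57", "9:14:36"),
    ("otn2", "9:15:02", "9:17:27"),
]


def parse(clock):
    """Parse an H:MM:SS clock time (the date is irrelevant here)."""
    return datetime.strptime(clock, "%H:%M:%S")


def overlaps(a, b):
    """Return True if two (start, end) windows share any instant."""
    return a[0] < b[1] and b[0] < a[1]


# Group the windows by span.
windows = {}
for span, start, end in OUTAGES:
    windows.setdefault(span, []).append((parse(start), parse(end)))

# Every pair of windows, one from each span, that overlap in time.
simultaneous = [
    (a, b)
    for a in windows["otn1"]
    for b in windows["otn2"]
    if overlaps(a, b)
]

# Total seconds without light, per span.
total_seconds = {
    span: sum((end - start).total_seconds() for start, end in spans)
    for span, spans in windows.items()
}

print("simultaneous failures:", len(simultaneous))
print("seconds without light:", total_seconds)
```

Run against the windows above, this finds no simultaneous failures and totals 131 seconds without light on otn1 and 901 seconds on otn2. The ten windows also amount to twenty link state changes in roughly an hour and a half.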

We have not yet determined the root cause of these losses of light. We have verified that our own equipment was operating normally at the time. Additionally, none of our infrastructure providers reported scheduled maintenance during the relevant time periods.

To minimize the impact of similar incidents going forward, we are prioritizing the rollout of Bidirectional Forwarding Detection (BFD) on BGP sessions with our peers wherever possible. We will also implement stricter BGP flap dampening.
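For readers unfamiliar with flap dampening, the sketch below illustrates the general penalty-and-decay model described in RFC 2439. It does not describe our configuration. Every constant is an assumption chosen for illustration (a penalty of 1000 per flap, a suppress limit of 2000, a reuse limit of 750, and a 15-minute half-life are commonly cited defaults), and treating each loss of light as exactly one flap is a simplification. The names `decay` and `simulate` are hypothetical.

```python
# Illustrative constants only; these are assumptions, not our settings.
PENALTY_PER_FLAP = 1000.0
SUPPRESS_LIMIT = 2000.0
REUSE_LIMIT = 750.0
HALF_LIFE_S = 15 * 60


def decay(penalty, elapsed_s):
    """Exponentially decay a penalty: it halves every HALF_LIFE_S seconds."""
    return penalty * 0.5 ** (elapsed_s / HALF_LIFE_S)


def simulate(flap_times_s):
    """Return (time, penalty, suppressed) recorded after each flap.

    flap_times_s is an ascending list of flap times in seconds.
    """
    penalty = 0.0
    last = None
    suppressed = False
    history = []
    for t in flap_times_s:
        if last is not None:
            # Age the accumulated penalty since the previous flap.
            penalty = decay(penalty, t - last)
            if suppressed and penalty < REUSE_LIMIT:
                suppressed = False
        penalty += PENALTY_PER_FLAP
        if penalty >= SUPPRESS_LIMIT:
            suppressed = True
        last = t
        history.append((t, penalty, suppressed))
    return history


# Offsets in seconds of the five otn1 losses of light, measured from 8:26:34 UTC.
otn1_flaps = [0, 105, 145, 165, 219]
for t, penalty, suppressed in simulate(otn1_flaps):
    print(f"t={t:>3}s penalty={penalty:7.1f} suppressed={suppressed}")
```

Under these assumed values, the third flap in the 8:26 cluster pushes the penalty past the suppress limit. Stricter dampening, in this model, means lowering the suppress limit or lengthening the half-life so that a burst of flaps is suppressed sooner and held down longer.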

Alex Forster
Network Engineer, Linode

Posted Jan 20, 2017 - 18:28 UTC

Resolved
This incident is resolved. We will be posting a post-mortem shortly.
Posted Jan 20, 2017 - 18:14 UTC
Monitoring
Connectivity to our London datacenter has been fully restored. We'll continue to monitor this situation and provide updates as necessary.
Posted Jan 20, 2017 - 12:53 UTC
Investigating
We are currently experiencing connectivity issues within our London datacenter. Our Network Operations team is aware and is currently investigating.
Posted Jan 20, 2017 - 09:18 UTC
Monitoring
Connectivity has been restored and we are monitoring for any residual issues.
Posted Jan 20, 2017 - 08:53 UTC
Investigating
We are aware of an issue within our London datacenter and are investigating at this time. We will provide additional information as it becomes available.
Posted Jan 20, 2017 - 07:55 UTC
This incident affected: Regions (EU-West (London)).