On the morning of January 20th, 2017, our London datacenter experienced three incidents of partially degraded internet connectivity.
Our London datacenter is serviced by two leased, geographically diverse dark fiber spans between our colocation facility and nearby points of presence. All of our transit and peering connections are backhauled from these two points of presence.
Each of these three incidents was caused by multiple brief losses of light on our dark fiber spans:
[otn2] 7:41:18 to 7:46:58 UTC (5m 40s)
[otn2] 7:48:29 to 7:51:30 UTC (3m 1s)
[otn1] 8:26:34 to 8:27:58 UTC (1m 24s)
[otn1] 8:28:19 to 8:28:34 UTC (15s)
[otn1] 8:28:59 to 8:29:13 UTC (14s)
[otn1] 8:29:19 to 8:29:26 UTC (7s)
[otn1] 8:30:13 to 8:30:24 UTC (11s)
[otn2] 9:09:53 to 9:13:09 UTC (3m 16s)
[otn2] 9:13:57 to 9:14:36 UTC (39s)
[otn2] 9:15:02 to 9:17:27 UTC (2m 25s)
As the times above show, the two spans never failed at the same time; however, the rapidly changing link states triggered a pathological failure case for BGP, which exacerbated the service impact.
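The claim that the spans never failed simultaneously can be checked mechanically from the listed times. The following is a minimal Python sketch for that purpose; the function names (`downtime_per_span`, `spans_overlap`) are illustrative and not part of any Linode tooling, and the data is copied directly from the list above.

```python
from datetime import datetime, timedelta

# Loss-of-light windows copied from the list above: (span, start, end), all UTC.
OUTAGES = [
    ("otn2", "7:41:18", "7:46:58"),
    ("otn2", "7:48:29", "7:51:30"),
    ("otn1", "8:26:34", "8:27:58"),
    ("otn1", "8:28:19", "8:28:34"),
    ("otn1", "8:28:59", "8:29:13"),
    ("otn1", "8:29:19", "8:29:26"),
    ("otn1", "8:30:13", "8:30:24"),
    ("otn2", "9:09:53", "9:13:09"),
    ("otn2", "9:13:57", "9:14:36"),
    ("otn2", "9:15:02", "9:17:27"),
]


def parse(clock):
    """Parse an H:MM:SS clock time; every window falls on the same day."""
    return datetime.strptime(clock, "%H:%M:%S")


def windows():
    """Return the outage list with start and end parsed into datetimes."""
    return [(span, parse(start), parse(end)) for span, start, end in OUTAGES]


def downtime_per_span():
    """Sum the loss-of-light time for each span."""
    totals = {}
    for span, start, end in windows():
        totals[span] = totals.get(span, timedelta()) + (end - start)
    return totals


def spans_overlap():
    """Return True if any window on one span intersects a window on the other."""
    w = windows()
    return any(
        a_span != b_span and a_start < b_end and b_start < a_end
        for a_span, a_start, a_end in w
        for b_span, b_start, b_end in w
    )


if __name__ == "__main__":
    for span, total in sorted(downtime_per_span().items()):
        print(span, total)
    print("simultaneous failure:", spans_overlap())
```

Run against the times listed, this reports no overlapping windows, with otn1 dark for 2m 11s in total and otn2 for 15m 1s.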
We have not yet determined the root cause of these light failures. We have verified that our own equipment was operating normally at the time. Additionally, there was no scheduled maintenance reported by any of our infrastructure providers during the relevant time periods.
To minimize the impact of similar incidents going forward, we have decided to prioritize implementing Bidirectional Forwarding Detection (BFD) on the BGP sessions with our peers, wherever possible. Additionally, we will be implementing stricter BGP flap dampening.
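For readers unfamiliar with flap dampening, the general idea (in the spirit of RFC 2439) is that each flap adds a penalty to a route, the penalty decays exponentially over time, and the route is suppressed while the penalty stays above a threshold. The Python sketch below illustrates that mechanism only. The class name and all four parameter values are assumptions chosen for illustration; they are not our router configuration or any vendor's exact algorithm.

```python
# Illustrative parameters only (assumed values, not an actual configuration).
PENALTY_PER_FLAP = 1000   # penalty added each time the route flaps
HALF_LIFE_S = 15 * 60     # the penalty halves every 15 minutes
SUPPRESS_LIMIT = 2000     # suppress the route at or above this penalty
REUSE_LIMIT = 750         # re-advertise once the penalty decays below this


class DampenedRoute:
    """Hypothetical model of one route's dampening state."""

    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE_S)
        self.last_update = now
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

    def flap(self, now):
        """Record a flap at time `now` (seconds)."""
        self._decay(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def is_suppressed(self, now):
        """Report whether the route is suppressed at time `now` (seconds)."""
        self._decay(now)
        return self.suppressed


if __name__ == "__main__":
    route = DampenedRoute()
    for t in (0, 60, 120):  # three flaps, one minute apart
        route.flap(t)
        print(t, round(route.penalty), route.suppressed)
```

With these assumed values, a single flap does not suppress the route, but several flaps in quick succession do. "Stricter" dampening corresponds to lowering the suppress limit, raising the per-flap penalty, or lengthening the half-life.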
Alex Forster, Network Engineer, Linode