Connectivity Issue - Singapore
Incident Report for Linode
Postmortem

On 12-16-2020, at approximately 11:45 UTC, our monitoring systems alerted us to a connectivity issue in the Singapore data center. These monitoring systems indicated a large drop in traffic to and from the data center; however, no clear cause was apparent, so the Network Operations team continued to troubleshoot. Our Singapore data center has a core switch pair in an active/active redundant configuration. The team was able to isolate the issue to the B-side switch, which had stopped forwarding traffic without triggering a failover.
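
To illustrate the failure mode described above, here is a minimal Python sketch of a check that flags a switch which still reports itself healthy while its forwarding rate has collapsed. It is not the monitoring used during this incident: the `CounterSample` type, the baseline rate, and the `min_fraction` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CounterSample:
    """One poll of a switch (hypothetical shape, for illustration only)."""
    timestamp: float        # seconds since an arbitrary start
    link_up: bool           # does the switch report itself as up?
    forwarded_packets: int  # cumulative count of packets forwarded so far


def forwarding_rate(prev: CounterSample, curr: CounterSample) -> float:
    """Packets per second forwarded between two consecutive polls."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.forwarded_packets - prev.forwarded_packets) / elapsed


def is_silent_forwarding_failure(
    samples: Sequence[CounterSample],
    baseline_pps: float,
    min_fraction: float = 0.05,  # assumed threshold, not a vendor value
) -> bool:
    """Return True if the switch looks up but has nearly stopped forwarding.

    Only the two most recent samples are compared against the baseline.
    """
    if len(samples) < 2:
        return False  # not enough data to compute a rate
    prev, curr = samples[-2], samples[-1]
    if not curr.link_up:
        # The switch is reporting a failure itself, so this is not the
        # "silent" case this check is meant to catch.
        return False
    return forwarding_rate(prev, curr) < baseline_pps * min_fraction
```

For example, a switch that forwarded only 100 packets over a 10-second window against a baseline of 100,000 packets per second would be flagged, while a switch that reports itself as down would be left to whatever handles ordinary failures.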

The team then forced a failover to the A-side switch, which immediately restored full connectivity to the data center. The Network Operations team then proactively rebooted the B-side switch.

When the B-side switch completed its power cycle, diagnostics checked out, so it was returned to service. The Network Operations team engaged the vendor to investigate the cause of the issue further, but nothing appeared out of the ordinary other than that the switch was discarding traffic.

On 12-18-2020 at 16:32 UTC, our monitoring systems again alerted us to connectivity issues in the Singapore data center. This time, though, the A-side switch had stopped forwarding traffic.

A forced failover and power cycle of the affected switch brought the data center back to full service. We continued to work with the vendor; however, we were still unable to determine why these switches had stopped forwarding traffic.

We experienced the same issue two more times, on 12-20-2020 at 11:01 UTC, and again on 12-21-2020 at 16:38 UTC. 

Since our investigation with the vendor did not produce a clear solution, we began operating under the premise that this was an unidentified bug in the code version the switches were running. Our team decided to schedule an emergency maintenance and proactively upgrade the switches to another recommended code version.

On 12-21-2020 at 21:00 UTC, an emergency maintenance was performed to upgrade the software of each switch. This maintenance was completed without downtime. 

We have not experienced any technical issues since the completion of this maintenance, and we believe this issue to be fully resolved at this point. We are still working with the vendor to identify the underlying bug; however, a final cause has yet to be identified.

Posted Jan 05, 2021 - 16:28 UTC

Resolved
We haven't observed any additional connectivity issues in our Singapore data center, and will now consider this incident resolved.
Posted Jan 05, 2021 - 16:19 UTC
Update
We are aware of the successive connectivity issues in Singapore over the past week, and we want to acknowledge the impact they have had. As an update, we believe these issues stem from a software bug in our redundant aggregate switches. We have completed an initial emergency maintenance to address this bug, and we will continue to work with our vendor to identify a long-term fix. We will keep this page updated with more information as we have it, and we will provide a post-mortem once the issue is resolved.
Posted Dec 22, 2020 - 01:05 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 21, 2020 - 20:42 UTC
Monitoring
At this time we have been able to correct the issues affecting connectivity in our Singapore data center. We will be monitoring this to ensure that it remains stable. If you are still experiencing issues, please open a Support ticket for assistance.
Posted Dec 21, 2020 - 18:01 UTC
Identified
Our team has identified the issue affecting connectivity in our Singapore data center. We are working quickly to implement a fix, and we will provide an update as soon as the solution is in place.
Posted Dec 21, 2020 - 17:41 UTC
Investigating
Our team is investigating a connectivity issue in our Singapore data center. During this time, users may experience connection timeouts and errors for all services deployed in this data center. We will share additional updates as we have more information.
Posted Dec 21, 2020 - 16:49 UTC
This incident affected: Regions (AP-South (Singapore)).