Block Storage Performance Issues – Newark
Incident Report for Linode
Postmortem

Incident Summary

At approximately 15:56 UTC on August 20th, 2018, our monitoring systems alerted us to multiple network ports going down in one of our Newark, NJ block storage clusters. Due to redundancy within the cluster, there was no impact to block storage at this time. After troubleshooting, engineers from the network and storage teams determined the problem was linked to a bad line card on one of the redundant switches in the block storage cluster. Customers utilizing block storage would not have seen any impact to their storage services during this time. At 17:15 UTC, a ticket was opened with our data center partner to move the bad line card from the impacted switch into a spare switch chassis for further troubleshooting.

At 17:59 UTC, our monitoring systems reported many block storage nodes going completely offline. Our engineers determined that our data center partner had removed a line card from the redundant, unaffected block storage switch, which in turn brought down network connectivity completely for several block storage nodes. Because client IO is not prioritized while the block storage cluster rebalances, the performance of customers' block storage services was severely impacted during this period. A call was immediately placed to our data center partner to have them reinsert the line card into the redundant chassis. Once that was completed, the block storage servers regained network connectivity at 18:42 UTC and client IO started to recover. Customers would have seen performance for their storage services start to improve at this time. The cluster reported a full recovery at 19:56 UTC, and no customer data was lost during this event.

Summary Timeline

  • 15:56 UTC - Redundant line card in block storage cluster reports down, no customer impact.
  • 17:15 UTC - After troubleshooting, the engineering team opens a ticket with the DC partner to move the bad line card into a spare chassis for further troubleshooting.
  • 17:59 UTC - The DC partner removes a line card from the "good" switch, causing a portion of the block storage nodes to go fully offline; client IO is severely impacted.
  • 18:42 UTC - The DC partner reinserts the line card into the redundant switch; client IO starts to improve.
  • 19:56 UTC - The block storage cluster reports a full recovery; no remaining client IO impact.

Next Steps

After a thorough review of our processes for communicating with DC partners, we have added another step when working on production network equipment. Whenever the DC partner is tasked with any work on production equipment that could impact customers, we will now require the DC partner to be live on the phone. The expected result is that, if a mistake is made, the rollback will be implemented in a much shorter period of time.

Posted Aug 29, 2018 - 21:31 UTC

Resolved
Performance of Block Storage volumes in Newark has remained consistent since we have corrected the issue, and we're confident that this matter has been resolved. We’ll be posting an RCA for this incident on our status page in the coming days.
Posted Aug 20, 2018 - 21:22 UTC
Monitoring
Performance with Block Storage volumes in our Newark data center has returned to normal. We will continue to monitor the situation, and we do not expect any future issues at this time.
Posted Aug 20, 2018 - 20:06 UTC
Update
We have taken action to correct the performance issues with Block Storage volumes in our Newark data center, and we believe performance will normalize for all customers shortly. We will provide another update here soon.
Posted Aug 20, 2018 - 19:29 UTC
Identified
We've identified the cause of this issue, and we're working as quickly as possible to restore normal Block Storage service in Newark.
Posted Aug 20, 2018 - 18:50 UTC
Investigating
Our team is investigating performance issues with Block Storage in our Newark data center. We will continue to provide additional updates as the situation develops.
Posted Aug 20, 2018 - 18:36 UTC
This incident affected: Regions (US-East (Newark)).