Service Issue - Object Storage - Newark
Incident Report for Linode
Postmortem

In October and early November of 2022, Linode experienced five incidents that impacted the availability and performance of our Object Storage service. This postmortem covers the causes behind these incidents, the steps we took to resolve them, and the steps we are taking to prevent future occurrences.

Incidents on 10/10 and 10/16 in Newark

Hardware issues resulting in 504 Gateway Timeout errors

Link to 10/10 incident: https://status.linode.com/incidents/9hz03j85m94w

Link to 10/16 incident: https://status.linode.com/incidents/4343n25b03nf

During the week of 10/10, Linode Object Storage suffered two incidents resulting in reduced performance and availability. These two incidents were related, with the second incident on 10/16 occurring while we were still investigating the previous incident on 10/10.

During our investigation, we discovered hardware issues with one of our back-end Object Storage nodes that manifested as service-wide issues. Because this was not an outright hardware failure, the affected back-end infrastructure repeatedly flapped up and down. This inconsistent stability caused excessive recovery operations in the cluster, resulting in degraded performance and, in certain cases, causing our front-end gateways to return 504 Gateway Timeout errors to customers.

At 10:31 UTC on October 16, an outright hardware failure occurred on our back-end Object Storage infrastructure, requiring us to remove the failed node from the cluster. Once it was removed, the cluster recovered, client I/O was no longer bottlenecked, and our front-end gateways stabilized and resumed serving clients normally.
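
For customers accessing Object Storage through S3-compatible clients, transient gateway errors such as the 504s described above can often be absorbed with retry configuration. Below is a minimal sketch using boto3; the retry mode and values shown are general-purpose suggestions rather than specific recommendations from us, and the endpoint URL and bucket name are illustrative placeholders.

    # Minimal sketch: configure boto3 to retry transient 5xx responses
    # (including 504 Gateway Timeout) with backoff. Credentials are assumed
    # to come from the environment; the endpoint and bucket are placeholders.
    import boto3
    from botocore.config import Config

    retry_config = Config(
        retries={"max_attempts": 10, "mode": "standard"},  # "standard" mode retries transient 5xx errors
        connect_timeout=10,
        read_timeout=60,
    )

    s3 = boto3.client(
        "s3",
        endpoint_url="https://us-east-1.linodeobjects.com",  # example endpoint only
        config=retry_config,
    )

    # Listing a bucket is retried automatically if the gateway briefly
    # returns a transient error during a cluster recovery event.
    response = s3.list_objects_v2(Bucket="example-bucket")
    for obj in response.get("Contents", []):
        print(obj["Key"])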

Steps we’re taking

This incident highlighted some gaps in our internal monitoring systems. We are working to improve our detection mechanisms so we can quickly identify and triage subtle issues related to failing or misbehaving hardware before they impact the wider cluster.

We have also made changes to our monitoring so that certain Ceph-related events are described to our on-call administrators more accurately. Our alerting is now more clearly defined, helping administrators narrow down issues more quickly.
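
As a rough illustration of the kind of check that can catch misbehaving hardware early, the sketch below polls Ceph's health checks and flags a few codes that often precede wider cluster impact. It assumes the Ceph CLI is available on the monitoring host; the specific codes watched, and the script itself, are illustrative and do not describe our actual monitoring or alerting.

    # Minimal sketch: poll Ceph health checks and surface codes that often
    # precede wider cluster impact. Illustrative only -- not our production tooling.
    import json
    import subprocess

    WATCHED_CHECKS = {"OSD_DOWN", "SLOW_OPS", "OSD_SLOW_PING_TIME_BACK", "PG_DEGRADED"}

    def check_ceph_health():
        raw = subprocess.run(
            ["ceph", "health", "detail", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        health = json.loads(raw)
        alerts = []
        for code, check in health.get("checks", {}).items():
            if code in WATCHED_CHECKS:
                alerts.append(f"{code}: {check['summary']['message']}")
        return alerts

    if __name__ == "__main__":
        for alert in check_ceph_health():
            print(alert)  # in practice, this would page an on-call administrator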

Incidents on 10/19 and 10/20 in Newark and Atlanta

Scaling and rate limiting issues resulting in performance degradation and 503 Slow Down messages

Link to 10/19 incident: https://status.linode.com/incidents/lgyk3qmvhy20

Link to 10/20 incident: https://status.linode.com/incidents/gqn9s8wwklvl

On 10/19 and 10/20 in our Newark and Atlanta data centers, we saw extremely high inbound traffic to our Object Storage clusters, both in total number of requests and in throughput/bandwidth. As a result, we identified and subsequently triaged multiple issues. These included undesirable tunings and configuration on our back-end cluster nodes, overloaded front-end gateways with inadequate resources to serve requests, deficiencies in our rate limiting implementation, and bugs within Ceph itself.

During these incidents, customers saw degraded performance and reduced availability, as well as an increase in rate limiting/throttling in the form of “503 - Slow Down” HTTP response codes. Additionally, internal systems that depended on our Object Storage service were being rate limited, which impacted the functionality of Object Storage interactions in the Linode Cloud Manager interface.

Early in our investigation, we uncovered multiple issues that together contributed to the underlying incident. First, we discovered that the front-end rate limiting at the edge of our platform was not working as intended, allowing substantially more inbound traffic than it should have. This let excessive load reach our internal Object Storage infrastructure, which triggered our back-end rate limiting and resulted in 503 - Slow Down responses for customers in those Object Storage clusters.

Second, we observed excessive memory consumption within our internal Object Storage infrastructure during these incidents, which contributed to the performance and reliability issues. Our investigation led us to a Ceph bug tracker item that accounts for this abnormal behavior: certain configuration tunables could lead to excessive memory consumption during storage pressure events such as these incidents.

A third issue, directly related to the rate limiting (HTTP 503 - Slow Down) messages, was also uncovered. During periods of peak load, we found that the internal Object Storage infrastructure was at times exceeding two cluster safety thresholds:

  1. The maximum number of concurrent requests we allow the back-end cluster to handle before these infrastructure components start issuing 503 - Slow Down responses.
  2. The maximum throughput (bandwidth) we allow the back-end cluster to sustain before these infrastructure components start issuing 503 - Slow Down responses.

We determined that these back-end tunables were too conservative and that we were prematurely rate limiting clients. Additionally, once rate limiting begins on these infrastructure components, it affects all clients, even those who have not exceeded our advertised throttling limits. When our normal per-bucket front-end rate limiting works as intended, these indiscriminate back-end rate limits should never come into play, hence the changes to the tunables. Internal Linode systems that relied on Object Storage were also intermittently rate limited, which prevented some customers from listing their buckets in the Linode Cloud Manager.
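
To make the two thresholds above concrete, here is a minimal sketch of a back-end safety gate that sheds load once either the concurrent-request cap or the throughput cap is exceeded. The class, its structure, and the values at the end are illustrative assumptions, not our actual implementation or settings.

    # Minimal sketch of a back-end safety gate enforcing two thresholds:
    # a concurrent-request cap and a sustained-throughput cap. Illustrative only.
    import threading
    import time

    class BackendSafetyGate:
        def __init__(self, max_concurrent, max_bytes_per_sec):
            self.max_concurrent = max_concurrent
            self.max_bytes_per_sec = max_bytes_per_sec
            self._in_flight = 0
            self._window_start = time.monotonic()
            self._window_bytes = 0.0
            self._lock = threading.Lock()

        def try_admit(self, request_bytes):
            """Return True if the request may proceed, False for 503 - Slow Down."""
            with self._lock:
                now = time.monotonic()
                if now - self._window_start >= 1.0:   # reset the 1-second throughput window
                    self._window_start = now
                    self._window_bytes = 0.0
                if self._in_flight >= self.max_concurrent:
                    return False                      # threshold 1: too many concurrent requests
                if self._window_bytes + request_bytes > self.max_bytes_per_sec:
                    return False                      # threshold 2: too much sustained throughput
                self._in_flight += 1
                self._window_bytes += request_bytes
                return True

        def release(self):
            with self._lock:
                self._in_flight -= 1

    # Example: shed load beyond 1,024 concurrent requests or 1 GiB/s (illustrative values).
    gate = BackendSafetyGate(max_concurrent=1024, max_bytes_per_sec=1 << 30)

Because a gate like this rejects requests regardless of which client generated the load, it penalizes all clients equally once it trips, which is why the per-bucket front-end limits are meant to engage first.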

Finally, we discovered that some of our internal Object Storage infrastructure was being affected by resource contention, which contributed to inconsistent performance for customers.

Steps we’ve taken

We’ve fixed the rate limiting enforced at the edge of our Object Storage clusters. We were previously allowing clients to do considerably more I/O than the rate limits we advertise (750 requests per second, per bucket): https://www.linode.com/docs/products/storage/object-storage/

Our rate limiting systems now accurately reflect the advertised throttling value of 750 requests/sec.
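
As an illustration of per-bucket enforcement at that advertised limit, here is a minimal token-bucket sketch. It is a simplified, single-process example of the technique, not our actual edge implementation.

    # Minimal sketch: per-bucket token-bucket rate limiting at the advertised
    # 750 requests per second, per bucket. Illustrative only.
    import time
    from collections import defaultdict

    RATE = 750.0   # advertised limit: requests per second, per bucket
    BURST = 750.0  # allow up to roughly one second's worth of burst

    class PerBucketLimiter:
        def __init__(self):
            # bucket name -> (available tokens, timestamp of last refill)
            self._state = defaultdict(lambda: (BURST, time.monotonic()))

        def allow(self, bucket):
            """Return True if the request is admitted, False for 503 - Slow Down."""
            tokens, last = self._state[bucket]
            now = time.monotonic()
            tokens = min(BURST, tokens + (now - last) * RATE)
            if tokens < 1.0:
                self._state[bucket] = (tokens, now)
                return False
            self._state[bucket] = (tokens - 1.0, now)
            return True

    limiter = PerBucketLimiter()
    if not limiter.allow("example-bucket"):
        print("HTTP 503 - Slow Down")  # the client should back off and retry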

We've addressed a Ceph bug that was causing excessive memory consumption on some portions of our internal Object Storage infrastructure. We've adjusted our tunables so that we don't experience runaway memory usage during high-pressure storage events, such as cluster recovery after a hardware or network failure.
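
We have not named the specific tunables involved, so purely as an illustration of the mechanism, the sketch below caps OSD daemon memory through Ceph's centralized configuration using osd_memory_target, one commonly adjusted memory-related option; it requires admin access to the cluster and is not a statement of the exact options or values we changed.

    # Hedged illustration: cap OSD memory via Ceph's central config store.
    # osd_memory_target is used only as an example tunable; the options and
    # values actually adjusted are not named in this report.
    import subprocess

    def set_osd_memory_target(bytes_target):
        # Applies to all OSD daemons through the monitors' central config store.
        subprocess.run(
            ["ceph", "config", "set", "osd", "osd_memory_target", str(bytes_target)],
            check=True,
        )

    if __name__ == "__main__":
        set_osd_memory_target(4 * 1024**3)  # example value: 4 GiB per OSD daemon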

We've also addressed performance issues with internal Object Storage components that were impacted by resource contention. We've moved these components onto less busy, more stable, and higher-performance infrastructure.

Finally, we’ve adjusted a handful of performance tunables on both our internal Object Storage components and the back-end clusters so that we don’t prematurely rate limit customers.

Incident on 11/6 in Newark

Excessive traffic resulting in 503 Slow Down responses

Link to 11/6 incident page: https://status.linode.com/incidents/5wgmd45h7h1j (this page)

On 11/6, Object Storage in our Newark data center started returning 503 - Slow Down responses to customers after experiencing a very large amount of inbound traffic, similar to the previous incidents.

To mitigate this issue, Linode identified and restricted some of this traffic and performed emergency maintenance on the Object Storage infrastructure, adjusting rate-limiting behavior and restarting key infrastructure components. These actions resolved the immediate impact.

To provide further resiliency against future impacts, Linode migrated the remainder of its internal Object Storage infrastructure to higher-performance hardware.

Steps we plan to take in the near future

While we believe our recent changes have stabilized the platform across all of our data centers, we have further plans to make our Object Storage platform more performant and scalable. Upcoming changes in the works include:

  1. More refined throughput/bandwidth throttling
  2. Scaling up our internal Object Storage infrastructure to distribute the load more efficiently
  3. Exploring further opportunities for improving the hardware performance of our internal Object Storage infrastructure  
  4. Fine-tuning our load balancing algorithms to avoid hotspots in the clusters
Posted Nov 30, 2022 - 23:06 UTC

Resolved
We haven’t observed any additional issues with the Object Storage service, and will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.
Posted Nov 07, 2022 - 01:57 UTC
Monitoring
At this time we have been able to correct the issues affecting the Object Storage service. We will be monitoring this to ensure that it remains stable. If you continue to experience problems, please open a Support ticket for assistance.
Posted Nov 06, 2022 - 22:15 UTC
Update
Our team is continuing to investigate this situation. We will keep you informed of updates as they become available.
Posted Nov 06, 2022 - 21:01 UTC
Update
Our team is still continuing to investigate the issue. We will share additional updates as we have more information.
Posted Nov 06, 2022 - 19:44 UTC
Update
Our team is continuing to investigate the issue in our Newark data center. During this time, users may experience connection timeouts and errors for all services deployed in this data center. We will share additional updates as we have more information.
Posted Nov 06, 2022 - 18:54 UTC
Investigating
Our team is investigating an issue affecting the Object Storage service. During this time, users may experience connection timeouts and errors with this service.
Posted Nov 06, 2022 - 17:28 UTC
This incident affected: Object Storage (US-East (Newark) Object Storage).