Service Issue - Managed Databases

Incident Report for Linode

Postmortem

On August 25, 2022 at 01:33 UTC, Linode’s Support team received multiple automated alerts about Managed Database clusters reporting a failure state. After a brief initial investigation, the problem was confirmed at 01:36 UTC, and remediation work began at 01:38 UTC.

Linode formally initiated its Incident Response Procedure at 01:44 UTC and published a status page at 01:56 UTC. The investigation found the root cause to be the failure of an internal messaging service due to insufficient computing resources on its host systems. This messaging service is used when provisioning new Managed Database clusters, resizing existing clusters, and performing internal cluster failover; its failure would not have affected the ongoing SQL operations of Managed Database clusters.

After additional resources were allocated to these host systems, the messaging service recovered at 02:19 UTC. To verify that the messaging service was operating correctly, Linode began deploying a new database cluster at 02:22 UTC; the deployment completed successfully at 02:55 UTC, 33 minutes later. This is an expected amount of time for the full provisioning of a new Managed Database cluster.

The status page was updated to reflect the fix at 02:59 UTC. An extended monitoring period followed to confirm that no other issues were occurring. All systems for Managed Databases remained operational throughout this period, and the incident was considered fully resolved at 03:59 UTC.

To prevent this type of issue from occurring again, we will take the following actions:

We will improve monitoring of the messaging service so that this type of failure is detected in the future.
We will upgrade the messaging system during a scheduled maintenance and review its resource allocation.

Although it was not directly within the scope of this incident, we will also review our internal protocols for communicating maintenance and issues related to our Managed Database service. This will ensure that we can act quickly on any future issues that affect its serviceability.

Timeline:

Aug 25 01:33 UTC: Support Leadership raises awareness of new clusters failing to provision
Aug 25 01:36 UTC: Serviceability issue confirmed internally
Aug 25 01:38 UTC: Work begins to remediate the issue
Aug 25 01:56 UTC: Status page created with Investigating status
Aug 25 02:19 UTC: Messaging service responsible for outage starts working again
Aug 25 02:22 UTC: New cluster for testing the fix begins creation process
Aug 25 02:55 UTC: Cluster creation completes successfully
Aug 25 02:59 UTC: Status page set to Monitoring status
Aug 25 03:59 UTC: Status page set to Resolved status

Posted Sep 14, 2022 - 21:35 UTC

Resolved

This incident has been resolved.

Posted Aug 25, 2022 - 03:59 UTC

Update

We are continuing to monitor for any further issues.

Posted Aug 25, 2022 - 03:05 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Aug 25, 2022 - 02:59 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Aug 25, 2022 - 02:35 UTC

Investigating

Our team is investigating a service issue affecting the Managed Databases service. During this time, users may experience issues when attempting to use these systems.

We will share additional updates as we have more information.

Posted Aug 25, 2022 - 01:56 UTC

This incident affected: Managed Databases.