On August 25, 2022 at 01:33 UTC, Linode’s Support team received multiple automated alert messages about Managed Database clusters reporting a failure state. After a brief initial investigation, the problem was verified at 01:36 UTC, and a deeper investigation was underway by 01:38 UTC.
Linode formally initiated its Incident Response Procedure at 01:44 UTC and published a status page at 01:56 UTC. The investigation identified the root cause as the failure of an internal messaging service, caused by insufficient computing resources on its host systems. This messaging service is used when provisioning new Managed Database clusters, resizing existing clusters, and performing internal cluster failover; its failure would not have affected the ongoing SQL operations of Managed Database clusters.
Once these host systems received additional resources, the messaging service recovered at 02:19 UTC. To verify the operability of the messaging service, Linode began deploying a new database cluster at 02:22 UTC, and the deployment completed successfully at 02:55 UTC. This is an expected amount of time for the full provisioning of a new Managed Database cluster.
The status page was updated to reflect this fix at 02:59 UTC. An extended monitoring period followed to ensure that no other issues were occurring. All systems for Managed Databases remained operational throughout this period, and the incident was considered fully resolved at 03:59 UTC.
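For readers who want the elapsed times between these events, the following is a minimal sketch in Python that derives them from the timestamps quoted above. The event labels and the `minutes_between` helper are hypothetical names chosen for illustration; they are not part of any Linode tooling. It assumes all events fall on the same UTC day, as they do here.

```python
from datetime import datetime

# Timestamps (UTC, August 25, 2022) copied from the narrative above.
# The dictionary keys are illustrative labels, not official event names.
EVENTS = {
    "alerts_received": "01:33",
    "incident_response_started": "01:44",
    "status_page_published": "01:56",
    "messaging_service_recovered": "02:19",
    "test_deploy_started": "02:22",
    "test_deploy_completed": "02:55",
    "incident_resolved": "03:59",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_recovery = minutes_between(
    EVENTS["alerts_received"], EVENTS["messaging_service_recovered"]
)
test_deploy_duration = minutes_between(
    EVENTS["test_deploy_started"], EVENTS["test_deploy_completed"]
)
time_to_resolution = minutes_between(
    EVENTS["alerts_received"], EVENTS["incident_resolved"]
)

print(f"Alerts to service recovery: {time_to_recovery} min")
print(f"Test cluster provisioning:  {test_deploy_duration} min")
print(f"Alerts to full resolution:  {time_to_resolution} min")
```

Under these figures, the messaging service recovered 46 minutes after the first alerts, the verification deployment took 33 minutes, and the incident was closed 146 minutes after the first alerts.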
To prevent this type of issue from occurring again, we will be taking a number of actions:
Although it was not directly within the scope of this incident, we will also be reviewing our personnel protocols for internally communicating maintenance and issues related to our Managed Database service. This will ensure that we can take quick action on any future issues which affect its serviceability.
Timeline: