Datacenter Outage - Fremont
Incident Report for Linode
Postmortem

At approximately 01:30 UTC, on May 30, 2015, the power utility (PG&E) experienced an outage affecting our Fremont datacenter. Seven of the facility’s eight generators started correctly and provided uninterrupted power. Unfortunately, one generator experienced an electromechanical failure and failed to start. This caused an outage which affected our entire deployment in Fremont.

PG&E remained in contact and gave an initial estimated time to restoration (ETR) of 04:30 UTC for utility power. This was later revised to 05:00 UTC and then to 06:30 UTC. Utility power was actually restored at 06:05 UTC.

The maintenance vendor for the generator dispatched a technician to the datacenter, who determined that a battery used for starting the generator had failed under load. The technician subsequently replaced the batteries. The generators are tested monthly, and the failed generator passed all of its checks two weeks prior to the outage. It was also tested under load earlier in the month.

The UPS system and its batteries did not suffer a failure.

As soon as the outage occurred, Linode engineers verified that it was indeed power-related and remained on standby for over four hours waiting for power to be restored. Critical Linode infrastructure was made operational immediately after power was restored, and then customer Linodes were booted.

Several servers did not survive the sudden loss of power and needed individual attention. Linode engineers continued working well after power was restored to repair these systems and make them operational again, using both hot and cold spare components. We were able to recover every system.

Linode apologizes for this power interruption and any inconvenience it has caused you. We sincerely appreciate your business and are committed to providing the best service possible. Our colocation provider is in the process of reevaluating its maintenance procedures and adding tests for this battery condition.

Posted Jun 03, 2015 - 03:16 UTC

Resolved
At this time, all hardware issues in Fremont have been resolved, and all Linodes should be booting. If you're experiencing any issues booting your Linode, please open a support ticket.

Some Linodes in Fremont are experiencing network connectivity issues. If yours is affected, rebooting your Linode should resolve them. If the issues persist after the reboot, please open a support ticket.
Posted May 30, 2015 - 17:13 UTC
Monitoring
We are still actively working on a couple of servers and will open support tickets with affected customers. Utility power has been restored and the faulty generator is now fully functional. We will continue to monitor for issues and await an official reason for outage (RFO) from the datacenter, which we will post early next week.
Posted May 30, 2015 - 13:48 UTC
Update
We have about 13 damaged servers left and are working on transplanting their drives to hot-spare servers.
Posted May 30, 2015 - 09:08 UTC
Update
Most Linodes in Fremont should be booted at this time. We continue to work on the servers that were damaged due to the power outage.
Posted May 30, 2015 - 08:04 UTC
Update
Power was restored at approximately 11:10PM PDT and Linodes are booting. There may be several servers that need special attention due to the power failure and we are investigating those at this time.
Posted May 30, 2015 - 06:50 UTC
Update
Our upstream provider has provided a new estimated time to restoration of 11:30PM PDT.
Posted May 30, 2015 - 05:08 UTC
Update
Our upstream provider has provided a new estimated time to restoration of 10:00PM PDT.
Posted May 30, 2015 - 04:37 UTC
Update
At approximately 6:30PM PDT, the Fremont datacenter experienced a power utility outage. One out of eight generators also experienced an electromechanical failure. The estimated time to restoration is currently 9:30PM PDT. Linode has all hands on deck to get Linodes online as soon as power is restored.
Posted May 30, 2015 - 03:21 UTC
Update
We have received word from our colocation provider that there has been a power event in a section of the Fremont datacenter. The affected space is where a critical part of the datacenter network is located. The datacenter's electric provider is working with staff to restore power as soon as possible. Please watch this status page for further updates.
Posted May 30, 2015 - 02:25 UTC
Identified
We are aware of an issue within our Fremont datacenter and are investigating at this time. We will provide additional information as it becomes available.
Posted May 30, 2015 - 01:47 UTC