October 7th service disruption post-mortem

kakaroto · October 9, 2020, 12:00pm

Hi everyone,

I will start by offering my sincere apologies for the major disruption to our services that occurred last Wednesday, especially to users in the North American region. While we strive to do our best, outages like this can happen. So far I’ve usually been quick to anticipate and solve problems, but not this time, and I apologize for that. While internet connectivity and server uptime have improved a lot in recent years, it is still not a perfect landscape.

Keeping it short and sweet

A summary of the situation is that an unforeseen issue from our service provider on October 7th caused a crash of our storage cluster, rendering the entire service inaccessible on the North American servers. We have worked to remedy the situation and taken measures to prevent a similar situation from happening in the future.

At The Forge, we’re gamers just like you and service outages are hard to swallow–especially when it’s game time. For this reason, if your game was unduly impacted by the outage on October 7th, we’d like to offer a special one-time credit you can apply against your next bill. We know this can’t replace the enjoyment and memories your Forge VTT games bring, but we hope it helps restore your confidence in us. So if you were directly impacted by the outage and were forced to cancel your planned game, let us know.

I would also like to thank all of you who have reported the issue and have been so understanding of the situation. I know that such an outage can be frustrating but I haven’t seen any angry posts; all I’ve received is love and understanding from you all and that has made a huge difference for me. I am humbled by your patience, so thank you very much.

What happened?

It is hard to say with full certainty what exactly caused the problem in the first place, but our preliminary investigation shows that the service provider for the datacenter where The Forge servers are located (Digital Ocean) had network issues within their virtual private network. This set off a chain reaction that caused the storage cluster, where all the data is stored, to suddenly become unavailable. With the drives inaccessible, all the servers froze indefinitely while trying to read from disk and became unable to respond to anyone’s requests.

Technical details

For those looking to understand the more detailed technical aspects of the issue, keep reading; otherwise, skip this section.

The network latency issues in the private network caused our Ceph storage cluster’s OSDs (Object Storage Device, basically the servers that store the data on the hard drives) to slowly start failing to communicate with one another. OSDs and other Ceph-related services started showing slow heartbeat times (20, 30, 60, 120+ seconds) from one to the other, until most of the cluster had trouble communicating.
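To make the heartbeat numbers above more concrete, here is a minimal sketch of how heartbeat delays could be sorted into "ok", "slow" and "down". It is not Ceph's implementation: the `Heartbeat` type, the function names and both thresholds are hypothetical, chosen only to line up with the 20 and 120 second figures mentioned above.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds in seconds; these are NOT Ceph's real defaults.
SLOW_AFTER = 20.0
DOWN_AFTER = 120.0


@dataclass
class Heartbeat:
    src: str            # node reporting the heartbeat
    dst: str            # peer it is trying to reach
    age_seconds: float  # time since src last heard from dst


def classify(hb: Heartbeat) -> str:
    """Return 'ok', 'slow' or 'down' for a single heartbeat."""
    if hb.age_seconds >= DOWN_AFTER:
        return "down"
    if hb.age_seconds >= SLOW_AFTER:
        return "slow"
    return "ok"


def unhealthy_fraction(heartbeats: List[Heartbeat]) -> float:
    """Fraction of heartbeats that are not 'ok' (0.0 for an empty list)."""
    if not heartbeats:
        return 0.0
    bad = sum(1 for hb in heartbeats if classify(hb) != "ok")
    return bad / len(heartbeats)
```

With heartbeat ages of 5, 30, 60 and 120 seconds, three of the four links count as unhealthy, which is the kind of picture where most of the cluster has trouble communicating.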

The Ceph Manager process, whose task is to keep the storage cluster in working order at all times, was unable to communicate with some machines and marked them as “host is offline”. When that happens, the Ceph Manager starts duplicating the missing data from the replicated copies on the other OSDs until the host comes back online, or tries to restart the OSD server in case it thinks it is frozen. Unfortunately, in some very rare cases, I’ve seen the manager crash instead when it’s having difficulty communicating with other nodes of the cluster. It’s usually not a big deal because it is automatically restarted and can continue monitoring and repairing the cluster as if nothing had happened. In our case though, it crashed again because of a bug in the Ceph Manager code, it got restarted, then crashed again… 6 more times!
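A single automatic restart is harmless, but repeated crashes in a short period are a signal worth alerting on. The sketch below shows one generic way to detect such a crash loop by counting restarts inside a sliding time window. It is an illustration only, not something Ceph or The Forge is known to use: the class name, `max_restarts` and `window_seconds` are all hypothetical.

```python
from collections import deque


class CrashLoopDetector:
    """Flags a crash loop when too many restarts happen within a time window.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """

    def __init__(self, max_restarts: int = 3, window_seconds: float = 600.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def record_restart(self, now: float) -> bool:
        """Record a restart at time `now` (seconds).

        Returns True if the number of restarts inside the window now
        exceeds `max_restarts`, i.e. this looks like a crash loop.
        """
        self._restarts.append(now)
        # Forget restarts that fell out of the sliding window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

Feeding it a restart every minute, with the defaults shown, would report a crash loop from the fourth restart onward, while restarts spaced far apart would never trigger it.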

It seems very likely that, in order to recover from the network issue, the Ceph Manager stopped OSD processes so it could restart them. Failing to communicate with an OSD could mean a network issue, or it could mean that the process itself had frozen in an infinite loop. Since the manager had no way of knowing which, it tried to recover by restarting the failing OSDs.

We believe that once the OSDs were stopped, and before the manager could start them up again, the manager itself crashed (again). The manager did get restarted but it had lost its state, so when it found the OSDs halted, it assumed that had been done on purpose and went back to normal without trying to restart them.

In conclusion, it seems that the restart task of the manager was interrupted midway and never finished due to a bug in the Ceph code itself. The end result was 7 OSD servers being stopped completely with nothing to boot them back up, and with nothing to serve data from, any process trying to access the files from the storage cluster would freeze waiting for data to become available.
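The sequence described above (stop the OSD, crash before starting it again, come back with no memory of the pending restart) can be reproduced in a toy model. The sketch below is not Ceph code and does not claim to show how Ceph stores its state. `ToyManager` and `simulate` are hypothetical names, and the `journal` is an assumption added only to show the contrast: when the intent to restart is written down somewhere that survives the crash, a fresh manager can finish the job.

```python
from typing import Dict, Optional, Set


class ToyManager:
    """Toy model of a manager restarting OSDs; NOT Ceph's actual logic."""

    def __init__(self, osds: Dict[str, str], journal: Optional[Set[str]] = None):
        self.osds = osds        # name -> "up" or "stopped"; survives manager crashes
        self.journal = journal  # pending restarts that survive crashes (None = not kept)

    def restart_osd(self, name: str, crash_midway: bool = False) -> None:
        if self.journal is not None:
            self.journal.add(name)       # record the intent before acting
        self.osds[name] = "stopped"
        if crash_midway:
            raise RuntimeError("manager crashed between stop and start")
        self.osds[name] = "up"
        if self.journal is not None:
            self.journal.discard(name)   # restart finished, clear the intent

    def recover(self) -> None:
        """Run by a freshly started manager after a crash."""
        if self.journal is None:
            return  # no record of intent: stopped OSDs look deliberate
        for name in list(self.journal):
            self.osds[name] = "up"
            self.journal.discard(name)


def simulate(with_journal: bool) -> Dict[str, str]:
    osds = {"osd.0": "up", "osd.1": "up"}
    journal: Optional[Set[str]] = set() if with_journal else None
    try:
        ToyManager(osds, journal).restart_osd("osd.1", crash_midway=True)
    except RuntimeError:
        pass  # the manager process died; its in-memory state is gone
    ToyManager(osds, journal).recover()  # a new manager process takes over
    return osds
```

Calling `simulate(False)` leaves `osd.1` stopped with nothing left to start it, which mirrors the outcome described above, while `simulate(True)` brings it back up.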

What steps did you take to mitigate such an issue in the future?

Two issues occurred on October 7th. The first is the one I’ve just explained above in great detail: it’s an issue with the storage cluster itself, and there isn’t much I can do about it. The second issue is that this problem happened while I was asleep, and I am the only one with administrative access to the systems and the technical know-how of the entire infrastructure to correct the problem.

Storage cluster corrections

The first thing I did after I restored the cluster to working order and finished my investigation was to update the entire cluster to the latest Ceph release, 15.2.5, which came out 3 weeks ago. I had already upgraded to 15.2.4 last month, and I don’t know if this version fixes this particular bug, but let’s hope it does. I’ve also subscribed to the ceph-announce mailing list so I don’t miss any new release announcements.

This was certainly a sort of freak accident, a race condition that never should have happened, because the Ceph Manager’s entire task is to recover from issues like that and to make sure the cluster is always in working order. Unfortunately, there’s no guarantee that it won’t happen again. Long term, my plan is to move the infrastructure to Kubernetes, which would be much more resilient and battle-tested against these kinds of situations. For those who have been following, that infrastructure change was already on my Roadmap, so this doesn’t change.

Availability corrections

The second issue is my lack of availability and, more specifically, the fact that the moderators and staff had trouble getting in touch with me. I had already been planning to set up a system for them to reach me in case of emergencies, but we hadn’t finalized that process. Now we have, and if anything similar were to happen again, our response time would be much, much faster.

Ideally, I would like to have another person capable of handling potential server issues so I’m not the single point of failure, but it is difficult to find someone who could be trusted with access to the servers (trusted with people’s data, trusted not to make mistakes, trusted to know how to solve these situations, etc.).

I hope this helps clarify the situation and restore your confidence in us, if any was lost.

Thank you,
KaKaRoTo.