A look at the recent service instability

Update: The Forge is still migrating service providers, but in the meantime we have found improved stability by using dedicated servers on DigitalOcean. If you are currently experiencing issues with your games, they are likely unrelated to the issues described below. We suggest contacting Forge Support for any ongoing problems. You can also try troubleshooting on your own with the guides below:

Hi everyone,

I wanted to discuss some of the issues and service disruptions we’ve had over the past few weeks, explain their causes, and describe the mitigations we’ve implemented to prevent them from happening again.

First of all, I’d like to apologize to anyone who was impacted by this. I want you to know that providing a good and reliable service is our highest priority, and we are taking every measure to make sure these types of issues are kept to a minimum. For those who have been with us for the past year, you already know that this is a rare occurrence, and for those who have only recently joined The Forge, know that this is not representative of our services.

Summary of the issue

The issues we’ve experienced recently are sudden lag and slowness, an inability to access the website, or users getting disconnected from their games and being unable to get back in. These things can happen, as servers and the internet are fickle by nature, but for them to happen for extended periods (sometimes 30 minutes of lag) and for multiple days in a row is unacceptable.

I will delve into the technical details surrounding the issues, the measures we took to prevent the slowdowns from happening again, and how we plan to fix them in the long term. If you are not interested in the technical details, know this: we are well aware of the issues, the major part of the problem is caused by our service provider, we are moving to a different provider because of it, and in the meantime we’ve set up a system to mitigate the impact. Fixing this is our highest priority, and it should already be fixed for the most part, as evidenced by the last couple of days, which went rather smoothly with very few problems despite being the most active times of the week.
So in short, don’t worry, we got this handled!

Technical details

I am a technical person, and I know that while many Forge users are not tech-savvy, a large portion of our customers are very knowledgeable about these things (they simply don’t want to deal with it themselves and choose us to do it for them), so I want to explain the issues in detail, for the sake of transparency with our customers.

Let’s break it down into the main causes we’ve determined, in order of importance:

DigitalOcean service provider not providing the service

Our service provider, DigitalOcean, has been a relatively good provider for us over the past year. We’ve had a few hiccups, some networking issues, some server issues, but in the past 3 weeks or so we’ve seen an increase in problems, specifically related to CPU load. DigitalOcean simply does not seem to have the capacity to serve all of their customers, and their “Standard plans”, which use shared CPU hosting, have become unusable. Our CPU load is generally under 1.0 and our CPU usage floats around 10-20%, so a shared CPU makes perfect sense: we don’t do deep learning, video processing, or anything CPU-intensive, and when other VMs use the CPU, we usually only lose a couple of percentage points, maybe 5%.

Lately, whenever we’ve had people complaining about the Forge website becoming slow, I would check the servers and see a CPU load of 30 or 40, which is insane. Our usual 20-millisecond response time becomes 30 or 50 seconds, and requests sometimes even time out after 2 minutes. CPU usage shows 80% or 90% of the CPU being stolen by the hypervisor to give to other VMs. To me, that’s not “shared CPU”, that’s “unusable CPU”, and those issues would last for hours. I ended up keeping an eye on it far too often, and when I noticed a machine becoming overcommitted (DigitalOcean selling more VMs than they have the physical resources to serve), I would manually recycle the node so the server processes moved to a different physical machine that wasn’t overcommitted. This week, however, it appeared that all of the physical hardware DigitalOcean had was overcommitted, even their “premium” machines, and when those problems started happening I was stuck. I would be load balancing over 20 VMs (compared to the usual 2 or 3), and they’d all be completely frozen and unusable under extremely high CPU load.
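For those who want to check this on their own VMs, steal time is visible from inside the guest. Here is a minimal sketch (not our monitoring code; the 30-second window and the 50% threshold are arbitrary values for illustration) that samples /proc/stat on Linux and reports how much CPU the hypervisor gave away to other tenants:

```python
#!/usr/bin/env python3
"""Rough check for an overcommitted VM: measure CPU "steal" time from /proc/stat."""
import time

def cpu_times():
    # First line of /proc/stat looks like:
    # cpu  user nice system idle iowait irq softirq steal guest guest_nice
    with open("/proc/stat") as f:
        fields = [int(v) for v in f.readline().split()[1:9]]
    return sum(fields), fields[7]  # (total jiffies, steal jiffies)

def steal_percentage(interval: float = 30.0) -> float:
    total1, steal1 = cpu_times()
    time.sleep(interval)
    total2, steal2 = cpu_times()
    delta = total2 - total1
    return 100.0 * (steal2 - steal1) / delta if delta else 0.0

if __name__ == "__main__":
    pct = steal_percentage()
    print(f"CPU stolen by the hypervisor over the window: {pct:.1f}%")
    # On a healthy shared-CPU plan this stays in the low single digits;
    # 50% is just an arbitrary "this node looks overcommitted" alarm.
    if pct > 50:
        print("This node looks badly overcommitted; consider recycling it.")
```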

The solution we’ve temporarily implemented is to upgrade to dedicated servers for our proxies, which are overpriced and should be completely unnecessary (apart from the fact that the standard plans are unusable). It feels to me as though DigitalOcean is holding my service hostage by not providing the service we pay for, effectively saying “if you actually want it to be usable, you now need to pay for our overpriced plans”. That makes me unhappy, and it’s the final drop in a bucket of increasing issues I’ve had with them.

As I’ve started talking about these issues, I’ve heard from a couple of people that they’ve had similar experiences and are also looking to move away from DigitalOcean, or have already moved their servers.
The big task ahead of us now is deciding where to move. Evaluating the various options is not easy, especially given our infrastructure requirements; we’ve been evaluating them for a little while and no definite decision has been made quite yet. What I can say, however, is that I am confident we’d be able to perform the move with no downtime, maintenance window, or anything of the sort.

Scaling algorithm/threshold tuning

Another factor is that The Forge automatically scales its servers based on demand: when traffic is high, we increase the number of servers to accommodate the sudden increase in traffic. This has generally worked well enough, but some tweaking was still needed. The small amount of fine-tuning that was necessary became a much larger issue with the problems above: the autoscaler would try to scale up when 50% of the CPU is used, but only 10% of the CPU was actually being used (with 90% stolen by noisy neighbors), so the algorithm couldn’t do its job properly. And when it did scale up, it expected the new server to be able to handle the increased load, which wouldn’t be the case because the new server would also be overcommitted, so it would create yet another server, which would be just as useless.
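To make the problem concrete, here is a simplified, hypothetical sketch of a steal-aware scale-up check; the class, function names, and 50% target are illustrative, not our actual autoscaler code. The point is that the decision has to be based on the CPU the VM can actually get, not on the raw usage number:

```python
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    cpu_used_pct: float   # % of CPU spent doing our own work
    cpu_steal_pct: float  # % of CPU stolen by the hypervisor for other VMs

def should_scale_up(nodes: list[NodeMetrics], target_pct: float = 50.0) -> bool:
    """Scale up when our work is close to saturating the CPU we can actually get.

    With a naive check on cpu_used_pct alone, a node doing 10% of work while 90%
    is stolen looks idle, even though it has no headroom left at all.
    """
    for node in nodes:
        usable = max(100.0 - node.cpu_steal_pct, 1.0)  # CPU actually available to us
        effective_load = 100.0 * node.cpu_used_pct / usable
        if effective_load >= target_pct:
            return True
    return False

# Example: 10% used with 85% stolen means only 15% is usable,
# so we are really at ~67% of our effective capacity.
print(should_scale_up([NodeMetrics(cpu_used_pct=10.0, cpu_steal_pct=85.0)]))  # True
```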

Changing to dedicated servers has helped a lot, but there was still some fine-tuning to do, so I’ve tweaked the autoscaler settings to make it react much more quickly and to keep all the servers under as little load as possible. I most likely went a bit too far on the thresholds, playing it safe and using far more resources than needed, but it has done an excellent job and the Forge site has never been so fast and responsive!

Peak time & reloads

Another issue is that people often play at the same time. Let’s say Sunday at 8PM: most worlds will be accessed within a few minutes of 8:00, 8:30, 9:00, etc., which causes sudden, short peaks at specific times of the day. When traffic suddenly increases unexpectedly, the autoscaling algorithm will create new machines to handle it, and those take a minute or two to be provisioned; in the meantime, the site will be slow to respond for those couple of minutes until it finishes scaling up. That’s unfortunately unavoidable at this time, but the fine-tuning I mentioned above helps it react faster and minimizes that impact, since it now scales up while the current load is still far from full capacity. Similarly, what we’ve had happen in the last couple of weeks is that there would be a sudden lag or interruption of service, and everyone who noticed their games not responding would reload the page, creating even higher demand and making it even harder to react; it becomes a vicious circle until the site is brought down by its own users.
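As a rough back-of-the-envelope illustration (the growth rate and thresholds here are made-up numbers, not our real traffic figures), this is why the scale-up trigger has to leave headroom for the provisioning delay:

```python
# Hypothetical numbers for illustration only.
provisioning_minutes = 2          # time for a new VM to come online
peak_growth_per_minute = 0.25     # traffic growing ~25% per minute at "rush hour"

# If we only trigger at 90% utilization, by the time the new server is ready
# the existing ones have already been pushed past 100% and requests pile up:
trigger_at = 0.90
load_when_ready = trigger_at * (1 + peak_growth_per_minute) ** provisioning_minutes
print(f"Triggering at 90%: load is ~{load_when_ready:.0%} when the new server arrives")

# Triggering at 50% leaves enough headroom to absorb the peak while provisioning:
trigger_at = 0.50
load_when_ready = trigger_at * (1 + peak_growth_per_minute) ** provisioning_minutes
print(f"Triggering at 50%: load is ~{load_when_ready:.0%} when the new server arrives")
```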

Here’s a screenshot I took on Saturday (around 7:58PM EST), before I changed the autoscaler thresholds, showing when the virtual machines had been created. You can see that a single machine was enough for a while, then a second one was created a few hours later, then 5 hours later a third was needed, then at 6PM two new servers were needed, and then at 7:25, suddenly, 9 machines had to be created within a 2-minute interval.


10 minutes later, we were back to 6 servers.
When that happened, a couple of people asked in Discord “is the site slow?”, and it was (though still usable), but within 2 or 3 minutes it was back to being super fast.

Storage cluster

This is not so much of an issue, but we did have two instances this month where the storage cluster had problems and caused servers to be inaccessible. In those instances, I cannot blame the interruption of service on DigitalOcean. Those were the times when some of you would see an error saying your game could not start (as opposed to the site being extremely slow to load).
Those issues are on us, and I’ve been quick to react and fix them (thanks to @aeristoka and @Kevin calling and waking me when users started reporting the issue on Discord).
The first instance lasted an hour or two and was caused by one server misbehaving and blocking all access to the files. The interruption dragged on because the problem kept coming back as things got fixed, until I shut that server down, and everything worked fine afterwards. I still don’t know what caused it, but it could have been faulty hardware.
The second issue was a quick fix: the Ceph cluster mount lost its connection to the cluster on one of the servers. It only affected a small subset of users and was quickly rectified. I’d like to blame DigitalOcean again for this (it is quite possible), but I don’t know exactly why the connection to the cluster was lost; either way, it was something out of my power to prevent, and all I could do was react to it. One thing that makes these issues particularly problematic is Foundry’s single-tenant design, which I’ll touch on in a second.
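Before that, for the technically curious, here is a minimal sketch of the kind of watchdog that can at least detect a stalled shared-storage mount quickly. It is not our actual monitoring code, and the mount path, timeout, and interval are placeholder values:

```python
#!/usr/bin/env python3
"""Watchdog sketch: alert when a shared-storage mount stops responding."""
import os
import threading
import time

MOUNT_POINT = "/mnt/shared-data"   # placeholder path, not our real layout
PROBE_TIMEOUT = 10                 # seconds before we consider the mount stalled
CHECK_INTERVAL = 60                # seconds between probes

def probe(path: str, result: dict) -> None:
    try:
        # A directory listing forces a round-trip to the storage cluster;
        # on a healthy mount this returns in milliseconds.
        os.listdir(path)
        result["ok"] = True
    except OSError as err:
        result["error"] = err

def check_once() -> None:
    result: dict = {}
    # Daemon thread, so a probe stuck in uninterruptible I/O can't block shutdown.
    t = threading.Thread(target=probe, args=(MOUNT_POINT, result), daemon=True)
    t.start()
    t.join(PROBE_TIMEOUT)
    if t.is_alive():
        print(f"ALERT: {MOUNT_POINT} did not respond within {PROBE_TIMEOUT}s")
    elif "error" in result:
        print(f"ALERT: probing {MOUNT_POINT} failed: {result['error']}")

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(CHECK_INTERVAL)
```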

Foundry’s single-tenant process

This is a problem that, unfortunately, we can’t do anything about, and it is the cause of an inherent potential loss of stability. Let me explain: most online services run on a cluster where your request can reach any of dozens or hundreds of servers and will be handled and responded to. If one server has an issue, crashes, becomes unresponsive, etc. (things that can always happen for a variety of reasons), then your requests are still answered by another server. In the worst case, you see an error, you reload the page, and now it works.
With the Forge site itself, that is already the case and the system works well. When it comes to the Foundry game itself, however, that’s not the case: you don’t have dozens of Foundry servers able to handle any request from anyone; you have a single Foundry instance that can only handle requests for your specific game. This means that if that Foundry server freezes, or if the machine it’s on has an issue (it loses connection to the rest of the network, or hits an error such as the storage problem mentioned above), then your Foundry game suffers from it, and if you reload the page, you still get sent to the exact same Foundry server. In those situations it’s much harder to recover, and we can’t redirect you to another server, because Foundry locks the data folder so you can’t run a second server on another machine.

This brings a unique challenge that unfortunately makes it much harder to respond to these situations in a way that shields users from the occasional server issues that creep up.
I do have a possible way to mitigate this: I would need to implement a recovery system that detects those situations and acts on them automatically, but that’s not a simple task and it might not handle all the use cases properly.
In the meantime, we ask users to “stop and start your Foundry server”, which forces the problematic Foundry server to be shut down and recreated on a different machine; that is usually enough to correct the issue.
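To give an idea of what that recovery system could look like, here is a very simplified, hypothetical sketch; none of the helper functions below exist in our codebase, they just stand in for the real orchestration layer. The idea is to do automatically what the manual “stop and start” does today:

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

UNHEALTHY_THRESHOLD = 3   # consecutive failed checks before we intervene
CHECK_INTERVAL = 30       # seconds between health checks

# Hypothetical helpers standing in for the real provisioning/orchestration code.
def list_running_instances() -> list[dict]:
    # Would query the orchestration layer for running Foundry instances.
    return []

def stop_instance(instance: dict) -> None:
    # Would tear down the container/VM running this instance.
    pass

def start_instance_on_healthy_node(instance: dict) -> None:
    # Would recreate the instance on a machine that passes health checks.
    pass

def is_responding(instance: dict) -> bool:
    try:
        # Foundry answers HTTP on its game URL; a successful response counts as alive.
        urlopen(instance["url"], timeout=5)
        return True
    except (URLError, OSError):
        return False

def watch() -> None:
    failures: dict[str, int] = {}
    while True:
        for instance in list_running_instances():
            key = instance["id"]
            if is_responding(instance):
                failures[key] = 0
                continue
            failures[key] = failures.get(key, 0) + 1
            if failures[key] >= UNHEALTHY_THRESHOLD:
                # Equivalent to the manual "stop and start": shut the instance
                # down (releasing Foundry's folder lock) and recreate it on a
                # machine that is known to be healthy.
                stop_instance(instance)
                start_instance_on_healthy_node(instance)
                failures[key] = 0
        time.sleep(CHECK_INTERVAL)
```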

Conclusion

That’s about it: all the various components and issues that piled up to cause the recent problems we’ve had. I’m happy to say that things are much more stable now. We’re working on moving our infrastructure to a different service provider, which should hopefully eliminate these problems for the long term, and in the meantime we’ve fine-tuned our autoscaling algorithms and mitigated DigitalOcean’s current lack of service, so until the migration happens we should be safe. The migration itself should happen with no downtime and no impact on existing games.

Thanks for trusting us with your games, and we hope that this helps you better understand the situations we’ve had to deal with in the past weeks!


Thank you very much for all your hard work and just as importantly your transparency on these issues. So glad I made the change last year from other VTTs.


Thanks @kakaroto for taking the time to give us this detailed explanation, much appreciated (and always interesting from a technical perspective)!
Now I know I’d rather start my games at XX:15 or YY:45 to avoid the “rush hour”! :wink:

Hope you’ll find a better provider soon, so we could play many more awesome games on your service!

Keep up the good work!

Hello Leak, we’re glad you’re proactive about this, but we just wanted to make it clear that we are paying extra at the moment to make sure this doesn’t happen again while we’re with the current service provider. So please feel free to schedule your games for whenever works for your group.

Hi @destinygrey, my comment on the “rush hour” was totally genuine and not at all sarcastic. I’m sorry if you understood it that way.
I just meant that it’s a good tip to know: if I start my game a few minutes later or earlier (which won’t make a big difference for us), it could help ease the load on the server and optimize my players’ experience.
For my part, I have rarely run into any of the problems described by @kakaroto, except for an expired SSL certificate recently, and that issue was resolved in no time!
To sum up, I’m really satisfied with the Forge services! :slightly_smiling_face:
