December 28th issues with Assets

kakaroto · December 28, 2020, 11:44pm

Hi everyone,

Today, there was an issue with the S3 storage provider that the Assets Library relies on, and it was out of our control.
The issue caused problems for people trying to access their assets or upload new images to their assets library. It also affected content from the Bazaar, making some modules or systems fail to load all their files, which could have caused games to break beyond just having missing assets.
Thankfully, we caught and were able to mitigate the issue quickly, and there shouldn’t be any problems any more, but we’re keeping a close eye on things to react quickly if any more issues happen.

As for the details of the situation (as I know a lot of you are geeks like us, and like to know the technical details), the Assets Library is a Forge concept, and is a layer that abstracts away access to an S3 storage server where all the assets are actually stored.

The S3 storage provider is currently having issues with their DNS. As a result, any content that wasn’t already cached in the CDN would fail to load, because The Forge servers couldn’t reach the S3 servers. The issue started at around 2020-12-28T18:45:00Z. By the time we understood that an issue was happening, that it was widespread and persistent, and had tracked down its root cause, it was already 2020-12-28T19:30:00Z.
The cause was thankfully quite simple: the S3 servers had disappeared from the DNS records for some reason, so whenever The Forge Assets downloader (a service that is part of the Assets Library stack, acting as a proxy between the CDN service and the S3 service) tried to download a file from S3, it failed to even resolve the hostname in the URL.
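
As a rough illustration of that failure mode (this is not the actual Forge code), here is a minimal Python sketch of a resolution check. The function name `resolve_ips` and its `resolver` parameter are made up for the example. When a hostname has no DNS record, the lookup raises `socket.gaierror` before any connection is even attempted, which is what "failing to resolve" means here.

```python
import socket


def resolve_ips(hostname, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for hostname, or [] if the lookup fails."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # A missing DNS record surfaces here: the name cannot be turned
        # into an address, so no request to the server is ever made.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in results})
```

Calling `resolve_ips("s3.example.com")` (a placeholder hostname, not the real provider's) from an affected server would return an empty list while the records are missing.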

The solution was thankfully quite simple as well: after figuring out how to add a custom host alias to the Kubernetes deployment, I set a DNS override mapping the S3 hostname to the IP address it should resolve to (which, thankfully, my Chrome browser had cached somehow, so I was able to open a link to it and see which IP it was connecting to).
Once that was deployed, all assets were back in working order, at around 2020-12-28T19:40:00Z.
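
For the curious, a host alias in Kubernetes is set through the `hostAliases` field of the pod spec, which adds entries to the pod's /etc/hosts file. The sketch below only builds what such a patch could look like; the helper name `host_alias_patch`, the hostname and the IP address are placeholders for illustration, not the values that were actually deployed.

```python
import json


def host_alias_patch(hostname, ip):
    """Build a Deployment patch that adds one /etc/hosts entry to every pod."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # hostAliases is the PodSpec field that injects extra
                    # entries into the pod's /etc/hosts file.
                    "hostAliases": [{"ip": ip, "hostnames": [hostname]}]
                }
            }
        }
    }


# Placeholder values: 203.0.113.10 is a documentation-only address and
# s3.example.com is not the real provider hostname.
patch = host_alias_patch("s3.example.com", "203.0.113.10")
print(json.dumps(patch, indent=2))
```

A patch like this could be applied to a deployment with `kubectl patch`. Since it changes the pod template, Kubernetes rolls out new pods that carry the override.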

Unfortunately, I missed the fact that I needed to apply the same hack to the Forge Assets uploader service (a different portion of the stack, running as a separate Kubernetes deployment, which handles uploads), until people started to report issues uploading some files. This was less critical, as it only affected uploads, and only uploads of unique files: The Forge has optimizations in place so that files already existing in the S3 storage get copied server side instead of being uploaded again, so it took a little while before we noticed that this problem was happening. This was fixed at 2020-12-28T21:48:00Z.
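
To show why only unique files were affected, here is a toy Python sketch of that kind of optimization. It is an assumption-laden illustration, not The Forge's implementation: `FakeStore` is an in-memory stand-in for the object storage, and detecting duplicates with a SHA-256 content hash is my guess at one way to do it.

```python
import hashlib


class FakeStore:
    """In-memory stand-in for an object store; not a real S3 client."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes
        self.objects = {}  # user-visible key -> content hash
        self.uploads = 0   # how many times the bytes had to be uploaded

    def put(self, key, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blobs:
            # Content already stored: reference it with a server-side copy,
            # so nothing needs to be uploaded.
            self.objects[key] = digest
            return "copied"
        # New content: the bytes have to go through the upload path,
        # which is the path that was failing during the incident.
        self.blobs[digest] = data
        self.objects[key] = digest
        self.uploads += 1
        return "uploaded"


store = FakeStore()
print(store.put("alice/map.png", b"same bytes"))  # uploaded
print(store.put("bob/map.png", b"same bytes"))    # copied
print(store.put("bob/token.png", b"new bytes"))   # uploaded
```

In this model, only the first and third calls depend on the upload path, which matches the observation that uploads of files already in storage kept working.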

The issue started over 5 hours ago, and the S3 service provider is still having issues (we’re only working around them for the time being). This is their current status page:

We’re keeping an eye on it, so once they are back to working order and we can confirm the issue with the DNS is fixed, we’ll be removing the temporary workaround of hardcoding the DNS entry in the /etc/hosts file in all our deployments and servers.

I’d like to give a big thank you to @aeristoka who has been the main fire juggler today on Discord and handled everyone’s questions and concerns and made me aware of the problem so I could take care of it.

I hope this helps everyone understand what happened today, and I hope it wasn’t too disruptive to your games,
Thanks everyone, and happy rolling!