Downtime Post-Mortem

Now that we’re back up and running, I thought it’d be interesting to go over what happened, how and why we were affected, and what we’ve been working on to prevent such a thing from happening again.

Like many companies, we are hosted on Amazon Web Services, and like many companies, we were hit hard when some of their services went down.

We’re in good company though! Reddit, Foursquare, Heroku, Imgur, Flipboard, and many other companies were hit pretty hard as well. Oh, and part of the actual Amazon Web Services administration console. I’m sure we’re all doing some strategizing tonight.

What Happened?

Today, around 10:30AM, Amazon’s EBS (Elastic Block Storage, their filesystem implementation for servers) and RDS (Relational Database Service) services experienced “degraded performance” and “connectivity issues.” Really, they just flat out became inaccessible. This took our RBCommons, the Review Board website, and, for a period of time, the RBCommons Blog.

Pingdom notified us right away that we were down, and we immediately investigated. It took a while to send out an update as to what happened directly to our users, because that list of users was in the RDS instance we couldn’t reach.

Now the goal of any service in the cloud is to be resilient against such breakages. The whole point is to be able to spin up new instances and just go. So why didn’t we just do that?

Well, before you can do that, your architecture must support it. And for that, your dependencies must support it. You can’t rely upon local filesystem state, for instance. If we were some traditional database-backed app, we’d be fine, but we work very closely with various libraries for accessing repositories (libsvn, for example) and with SSH keys (which OpenSSH looks for on a local filesystem). In fact, RBCommons and Review Board are more than happy to scale out as far as needed, but some of our dependencies are not.

New Architecture

We’ve been preparing for this type of outage, but we just didn’t get there in time. The two large problems, distributed SSH key management and Subversion configuration/cert storage, have been mostly fixed.

We wrote a new SSH interface that can deal with distributed storage for SSH keys. It’s compatible with Subversion, Git, Mercurial, and anything else we need it for. We’re no longer limited to the local filesystem storage for SSH. In fact, we don’t even need it. Any instance we spin up will have full access to the keys, and the keys can be made redundant, all while still being securely stored.

We’ve also worked around Subversion’s requirement for local filesystem access for cert storage and configuration. We now carefully store the configuration and certificates we expect for Subversion within the database and then ensure the local instance has that state before performing a request. We can bring up a new instance and not require any reconfiguration.

Both of these options together will allow us to move to a model where something like an EBS failure won’t hurt us very much. Unfortunately, both are still being tested and we weren’t ready to deploy just yet.

Oh the timing…

Deployment Strategy

Now that we’re back up and running, our focus for the week will be to get the new architecture in and to get everyone’s accounts migrated.

Once that is done, we can begin transitioning off of EBS and using what is called a Local Instance Store. This is a filesystem that exists only for that instance, while it is up and running. They’re populated when we deploy the instance and can safely disappear when we tear it down. This creates a nice little bundle that we can start up as many times as we like, in as many availability zones as we need.

The next step from there is to switch to a multi-availability zone deployment of RDS. This will require a bit more testing, but should increase our resilience against RDS failures.

We’re investigating deploying a service in front of our blog and RBCommons front page that will keep an off-site cache of the main pages so that people can still see the site content while the sites are down. This won’t allow you to review code, but at least you’ll know there’s a problem and see some indication of what’s going on.

We’re also looking into the best way to provide an RBCommons health status page to show you the current state of our services and a description of what’s going on, so you’ll have something to furiously reload to keep you informed.

Cloud infrastructure is going to go down from time to time. There’s no avoiding that. We’re hopeful that when that happens next time, none of you will notice.

Downtime Post-Mortem

Related

Christian Hammond