Downtime Post-Mortem

Now that we’re back up and running, I thought it’d be interesting to go over what happened, how and why we were affected, and what we’ve been working on to prevent such a thing from happening again.

Like many companies, we are hosted on Amazon Web Services, and like many companies, we were hit hard when some of their services went down.

We’re in good company though! Reddit, Foursquare, Heroku, Imgur, Flipboard, and many other companies were hit pretty hard as well. Oh, and part of the actual Amazon Web Services administration console. I’m sure we’re all doing some strategizing tonight.

 

What Happened?

Today, around 10:30AM, Amazon’s EBS (Elastic Block Store, the block storage volumes our servers use for their filesystems) and RDS (Relational Database Service) services experienced “degraded performance” and “connectivity issues.” Really, they just flat out became inaccessible. This took down RBCommons, the Review Board website, and, for a period of time, the RBCommons Blog.

Pingdom notified us right away that we were down, and we immediately investigated. It took a while to send an update directly to our users about what had happened, because that list of users was in the RDS instance we couldn’t reach.

Now the goal of any service in the cloud is to be resilient against such breakages. The whole point is to be able to spin up new instances and just go. So why didn’t we just do that?

Well, before you can do that, your architecture must support it. And for that, your dependencies must support it. You can’t rely upon local filesystem state, for instance. If we were some traditional database-backed app, we’d be fine, but we work very closely with various libraries for accessing repositories (libsvn, for example) and with SSH keys (which OpenSSH looks for on a local filesystem). In fact, RBCommons and Review Board are more than happy to scale out as far as needed, but some of our dependencies are not.

 

New Architecture

We’ve been preparing for this type of outage, but we just didn’t get there in time. The two large problems, distributed SSH key management and Subversion configuration/cert storage, have been mostly fixed.

We wrote a new SSH interface that can deal with distributed storage for SSH keys. It’s compatible with Subversion, Git, Mercurial, and anything else we need it for. We’re no longer limited to the local filesystem storage for SSH. In fact, we don’t even need it. Any instance we spin up will have full access to the keys, and the keys can be made redundant, all while still being securely stored.
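To make the idea concrete, here is a minimal sketch of what a pluggable key storage layer could look like. It is not our actual code: the `KeyStorage` and `SharedKeyStorage` names, the method names, and the plain dict standing in for a shared, redundant store are all assumptions for illustration.

```python
from abc import ABC, abstractmethod


class KeyStorage(ABC):
    """Hypothetical interface the SSH layer talks to instead of ~/.ssh."""

    @abstractmethod
    def read_user_key(self, team):
        """Return the private key data for a team, or None if unset."""

    @abstractmethod
    def write_user_key(self, team, key_data):
        """Store the private key data for a team."""


class SharedKeyStorage(KeyStorage):
    """Keeps keys in a shared backend rather than on the local disk.

    `backend` is any dict-like object. A real deployment would use a
    replicated, encrypted store; a dict is only a stand-in here.
    """

    def __init__(self, backend):
        self._backend = backend

    def read_user_key(self, team):
        return self._backend.get(team)

    def write_user_key(self, team, key_data):
        self._backend[team] = key_data


# Two "instances" pointed at the same backend see the same keys.
shared_backend = {}
instance_a = SharedKeyStorage(shared_backend)
instance_b = SharedKeyStorage(shared_backend)
instance_a.write_user_key("example-team", "PRIVATE KEY DATA")
```

Because the SSH code only ever talks to the interface, a freshly started instance needs nothing on its own disk to reach a repository.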

We’ve also worked around Subversion’s requirement for local filesystem access for cert storage and configuration. We now carefully store the configuration and certificates we expect for Subversion within the database and then ensure the local instance has that state before performing a request. We can bring up a new instance and not require any reconfiguration.
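The "ensure the local instance has that state" step can be sketched roughly as follows. This is an illustration under assumptions, not our implementation: `ensure_local_state` is a hypothetical helper, the dict stands in for rows loaded from the database, and the file names in the usage lines are only examples.

```python
import os
import tempfile


def ensure_local_state(stored_files, target_dir):
    """Make target_dir match the stored configuration before a request.

    stored_files maps a relative path to its expected bytes (a stand-in
    for what would be loaded from the database). Files that already match
    are left alone. Returns the relative paths that were written.
    """
    written = []
    for rel_path, content in sorted(stored_files.items()):
        dest = os.path.join(target_dir, rel_path)
        if os.path.exists(dest):
            with open(dest, "rb") as fp:
                if fp.read() == content:
                    continue  # Local copy is already up to date.
        dest_dir = os.path.dirname(dest)
        os.makedirs(dest_dir, exist_ok=True)
        # Write to a temporary file and rename it into place, so another
        # process never reads a half-written file.
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
        with os.fdopen(fd, "wb") as fp:
            fp.write(content)
        os.replace(tmp_path, dest)
        written.append(rel_path)
    return written


# Example: a brand new instance starts with an empty config directory.
config_dir = tempfile.mkdtemp()
stored = {
    "servers": b"[global]\n",
    "auth/svn.ssl.server/example": b"certificate data",
}
first_run = ensure_local_state(stored, config_dir)
second_run = ensure_local_state(stored, config_dir)
```

The first call writes everything and the second writes nothing, which is what keeps the per-request check cheap once an instance is warm.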

Both of these changes together will allow us to move to a model where something like an EBS failure won’t hurt us very much. Unfortunately, both are still being tested and we weren’t ready to deploy just yet.

Oh the timing…

 

Deployment Strategy

Now that we’re back up and running, our focus for the week will be to get the new architecture in and to get everyone’s accounts migrated.

Once that is done, we can begin transitioning off of EBS and onto what Amazon calls an instance store. This is a filesystem that exists only for that instance, while it is up and running. It’s populated when we deploy the instance and can safely disappear when we tear it down. This creates a nice little bundle that we can start up as many times as we like, in as many availability zones as we need.

The next step from there is to switch to a multi-availability zone deployment of RDS. This will require a bit more testing, but should increase our resilience against RDS failures.

We’re investigating deploying a service in front of our blog and RBCommons front page that will keep an off-site cache of the main pages so that people can still see the site content while the sites are down. This won’t allow you to review code, but at least you’ll know there’s a problem and see some indication of what’s going on.
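As a rough sketch of the fallback behavior we have in mind (the `serve_page` function, its parameters, and the dict used as the cache are all hypothetical, and a real service would sit in front of the site rather than inside it):

```python
def serve_page(path, fetch_origin, cache):
    """Return (body, is_stale) for a page.

    fetch_origin is a callable that fetches the live page and raises
    OSError when the site is unreachable. cache is a dict-like off-site
    store of the last good copy of each page.
    """
    try:
        body = fetch_origin(path)
    except OSError:
        if path in cache:
            # Site is down: serve the last good copy and flag it as stale
            # so the page can show a notice about the outage.
            return cache[path], True
        raise
    cache[path] = body  # Refresh the cached copy on every success.
    return body, False


def healthy_origin(path):
    return "<html>front page</html>"


def broken_origin(path):
    raise ConnectionError("origin unreachable")


page_cache = {}
live_result = serve_page("/", healthy_origin, page_cache)
outage_result = serve_page("/", broken_origin, page_cache)
```

During an outage the visitor gets the cached page with the stale flag set, rather than a connection error.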

We’re also looking into the best way to provide an RBCommons health status page to show you the current state of our services and a description of what’s going on, so you’ll have something to furiously reload to keep you informed.

Cloud infrastructure is going to go down from time to time. There’s no avoiding that. We’re hopeful that when that happens next time, none of you will notice.


Bitten by downtime – Here’s what’s next

Updated 4:18PM: We came back up just over an hour ago, but were not totally reliable, as Amazon’s services were still under maintenance. Things seem to have now stabilized, and you should be able to use RBCommons reliably again.

 

A few minutes ago, we were alerted to the fact that RBCommons is down. We host on Amazon, and it seems the file storage backend we use is temporarily down, which they’re looking into. On the one hand, it means they’re on it, and we should be back up soon. On the other, it means everyone is down right now.

This, of course, sucks.

I expect we’ll be back up pretty soon. In the meantime, here’s what we’re going to do to make this less of a problem in the future.

We’ve been working this past week on enhancing Review Board and RBCommons so that we can more easily scale out across more of what Amazon calls “availability zones.” We’re close, and when we’re done, we’ll be spreading out our services and adding some more redundancy so this won’t bite us again.

We’re also going to work to become less reliant on EBS, so that if it goes down again, it won’t impact us, or you.

If you were impacted by this today, e-mail us and let us know. We’ll work to make you happy.


New sleek redesign

Last night we deployed some new changes to our landing page, pricing page, and overall RBCommons site styles. The new landing page gives new users a small tour of the RBCommons features. We’ll update this as new features are introduced.

We decided it was time to give RBCommons a bit of a visual refresh, so this was the start of that. You’ll begin to see this on many of the pages. The code review sections are still as they were, but in time we’ll refresh these as well to better match.

Now unfortunately, this came at a cost. We had a temporary issue with registration that lasted a few hours. Our credit card integration broke and new teams couldn’t sign up. I’m very sorry if this affected you. If you were affected, please let us know, and try again.


Let’s Keep In Touch :)

RBCommons has been growing up. Once upon a time, we had only a few early teams signed up, and back then e-mail worked just fine for support and announcements. These days, it’s another story, and we really needed to give you all a better insight into what’s going on with RBCommons.

We’re now giving you two easy ways to do that.

The first is the new RBCommons blog (what you’re reading here!). You can easily subscribe to our feed, or just, you know, set this as your home page. Why not.

The second is the new @RBCommons Twitter account. Any new posts here will go there, along with any emergency announcements, or whatever we feel like posting.

And as always, if you ever have any questions or comments, you can easily contact us.

Say hi sometime!
