RBCommons 2.0 is coming this weekend!

August 20, 2014 | December 28, 2017 | Christian Hammond

Updated Saturday, 1:50AM PST: We had some issues with one of the new servers, and had to roll some things back temporarily. This is extending our maintenance window. Hopefully nobody will be too badly affected, but we’ll be down until approximately 5AM PST.

We’re making a huge update to RBCommons this weekend. The site will be down for up to 4 hours starting Friday at 11PM PST, as we begin our upgrade to the all-new RBCommons 2.0.

This new update is based on Review Board 2.0, and brings some major improvements to the dashboard, diff viewer, review request change histories, performance, and more. A few of the new features you can expect include:

Fewer full-page reloads
Faster load times
Better, more accurate interdiffs
Markdown input for all text fields
Indentation markers in diffs
Smarter moved line detection in diffs
A nicer dashboard, which better displays when changes are approved, or if they have pending issues still open
Bulk-closing of review requests through the dashboard
Easy posting of existing commits on your GitHub or Subversion repositories, right from the New Review Request page
Faster posting of changes using RBTools
Better display of exactly what changed in updates to review requests
High-DPI icons for those on Retina or equivalent displays
Review of text-based file attachments

That’s just a few of the features that this release will bring. We’ll go into more detail after everything’s deployed.

On top of this, we’re moving onto much faster servers, which should help with some of the growth spurts we’ve been hitting lately.

So wrap up your work before this Friday at 11PM PST (Saturday, 6AM UTC). We’ll be shutting down the servers for up to 4 hours as we move to the new servers and begin the upgrade. It shouldn’t take the full 4 hours, but we want to allow for any issues that come up.
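For anyone converting the window to their own time zone, here is a minimal Python sketch of the arithmetic. It assumes a UTC-7 offset (Pacific Daylight Time, which California observes in August), since that is the offset implied by 11PM Pacific equalling 6AM UTC. It also assumes "this Friday" is August 22, 2014. The `maintenance_window` helper is a made-up name for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: UTC-7, the offset implied by "11PM ... (Saturday, 6AM UTC)".
PACIFIC = timezone(timedelta(hours=-7), "PDT")


def maintenance_window(start_local, max_hours):
    """Return the start and the latest possible end of the window, in UTC."""
    start_utc = start_local.astimezone(timezone.utc)
    return start_utc, start_utc + timedelta(hours=max_hours)


# Assumption: the Friday after the post date is August 22, 2014.
start, end = maintenance_window(datetime(2014, 8, 22, 23, 0, tzinfo=PACIFIC), 4)
print(start.isoformat(), "to", end.isoformat())
```

Swapping `timezone.utc` for your own zone in `astimezone` gives the window in local time.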

This weekend’s upcoming server maintenance

August 16, 2014 | December 28, 2017 | Christian Hammond

This weekend, we’re beginning a series of upgrades to our infrastructure that should resolve some stability issues we’ve periodically hit with our database server on AWS. It should also help to improve performance across the site.

This work will start Sunday, August 17th at 6AM UTC (that’s Saturday at 11PM PST for those in California). We’re blocking off two hours for the work, during which the site will be down. It shouldn’t take nearly that long, though.

Going forward, we’re gearing up for a big update to RBCommons. Along with this, we’re planning more hardware upgrades that should do a lot to improve performance further. We’re aiming to do this some time in the next two weeks, and we’ll announce the details when we’re closer.

If you are worried that your team is going to be horribly impacted by this maintenance window, please let us know!

Updated Sunday, 12:33AM PST: Maintenance is complete, and we’re back up and running!

Downtime Post-Mortem

October 22, 2012 | December 28, 2017 | Christian Hammond

Now that we’re back up and running, I thought it’d be interesting to go over what happened, how and why we were affected, and what we’ve been working on to prevent such a thing from happening again.

Like many companies, we are hosted on Amazon Web Services, and like many companies, we were hit hard when some of their services went down.

We’re in good company though! Reddit, Foursquare, Heroku, Imgur, Flipboard, and many other companies were hit pretty hard as well. Oh, and part of the actual Amazon Web Services administration console. I’m sure we’re all doing some strategizing tonight.

What Happened?

Today, around 10:30AM, Amazon’s EBS (Elastic Block Store, their persistent disk storage for servers) and RDS (Relational Database Service) experienced “degraded performance” and “connectivity issues.” Really, they just flat out became inaccessible. This took down RBCommons, the Review Board website, and, for a period of time, the RBCommons Blog.

Pingdom notified us right away that we were down, and we immediately investigated. It took a while to send an update about what happened directly to our users, because the list of users was in the RDS instance we couldn’t reach.

Now the goal of any service in the cloud is to be resilient against such breakages. The whole point is to be able to spin up new instances and just go. So why didn’t we just do that?

Well, before you can do that, your architecture must support it. And for that, your dependencies must support it. You can’t rely upon local filesystem state, for instance. If we were some traditional database-backed app, we’d be fine, but we work very closely with various libraries for accessing repositories (libsvn, for example) and with SSH keys (which OpenSSH looks for on a local filesystem). In fact, RBCommons and Review Board are more than happy to scale out as far as needed, but some of our dependencies are not.

New Architecture

We’ve been preparing for this type of outage, but we just didn’t get there in time. The two large problems, distributed SSH key management and Subversion configuration/cert storage, have been mostly fixed.

We wrote a new SSH interface that can deal with distributed storage for SSH keys. It’s compatible with Subversion, Git, Mercurial, and anything else we need it for. We’re no longer limited to the local filesystem storage for SSH. In fact, we don’t even need it. Any instance we spin up will have full access to the keys, and the keys can be made redundant, all while still being securely stored.
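To make the idea of redundant, non-local key storage concrete, here is a minimal Python sketch. It is not the RBCommons implementation. The class names, the in-memory backend and the placeholder key string are all hypothetical, and encryption of the stored keys is left out entirely. It only shows a lookup that falls through to another replica when one is unreachable.

```python
class KeyStoreUnavailable(Exception):
    """Raised when a storage backend cannot be reached."""


class InMemoryKeyStore:
    """Hypothetical stand-in for a networked key store."""

    def __init__(self, keys=None, available=True):
        self._keys = dict(keys or {})
        self.available = available

    def get(self, account):
        if not self.available:
            raise KeyStoreUnavailable("backend unreachable")
        return self._keys.get(account)


class ReplicatedKeyStore:
    """Reads a key from the first replica that responds."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def get(self, account):
        for replica in self.replicas:
            try:
                return replica.get(account)
            except KeyStoreUnavailable:
                continue  # try the next replica
        raise KeyStoreUnavailable("no replica reachable")


# The primary is down, so the lookup is served by the backup.
primary = InMemoryKeyStore({"team-a": "PLACEHOLDER KEY"}, available=False)
backup = InMemoryKeyStore({"team-a": "PLACEHOLDER KEY"})
store = ReplicatedKeyStore([primary, backup])
print(store.get("team-a"))
```

Because nothing here touches the local disk, any newly started instance holding a `ReplicatedKeyStore` could serve the same keys.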

We’ve also worked around Subversion’s requirement for local filesystem access for cert storage and configuration. We now carefully store the configuration and certificates we expect for Subversion within the database and then ensure the local instance has that state before performing a request. We can bring up a new instance and not require any reconfiguration.
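The "ensure the local instance has that state" step can be sketched roughly as follows. This is an illustration under assumptions, not our actual code: the `ensure_local_state` function is hypothetical, a plain dict stands in for the database rows, and the file content is a placeholder. The sketch writes a file only when it is missing or differs from what the database expects.

```python
import hashlib
import tempfile
from pathlib import Path


def ensure_local_state(expected_files, config_dir):
    """Write any missing or outdated files; return the names rewritten."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    rewritten = []
    for name, content in expected_files.items():
        path = config_dir / name
        wanted = hashlib.sha256(content).hexdigest()
        current = (
            hashlib.sha256(path.read_bytes()).hexdigest()
            if path.is_file()
            else None
        )
        if current != wanted:
            path.write_bytes(content)
            rewritten.append(name)
    return rewritten


# Assumption: a dict stands in for the configuration stored in the database.
stored = {"servers": b"# placeholder Subversion servers file\n"}
with tempfile.TemporaryDirectory() as tmp:
    print(ensure_local_state(stored, tmp))  # fresh instance: ['servers']
    print(ensure_local_state(stored, tmp))  # already in sync: []
```

Calling this before each request means a brand-new instance repairs itself on first use, and an instance that is already in sync does no extra writes.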

Both of these options together will allow us to move to a model where something like an EBS failure won’t hurt us very much. Unfortunately, both are still being tested and we weren’t ready to deploy just yet.

Oh the timing…

Deployment Strategy

Now that we’re back up and running, our focus for the week will be to get the new architecture in and to get everyone’s accounts migrated.

Once that is done, we can begin transitioning off of EBS and using what is called a Local Instance Store. This is a filesystem that exists only for a given instance, and only while that instance is up and running. It’s populated when we deploy the instance and can safely disappear when we tear the instance down. This creates a nice little bundle that we can start up as many times as we like, in as many availability zones as we need.

The next step from there is to switch to a multi-availability zone deployment of RDS. This will require a bit more testing, but should increase our resilience against RDS failures.

We’re investigating deploying a service in front of our blog and RBCommons front page that will keep an off-site cache of the main pages so that people can still see the site content while the sites are down. This won’t allow you to review code, but at least you’ll know there’s a problem and see some indication of what’s going on.
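The behavior we want from such a service can be sketched in a few lines of Python. This is only an illustration of the idea, not a product we have chosen. `StaleCache` and the fake `fetch` function are hypothetical names. The cache refreshes its copy whenever the origin answers and falls back to the last good copy when the origin is unreachable.

```python
class StaleCache:
    """Serves the last good copy of a page when the origin is down."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: path -> page body, may raise OSError
        self._pages = {}

    def get(self, path):
        """Return (body, is_stale) for the requested path."""
        try:
            body = self._fetch(path)
        except OSError:
            if path in self._pages:
                return self._pages[path], True
            raise  # nothing cached yet, so the failure is passed on
        self._pages[path] = body
        return body, False


# Assumption: a fake origin whose availability we can toggle.
origin = {"up": True}


def fetch(path):
    if not origin["up"]:
        raise ConnectionError("origin unreachable")
    return f"<html>page for {path}</html>"


cache = StaleCache(fetch)
print(cache.get("/"))  # fresh copy, is_stale is False
origin["up"] = False
print(cache.get("/"))  # same body, is_stale is True
```

The `is_stale` flag is what would let the cached page show a notice that the site is currently having problems.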

We’re also looking into the best way to provide an RBCommons health status page to show you the current state of our services and a description of what’s going on, so you’ll have something to furiously reload to keep you informed.
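As a rough idea of what a status page could be built from, here is a small Python sketch. Everything in it is an assumption made for illustration: the `build_status` function, the service names and the checks themselves. It runs one check per service and summarizes the results as JSON.

```python
import json


def build_status(checks):
    """Run each check and summarize the results as a JSON document."""
    services = {}
    for name, check in checks.items():
        try:
            check()
            services[name] = "up"
        except Exception as exc:
            services[name] = f"down: {exc}"
    all_up = all(state == "up" for state in services.values())
    overall = "ok" if all_up else "degraded"
    return json.dumps({"overall": overall, "services": services}, sort_keys=True)


def database_check():
    # Hypothetical check that simulates an unreachable database.
    raise ConnectionError("cannot reach database")


print(build_status({"web": lambda: None, "database": database_check}))
```

To be useful during an outage like this one, the page would need to be hosted away from the services it reports on.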

Cloud infrastructure is going to go down from time to time. There’s no avoiding that. We’re hopeful that when that happens next time, none of you will notice.

Bitten by downtime – Here’s what’s next

October 22, 2012 | December 28, 2017 | Christian Hammond

Updated 4:18PM: We came back up just over an hour ago, but were not totally reliable, as Amazon’s services were still under maintenance. Things seem to have now stabilized, and you should be able to use RBCommons reliably again.

A few minutes ago, we were alerted to the fact that RBCommons is down. We host on Amazon, and it seems the file storage backend we use is temporarily down, which they’re looking into. On the one hand, it means they’re on it, and we should be back up soon. On the other, it means everyone is down right now.

This, of course, sucks.

I expect we’ll be back up pretty soon. In the meantime, here’s what we’re going to do to make this less of a problem in the future.

We’ve spent this past week enhancing Review Board and RBCommons so that we can more easily scale out across more of what Amazon calls “availability zones.” We’re close, and when we’re done, we’ll be spreading out our services and adding some more redundancy so this won’t bite us again.

We’re also going to work to become less reliant on EBS, so that if it goes down again, it won’t impact us, or you.

If you were impacted by this today, e-mail us and let us know. We’ll work to make you happy.