Site update: Faster pages, new features

October 28, 2012 (updated December 28, 2017) Christian Hammond

Tonight, we deployed a major update to the underlying Review Board code that powers RBCommons. We’ve been holding off on updating in order to get some widespread testing for some pretty significant speed improvements we made, and I’m pleased to say they’re ready.

Speed!

This is the first thing you can expect when you next use the site. It should be a lot faster. We’ve sped up every page, and you’ll see the biggest differences when loading review requests (long ones in particular) and the diff viewer.

The smarts of the diff viewer have been improved, so we better handle very large diffs and diffs with very large lines. This used to take forever, but not anymore!

We did some performance testing on these new changes and saw some crazy drops in page render times. Average-sized review request pages went from ~500ms to ~180ms. Large ones with dozens of reviews dropped from ~3 seconds to ~300ms. Diffs saw savings of a few seconds on average or, in pathological cases, as much as a minute.

Smart timestamps

Every time we show a date or time on a page now, you can always be sure that it’s relative to now. Sit and stare at a page for a few minutes and you’ll see time tick away.
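For the curious, here is a rough idea of what a relative timestamp boils down to. This is only a minimal sketch, not the code running on the site: the `relative_time` function and its thresholds are made up for illustration, and a live page would presumably re-run something like this on a timer in the browser.

```python
from datetime import datetime, timedelta, timezone


def relative_time(then, now=None):
    """Describe `then` relative to `now`, e.g. '5 minutes ago'.

    Hypothetical helper for illustration; the units and thresholds
    are assumptions, not what RBCommons actually uses.
    """
    now = now or datetime.now(timezone.utc)
    seconds = int((now - then).total_seconds())

    # Anything under a minute (or in the future) is shown as "just now".
    if seconds < 60:
        return "just now"

    # Pick the largest unit that fits at least once.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            count = seconds // size
            plural = "" if count == 1 else "s"
            return f"{count} {unit}{plural} ago"


if __name__ == "__main__":
    posted = datetime.now(timezone.utc) - timedelta(minutes=5)
    print(relative_time(posted))
```

Calling it again a minute later gives a different answer for the same timestamp, which is the whole point.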

Incremental diff expansion

This is my personal favorite new feature. Our diff headers (the brown parts that show the function or class name) have just gotten much more useful.

We now have little expand buttons on these headers. You can expand the diff by 20 lines at a time, or all the way up to the displayed function or class. Or you can still expand the whole thing if you need to.

This is one of those small things that should save you a lot of time.
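To make the three expansion choices concrete, here is a small sketch of the arithmetic involved. It assumes a collapsed region of hidden lines sitting directly above a visible chunk; the function name, the `mode` values, and the `header_line` parameter are all hypothetical and are not taken from Review Board's code.

```python
def expand_collapsed(start, end, mode, header_line=None, step=20):
    """Return the (first, last) hidden lines to reveal, inclusive.

    `start`..`end` is the collapsed region above the visible chunk.
    Illustrative only: names and modes are assumptions.
    """
    if mode == "step":
        # Reveal up to `step` lines closest to the visible chunk.
        return (max(start, end - step + 1), end)
    if mode == "header" and header_line is not None:
        # Reveal everything from the function/class line downward,
        # clamped so it stays inside the collapsed region.
        return (max(start, min(header_line, end)), end)
    # Anything else: reveal the whole collapsed region.
    return (start, end)


if __name__ == "__main__":
    print(expand_collapsed(1, 100, "step"))
    print(expand_collapsed(1, 100, "header", header_line=60))
    print(expand_collapsed(1, 100, "all"))
```

Revealing lines from the bottom of the collapsed region upward keeps the newly shown context attached to the change you were already reading.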

Bug fixes galore

This update also comes with dozens of bug fixes. Many of these fix compatibility issues we had with certain types of diffs. A few highlights:

Git diffs with binary files, Subversion diffs with property changes, and Mercurial diffs with spaces in the filename are all now working.
Subversion diffs that had broken $Keyword$ fields don’t break in the diff viewer anymore.
Git diffs created with format-patch no longer have their extra information (such as a commit description) stripped after uploading.
Better support for Mercurial, particularly on Google Code.
When using parent diffs, new files are no longer styled incorrectly in the diff viewer.

If you run into any new problems, contact us and we’ll work to address them quickly.

Downtime Post-Mortem

October 22, 2012 (updated December 28, 2017) Christian Hammond

Now that we’re back up and running, I thought it’d be interesting to go over what happened, how and why we were affected, and what we’ve been working on to prevent such a thing from happening again.

Like many companies, we are hosted on Amazon Web Services, and like many companies, we were hit hard when some of their services went down.

We’re in good company though! Reddit, Foursquare, Heroku, Imgur, Flipboard, and many other companies were hit pretty hard as well. Oh, and part of the actual Amazon Web Services administration console. I’m sure we’re all doing some strategizing tonight.

What Happened?

Today, around 10:30AM, Amazon’s EBS (Elastic Block Store, their storage volumes for servers) and RDS (Relational Database Service) services experienced “degraded performance” and “connectivity issues.” Really, they just flat out became inaccessible. This took down RBCommons, the Review Board website, and, for a period of time, the RBCommons Blog.

Pingdom notified us right away that we were down, and we immediately investigated. It took a while to send an update directly to our users about what happened, because that list of users was in the RDS instance we couldn’t reach.

Now the goal of any service in the cloud is to be resilient against such breakages. The whole point is to be able to spin up new instances and just go. So why didn’t we just do that?

Well, before you can do that, your architecture must support it. And for that, your dependencies must support it. You can’t rely upon local filesystem state, for instance. If we were some traditional database-backed app, we’d be fine, but we work very closely with various libraries for accessing repositories (libsvn, for example) and with SSH keys (which OpenSSH looks for on a local filesystem). In fact, RBCommons and Review Board are more than happy to scale out as far as needed, but some of our dependencies are not.

New Architecture

We’ve been preparing for this type of outage, but we just didn’t get there in time. The two large problems, distributed SSH key management and Subversion configuration/cert storage, have been mostly fixed.

We wrote a new SSH interface that can deal with distributed storage for SSH keys. It’s compatible with Subversion, Git, Mercurial, and anything else we need it for. We’re no longer limited to the local filesystem storage for SSH. In fact, we don’t even need it. Any instance we spin up will have full access to the keys, and the keys can be made redundant, all while still being securely stored.

We’ve also worked around Subversion’s requirement for local filesystem access for cert storage and configuration. We now carefully store the configuration and certificates we expect for Subversion within the database and then ensure the local instance has that state before performing a request. We can bring up a new instance and not require any reconfiguration.
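The "ensure the local instance has that state" step can be pictured roughly like this. It is a minimal sketch under stated assumptions, not our actual implementation: `sync_config_dir` is a hypothetical helper, and `stored_files` stands in for whatever the database returns (a mapping of relative path to file contents).

```python
import os
import tempfile


def sync_config_dir(stored_files, config_dir):
    """Make `config_dir` contain every file in `stored_files`.

    `stored_files` maps a relative path to its bytes, as loaded from
    the database (assumed shape). Returns the relative paths that had
    to be written. Illustrative sketch only.
    """
    written = []
    for rel_path, content in stored_files.items():
        dest = os.path.join(config_dir, rel_path)

        # Skip files that already match what the database holds.
        if os.path.exists(dest):
            with open(dest, "rb") as f:
                if f.read() == content:
                    continue

        dest_dir = os.path.dirname(dest)
        os.makedirs(dest_dir, exist_ok=True)

        # Write to a temp file and rename it into place, so a request
        # running at the same time never sees a half-written file.
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        os.replace(tmp_path, dest)
        written.append(rel_path)
    return written


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    files = {"servers": b"[global]\n", "auth/svn.ssl.server/example": b"cert"}
    print(sync_config_dir(files, demo_dir))  # first run writes both files
    print(sync_config_dir(files, demo_dir))  # second run has nothing to do
```

On a freshly started instance the first request pays the cost of writing the files; after that the check is just a few small reads.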

Both of these changes together will allow us to move to a model where something like an EBS failure won’t hurt us very much. Unfortunately, both are still being tested and we weren’t ready to deploy just yet.

Oh the timing…

Deployment Strategy

Now that we’re back up and running, our focus for the week will be to get the new architecture in and to get everyone’s accounts migrated.

Once that is done, we can begin transitioning off of EBS and onto what is called a Local Instance Store. This is a filesystem that exists only for that instance, and only while it is up and running. It’s populated when we deploy the instance and can safely disappear when we tear it down. This creates a nice little bundle that we can start up as many times as we like, in as many availability zones as we need.

The next step from there is to switch to a multi-availability zone deployment of RDS. This will require a bit more testing, but should increase our resilience against RDS failures.

We’re investigating deploying a service in front of our blog and RBCommons front page that will keep an off-site cache of the main pages so that people can still see the site content while the sites are down. This won’t allow you to review code, but at least you’ll know there’s a problem and see some indication of what’s going on.

We’re also looking into the best way to provide an RBCommons health status page to show you the current state of our services and a description of what’s going on, so you’ll have something to furiously reload to keep you informed.

Cloud infrastructure is going to go down from time to time. There’s no avoiding that. We’re hopeful that when that happens next time, none of you will notice.

Bitten by downtime – Here’s what’s next

October 22, 2012 (updated December 28, 2017) Christian Hammond

Updated 4:18PM: We came back up just over an hour ago, but were not totally reliable, as Amazon’s services were still under maintenance. Things seem to have now stabilized, and you should be able to use RBCommons reliably again.

A few minutes ago, we were alerted to the fact that RBCommons is down. We host on Amazon, and it seems the file storage backend we use is temporarily down, which they’re looking into. On the one hand, it means they’re on it, and we should be back up soon. On the other, it means everyone is down right now.

This, of course, sucks.

I expect we’ll be back up pretty soon. In the meantime, here’s what we’re going to do to make this less of a problem in the future.

We’ve been working this past week on enhancing Review Board and RBCommons so that we can more easily scale out across more of what Amazon calls “availability zones.” We’re close, and when we’re done, we’ll be spreading out our services and adding some more redundancy so this won’t bite us again.

We’re also going to work to become less reliant on EBS, so that if it goes down again, it won’t impact us, or you.

If you were impacted by this today, e-mail us and let us know. We’ll work to make you happy.

New sleek redesign

October 11, 2012 (updated December 28, 2017) Christian Hammond

Last night we deployed some new changes to our landing page, pricing page, and overall RBCommons site styles. The new landing page gives new users a small tour of the RBCommons features. We’ll update this as new features are introduced.

We decided it was time to give RBCommons a bit of a visual refresh, so this was the start of that. You’ll begin to see this on many of the pages. The code review sections are still as they were, but in time we’ll refresh these as well to better match.

Now unfortunately, this came at a cost. We had a temporary issue with registration that lasted a few hours. Our credit card integration broke and new teams couldn’t sign up. I’m very sorry if this affected you. If you were affected, please let us know, and try again.

Let’s Keep In Touch :)

October 9, 2012 Christian Hammond

RBCommons has been growing up. Once upon a time, we had only a few early teams signed up, and back then e-mail worked just fine for support and announcements. These days, it’s another story, and we really needed to give you all better insight into what’s going on with RBCommons.

We’re now giving you two easy ways to do that.

The first is the new RBCommons blog (what you’re reading here!). You can easily subscribe to our feed, or just, you know, set this as your home page. Why not.

The second is the new @RBCommons Twitter account. Any new posts here will go there, along with any emergency announcements, or whatever we feel like posting.

And as always, if you ever have any questions or comments, you can easily contact us.

Say hi sometime!