How Twitter's back-end stays up

Raffi Krikorian is a smart engineer and technical manager; I've know him for decades, but didn't realize that he was currently Vice President for Platform Engineering at Twitter. This interview with him at InfoQ gives a really fascinating flavor for how Twitter's reliability engineering is baked in at the systems level:

So Decider is one of our runtime configuration mechanisms at Twitter. What I mean by that is that we can turn off features and software in Twitter without doing a deploy. So every single service at Twitter is constantly looking to the Decider system as to what are the current runtime values of Twitter right now. How that practically maps is I could say the discover homepage, for example, has a Decider value that wraps it, and that Decider value tells discover whether it's on or off right now.

So I can deploy discover into Twitter and have it deployed in the state that Decider says it should be off. So this point we don't get an inconsistent state. The discover, for example, or any feature at Twitter runs across many machines. It doesn't run on one machine, so you don't want to get in the inconsistent state where some of the machines have the feature and some of them don't. So we can deploy it off using Decider and then when is on all the machines that we want it to be on, we can turn it on atomically across the data center by flipping a Decider switch.

This also gives us the ability to do a percentage-based control. So I can say actually now that it's on all of the machines, I only want 50% of users to get it. I can actually make that decision as opposed to it being a side effect of the way that things are being deployed in Twitter. So this allows us to really have a runtime control over Twitter without having to push code. Pushing code is actually a dangerous thing, like the highest correlation to failure in a system like ours, not just Twitter but any big system, is software development error. So this way we can actually deploy software in a relatively safe way because it's off. Turn it on really slowly, purposely, make sure it's good and then ramp it up as fast as I want.

Interview with Raffi Krikorian on Twitter's Infrastructure

(via O'Reilly Radar)