Budgeting for Error

What's your uptime SLA over the last month, six months, year? Do you know off the top of your head? Is your kneejerk response, "as close to 100% as possible"? Consider this: by not knowing your true current SLA, you not only turn a blind eye to a critical success metrics for your systems, you also remove the ability to budget within the margins of that metric.

There are about 8765 hours in a year. How many of those hours do you believe your code absolutely needs to be up to keep your business successful? 99% of the time buys you almost 88 hours of downtime over the course of a year. 99.9% of the time still buys you almost 9 hours. Even 99.99% gives you about 52 minutes a year that your systems can be down. Think of what you can do with these minutes. Note that the much-vaunted "5 9s" reliability (99.999%) breaks down to 8 minutes over the course of a year, which is great if you're Google or the phone company, but probably not a smart goal for your average startup.

Let's say you know that your deployment process is rock-solid without outage and you will never need planned hardware downtime due to the way you've architected your systems. But you also know that you have some risky features that you want to push now, before you announce a critical partnership that should result in a big membership bump. If you're sure that the bump won't cause downtime, you might choose to push the features and risk some downtime in smoothing out rough edges on the code so that you have a really compelling site for those new members.

On the other hand, if you know that you're pretty solid under your current load but a 50% increase in usage has the potential for some degree of system failure, your error budget might not accommodate both the risky new features and the membership growth. And if your business pushes you to do both the risky new features and the growth risk? Make sure they know that your SLA may suffer as a consequence. When you know your goal SLA, and you know something is likely to reduce or violate it, that's a strong signal that you should think carefully about the risks of the project. This can also be a useful negotiation tool when being pushed to implement a feature you don't think is ready for prime time. When they say we need to release this new feature today, which means at least two hours of downtime that pushes you out of SLA, it becomes their job to get authorization from the CTO instead of your job to convince them why it is a bad idea.

I will admit that I do not currently have an uptime SLA for my services. Up until recently, it never occurred to me that there would be any value to trying to pin down a number and measure to it. As a result, while liveness and stability is always a consideration, I haven't taken the time to think through the rest of the year when it comes to hardware upgrades, new features, or deployment risk as measured by likely downtime impact. I'm missing out on a key success metric for my infrastructure.

Once I've managed to nail down a course-grained uptime SLA for my systems, the next phase of this work is to nail down a more fine-grained response time SLA. Of 100 requests, what is the 95th percentile response time from my infrastructure services? This is much trickier than a simple uptime SLA due to the interaction of multiple systems each with their own SLAs. For now though, I need to focus on the big picture.

Quick Wins: Monitoring Request Times in Play with Coda Metrics

My twitter feed has been abuzz about coda metrics for a while now. I decided to finally bite the bullet and try it out, and the result was a very nice quick win for our code base.

We're still using Play at work, and we have a service about to go into production that we've been monitoring through the oh-so-elegant method of "writing log messages". This is fine, but it doesn't tell you how long various request types are taking on average without doing a bit of log parsing, and I'm not much of a scripter.

Today, I promised that I would provide something slightly better to measure how our various endpoints are doing. Cue coda. I've been looking at it on and off for a couple of days, but kept getting hung up on wanting to do things like use the EhCache metrics gathering (not trivial in Play at first glance). Going back to basics, I decided after some thinking that the histograms would be the best thing to use. We were already grabbing method execution time for logging purposes, so all I had to do was insert that into a histogram and it would track the running times. Simple enough. But I want to create these histograms for each method type, and ideally, I just want to put it into our superclass controller that is already set up to capture the method timings and log.

Fortunately, Play has lots of nice information floating around in the "request" object of its controllers. Using that object, I can see what Controller subclass this request is destined for, as well as the method that will be called on that class. So I have enough information to create the histogram for each method, like so:

Histogram histo = Metrics.newHistogram(request.controllerClass, request.actionMethod, "requests");
Great. But I was a little tired, and thought that I needed to keep these around, so I stuck them in a ConcurrentHashMap associated with a unique key based on the controller class and the action method. Turns out though, if you look in the MetricsRegistry source code, you'll see that in fact you don't really need to do this at all. As long as the "MetricName" that would be generated for your metric is the same, the same metric will be used for the monitoring. Now THAT is the kind of clever code I like to see.

I decided to keep my ConcurrentHashMap around anyway, to save myself the (utterly trivial) overhead of creating the various objects passed in to the registry by newHistogram. The resulting code is embarrassingly simple. So simple, in fact, I wanted to make it more complicated and it took me 3 revisions to realize how little code I actually needed.

Here is the resulting BaseController, on GitHub, in a skeleton Play application.

I'm a bit sleep-deprived, so if I missed something, be sure to leave a comment or hit me up on twitter!