development

Code Reviews, Code Stories

Code reviews are a tool that incites a lot of strong emotions from developers. Those who love them write tweets like this:
"Code review is hard - both reviewing and being reviewed make people cranky. But I know of no better way to make great software." -- @yminsky
This tweet, even in its adoration, still captures some of what people hate about code reviews. They make people cranky, to put it mildly. I personally don't like code reviews very much. In my experience, they are just as likely to be used to bully, score points, or waste time on pedantic style notes as they are to produce great software. In fact, I think code reviews are absolutely necessary in only three kinds of situations:

1) You don't have good automated testing practices in place to catch most errors at or before checkin.
2) You work on a big project with a widely geographically distributed team of developers (as with most open source projects).
3) You work at a company with many developers and huge code bases, where strict style conformity is very important.

In my company, we have one team that falls into slot number 1, and they do code reviews for all checkins. The rest of us do ad-hoc code reviews and occasional code walkthroughs, and we rely on heavy use of testing to ensure quality. In general, I feel this strikes a good balance between code correctness and checkin velocity.

The one thing this misses, though, is a built-in opportunity for developers to learn from each other. I like pair programming, and I encourage my developers to do it when they are first starting projects or figuring out tricky bugs. But much of the time my developers are working in parallel on different code bases, and I felt we were missing the learning moments that code reviews and pairing can provide. So I started asking people in interviews how they encouraged their teams to work together and learn from each other, and I got a great suggestion: encourage developers to show off work they're proud of to other team members. Thus the Power Hour was born.

"Power Hour" is an hour that we do every Friday morning with my team and our sister team of warehouse developers. In that hour every developer pairs up with someone else and they take turns sharing something they did that week. Show off something you're proud of, ask for help with a problem you're stuck on, or get feedback on the way something was or is being designed. The power and choice of what to show lies in the hands of the person showing it, which takes away the punitive nature of code reviews; all they are required to do is to share some code. Show off, get help/feedback, and then switch. It's not about code review, that line-by-line analysis and search for errors big and small. It's about code stories, whether they are war stories, success stories, or stories still being written.

There are a lot of features of code review that this Power Hour doesn't satisfy. It's not a quiet opportunity to evaluate someone else's code without them looking over your shoulder. It's driven by the person who wrote the code, and they are encouraged to highlight the good parts and maybe even completely bypass the ugly parts. It's not possible for the other person to fully judge the quality of what they're seeing, but judging isn't the point at all; Power Hour is explicitly a learning and sharing opportunity, not an occasion for evaluation.

This has turned out to be a smashing success. It's amazing how much you can teach and learn in a concentrated hour of working with another person. Last week I showed off a new Redis-based caching implementation I wrote, and got to see all the code and a demo of a new Shiro authentication and entitlement system. I caught a few things I still needed to clean up in the process of explaining my work, and so did my partner. If you, like me, shun formal code reviews, I encourage you to try out a weekly turn at code stories. It's a great way to build up teamwork, encourage excellence, and produce great software, without the crankiness.

Process Debt and Team Scalability

Most of us in tech spend time thinking about technical debt. Whether it's a short-term loan or a massive mortgage, we've all seen the benefits and costs of managing and planning a code base with technical debt, and we are generally familiar with ways to identify, measure and eliminate this debt.

What about process debt? Process debt is a lot like technical debt, in that it can be a hack or a lack. The lack of process as debt is pretty easy to see: for example, never bothering to set up an automated continuous build for your project. That's a fine piece of process to skip when you have only one or two developers working on a code base, but as your team grows, the lack of process will start to take its toll in broken code and wasted developer time.

There are also plenty of ugly process hacks. Take, for example, a team of varying skill levels and a code base that is increasingly polluted by poorly formatted, totally unreadable code. Faced with this problem, you may be tempted to institute a rule that every line of code must be reviewed by a teammate. Later you discover that the two people with the worst style are reviewing each other, and things still aren't improving enough. So you declare that everyone must have their code reviewed by one of the senior developers you have appointed. Now you have better style, but at the cost of everyone's productivity, especially that of your senior developers.

Adding code reviews was the easiest path to take. You just hacked in some process instead of paying up front to think about what the best process would be. But process, once added, is hard to remove. This is especially true for process intended to alleviate risk. Maybe you realize after a few months that the folks with questionable style have learned the ropes, and you remove the rule. But what if you added code reviews not to fix a style problem, but to act as a risk mitigant? I've done this myself in the case of a code base with very little automated testing and a string of bad releases. I'm now paying for my technical debt (a lack of automated testing) by taking out a loan on process.

An easy way to identify process debt is to ask yourself the following question: Will this solution still work when I have twice as many developers? What about ten times as many? A hundred? Process debt, like technical debt, makes scaling very difficult. If every decision requires several meetings and a sign-off, if every git pull is a roll of the dice for your continued productivity, or if you have to hire three project managers for every five developers just to keep track of the task list, you're drowning yourself in process debt.

Developers are wary of process because we too often experience process debt of the hack sort. I believe that both a lack of process and an excess of process can cause problems, but if we begin to look at process debt with the same awareness and calculations that we apply to technical debt, we can refactor our playbook the way we refactor our code and end up with something functional, simple, and maybe even elegant.

Budgeting for Error

What's your uptime SLA over the last month, six months, year? Do you know it off the top of your head? Is your kneejerk response, "as close to 100% as possible"? Consider this: by not knowing your true current SLA, you not only turn a blind eye to a critical success metric for your systems, you also lose the ability to budget within the margins of that metric.

There are about 8765 hours in a year. How many of those hours do you believe your code absolutely needs to be up to keep your business successful? 99% of the time buys you almost 88 hours of downtime over the course of a year. 99.9% of the time still buys you almost 9 hours. Even 99.99% gives you about 52 minutes a year that your systems can be down. Think of what you can do with those minutes. Note that the much-vaunted "5 9s" reliability (99.999%) breaks down to only about 5 minutes over the course of a year, which is great if you're Google or the phone company, but probably not a smart goal for your average startup.
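If you want to play with these numbers yourself, here is a minimal sketch of the arithmetic; the class and method names are purely illustrative, not part of any real tool.

    // A minimal sketch of the downtime-budget arithmetic above.
    // Names and the use of 8,765 hours per year are illustrative.
    public class ErrorBudget {
        private static final double HOURS_PER_YEAR = 8765.0;

        // Allowed downtime in minutes per year for a given uptime SLA (e.g. 0.999).
        static double downtimeMinutesPerYear(double uptimeSla) {
            return HOURS_PER_YEAR * (1.0 - uptimeSla) * 60.0;
        }

        public static void main(String[] args) {
            for (double sla : new double[] {0.99, 0.999, 0.9999, 0.99999}) {
                System.out.printf("%.3f%% uptime -> %.0f minutes of downtime per year%n",
                        sla * 100, downtimeMinutesPerYear(sla));
            }
        }
    }

Running this reproduces the budget at each tier: roughly 5,259 minutes (about 88 hours) at 99%, down to about 5 minutes at five 9s.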

Let's say you know that your deployment process is rock solid and never causes outages, and that the way you've architected your systems means you will never need planned hardware downtime. But you also know that you have some risky features you want to push now, before you announce a critical partnership that should result in a big membership bump. If you're sure the bump itself won't cause downtime, you might choose to push the features and spend some of that downtime budget smoothing out rough edges in the code, so that you have a really compelling site for those new members.

On the other hand, if you know that you're pretty solid under your current load but that a 50% increase in usage has the potential for some degree of system failure, your error budget might not accommodate both the risky new features and the membership growth. And if the business pushes you to take on both? Make sure they know that your SLA may suffer as a consequence. When you know your goal SLA, and you know something is likely to reduce or violate it, that's a strong signal to think carefully about the risks of the project. This can also be a useful negotiation tool when you're being pushed to ship a feature you don't think is ready for prime time. When they say "we need to release this new feature today," and that release means at least two hours of downtime that pushes you out of SLA, it becomes their job to get authorization from the CTO instead of your job to convince them why it's a bad idea.

I will admit that I do not currently have an uptime SLA for my services. Until recently, it never occurred to me that there would be any value in trying to pin down a number and measure against it. As a result, while liveness and stability are always a consideration, I haven't taken the time to think through the rest of the year when it comes to hardware upgrades, new features, or deployment risk as measured by likely downtime impact. I'm missing out on a key success metric for my infrastructure.

Once I've managed to nail down a coarse-grained uptime SLA for my systems, the next phase of this work is to nail down a more fine-grained response time SLA: out of every 100 requests, what is the 95th percentile response time from my infrastructure services? This is much trickier than a simple uptime SLA because of the interaction of multiple systems, each with their own SLAs. For now, though, I need to focus on the big picture.
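The percentile calculation itself is the easy part; the hard part is composing it across services. As a rough sketch, assuming a simple nearest-rank calculation over a window of samples (the names here are mine, not any particular monitoring tool's):

    import java.util.Arrays;

    // A rough sketch of a 95th-percentile response time over a window of samples.
    public class ResponseTimes {
        static double p95(double[] responseTimesMillis) {
            if (responseTimesMillis.length == 0) {
                throw new IllegalArgumentException("no samples");
            }
            double[] sorted = responseTimesMillis.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(0.95 * sorted.length); // 1-based nearest rank
            return sorted[rank - 1];
        }
    }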

Parallelism and the Limits of Languages

While we make heavy use of the core of Clojure, we don't use its concurrency primitives (atoms, refs, STM, etc.) because a function like pmap doesn't have enough fine-grained control for our needs. We opt instead to build our own concurrency abstractions in Clojure on top of the outstanding java.util.concurrent package.

Once upon a time, in the days when RAM was not quite as cheap as it is now, I worked on a team that had a giant in-memory cache of data that it used to do various financial calculations. This data had, with the growth of the financial markets it represented, grown too big to fit into the memory of a standard server without having GC absolutely wreck the process. Having exhausted all minor efforts to tune the data, and not wanting to buy an actual supercomputer, the team turned to Azul, which at that time sold boxes with enough memory to fit our process and a fairly nifty no-pause GC. Great! Except there was one catch: instead of having the super-fast CPUs we were used to, these boxes had hundreds of tiny, slow cores (with rather crappy FPUs). And on those hundreds of tiny CPUs, our code was very, very slow. So we set out to parallelize.

When you're forced into parallelism at that level of detail, you quickly realize how difficult it is to have a generic way of doing it. The fact that languages like Scala and Clojure will let you just say "apply this logic over the collection in parallel" is very cool, and a great way for developers to dip their toes into parallel processing of their collection data. But it breaks down quickly when you are talking about vastly different types of data running through vastly different types of computation. If you need to eke out every ounce of performance, you want every computation tuned to the right batch size and the right number of threads, and the size of the thread pool for task X may depend on whether it tends to run at the same time as task Y or task Z. Don't forget that task Z, if requested, must be prioritized above all else. The size of the batches themselves may depend on the type of object inside them, and how that affects the cache lines of the processor you're running on. Oh, and of course, you're not just tuning this for prod: it has to at least run to completion on your integration boxes, which naturally have slightly different processors. And don't forget, your developers aren't writing code on Azul boxes; they're writing code on desktops with 2 cores each, where spreading even the test data across a pool of 50 threads doesn't make any sense, but they still need to run things in parallel or they'll never see the stupid threading bugs they added when they forgot to use concurrent data structures.
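To make that concrete, here is a minimal sketch of the kind of hand-tuned abstraction this implies, built directly on java.util.concurrent: each task type gets its own pool size and batch size rather than a one-size-fits-all pmap. The names and structure are illustrative, not the actual system described above.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.Consumer;

    // A minimal sketch: run work over a collection in explicitly sized batches
    // on an explicitly sized thread pool. Pool and batch sizes would be tuned
    // per task and per environment (prod vs. integration vs. a 2-core desktop).
    public class TunedParallelism {
        static <T> void processInBatches(List<T> items, int batchSize, int threads,
                                         Consumer<List<T>> work) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<?>> futures = new ArrayList<>();
                for (int i = 0; i < items.size(); i += batchSize) {
                    List<T> batch = items.subList(i, Math.min(i + batchSize, items.size()));
                    futures.add(pool.submit(() -> work.accept(batch)));
                }
                for (Future<?> f : futures) {
                    try {
                        f.get(); // propagate failures from worker threads
                    } catch (ExecutionException e) {
                        throw new RuntimeException(e.getCause());
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }

Even a sketch this small exposes exactly the knobs the paragraph above complains about: the right batchSize and threads for task X may be wrong for task Y, and wrong again on a developer desktop.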

I've always thought that the next truly big leap in VMs and languages will come from those that can do smart things with threading without making developers sweat too much. If we're really entering a world of more and more data spread over more but slower cores, we need a language and a VM that revolutionize our management of threads the way Java and the JVM revolutionized our notion of memory management. The functional languages out there help, but you can't forget the underlying VM magic that will really make this possible. Look at the example of memory management: it's not Java itself that frees you from worrying about deleting references only when they are truly no longer in use; it's the VM, which watches all objects, knows how to find the live ones, and deletes those that are no longer referenced, all without the developer needing to indicate a thing. For us to conquer the thread problem, we need a VM that observes the behavior of the system beyond just the number of cores and helps us appropriately occupy CPUs over collections of varying size and composition. Languages can force developers to think about applying logic over their data (and thus help with parallelism as a mindset), but they will never solve the problem of automated parallelism without a VM to tune it in the background.

There will always be people who need to tune their code down to exactly which instructions run on which core, just as there are still people who need to do their own memory management, but it shouldn't be a requirement for getting good mileage out of those cores. I can't wait for the day when calling submit feels as archaic as calling malloc.

Developer Joy, Part 2

I got some good comments on my last post, including a few submissions. One idea that I intentionally left out was that of working on a project you're passionate about. In fact, I think the point of developer joy is that it has to transcend love for a particular project. It is the stuff that keeps us going even when we're working on code whose purpose we find boring. It is the stuff that makes writing dull-but-important software worthwhile. It's the stuff that, in short, every good tech lead should keep an eye on.

Sadly, most developers are not working on problems that they find intrinsically fascinating. When I came into computing as a career, I told myself that one reason I was doing it was that it had applications in so many areas. And getting to apply computing to the fashion industry is a problem I find rather fascinating (and never in a million years expected to be solving). But I spent many years in industries that bored me to death to get to where I am today, and I still enjoyed myself immensely along the way, so long as I had enough elements of developer joy.

As an illustration of this joy, one of the more enjoyable projects I ever worked on was distributing the test runner for a code base I worked on. The code base had thousands of tests, most of which were slow integration tests, and we expected this test suite to run and pass both on every checkin and before developers checked in code. We'd gotten to the point where tests were taking 4 hours and most of a developer's local CPU cycles. We weren't willing to just get rid of tests, and we weren't able to completely refactor the external dependencies to make the individual tests themselves faster (ah, Sybase stored procedures!), so we set about figuring out how to run this build process across a farm of machines. I like distributed computing, but this problem was more about dealing with the distribution system we used, debugging strange transient errors when we pulled formerly sequential tests apart, and setting up a bunch of infrastructure to get things working.

Why did this project end up being so fulfilling? Well, for one, the team working on the project was incredibly smart. The main developer taught me a ton about the underlying distributed system, and we had some great team brainstorming sessions. The whole project was about testing, and of course we wrote tests for the project components themselves, so it was a well-built system. We had plenty of time to actually work on the problem. And the solution we produced was pretty awesome. In short, it was a fun problem to solve not because I love build systems and tools (a common misconception about me, actually), but because I love solving hard problems, solving them well, and solving them with smart people.

Some developers are always going to need projects that they are passionate about. Those developers tend to found startups, go into academia, or find a spot in a big company with the resources to splash out on their particular niche. But most of us work on problems that we solve to pay the bills. That doesn't mean there isn't joy in a day's work. When that work involves learning new things, perfecting our own expertise, and being able to solve problems and see those solutions delivered, it gives us joy. Tech leads, CTOs, and founders sometimes forget that the people working for them may not have the same love for the project that they do. Keeping up the elements of developer joy helps bridge that inspiration gap.