Applications

Code Reviews, Code Stories

Code reviews are a tool that incites strong emotions from developers. Those who love them write tweets like this:
Code review is hard - both reviewing and being reviewed make people cranky. But I know of no better way to make great software. -- @yminsky
This tweet, even in its adoration, still captures some of what people hate about code reviews: they make people cranky, to put it mildly. I personally don't like code reviews very much. In my experience, they are just as likely to be used to bully, score points, or waste time on pedantic style notes as they are to produce great software. In fact, I think code reviews are absolutely necessary in only three kinds of situations:

1) You don't have good automated testing practices in place to catch most errors at or before checkin.
2) You work on a big project with a geographically distributed team of developers (as with most open source projects).
3) You work at a company with many developers and huge code bases, where strict style conformity is very important.

In my company, we have one team that falls into slot number 1, and they do code reviews for all checkins. The rest of the teams do ad-hoc code reviews and occasional code walkthroughs, relying on heavy use of testing to ensure quality. In general I feel that this enables a good mix of code correctness and checkin velocity.

The one thing this misses, though, is a built-in opportunity for developers to learn from each other. I like pair programming, and encourage my developers to do it when they are first starting projects or figuring out tricky bugs. But a lot of the time my developers are working in parallel on different code bases, and I felt that we were missing the learning moments that code reviews and pairing can provide. So I started asking people in interviews how they encouraged their teams to work together and learn from each other, and I got a great suggestion: encourage the developers to show off work that they're proud of to other team members. Thus the Power Hour was born.

"Power Hour" is an hour that we do every Friday morning with my team and our sister team of warehouse developers. In that hour every developer pairs up with someone else and they take turns sharing something they did that week. Show off something you're proud of, ask for help with a problem you're stuck on, or get feedback on the way something was or is being designed. The power and choice of what to show lies in the hands of the person showing it, which takes away the punitive nature of code reviews; all they are required to do is to share some code. Show off, get help/feedback, and then switch. It's not about code review, that line-by-line analysis and search for errors big and small. It's about code stories, whether they are war stories, success stories, or stories still being written.

There are a lot of functions of code review that Power Hour doesn't fulfill. It's not a quiet opportunity to evaluate someone else's code without them looking over your shoulder. It's driven by the person who wrote the code, and they are encouraged to highlight the good parts and maybe even bypass the ugly parts entirely. It's not possible for the other person to fully judge the quality of what they're seeing, but judging isn't the point at all. Power Hour is a learning and sharing opportunity, explicitly not an occasion for judgment.

This has turned out to be a smashing success. It's amazing how much you can teach and learn in a concentrated hour of working with another person. Last week I showed off a new Redis-based caching implementation I wrote, and got to see all the code and a demo of a new Shiro authentication and entitlement system. I caught a few things I still needed to clean up in the process of explaining my work, and so did my partner. If you, like me, shun formal code reviews, I encourage you to try out a weekly turn at code stories. It's a great way to build up teamwork, encourage excellence, and produce great software, without the crankiness.

War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend and into Java backend services, and we took another big step down that road this week by moving all of the logic for filtering our product grids into our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good: we've shaved almost 3s off the average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only grow so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't keep hitting slow full GCs.

The fix seemed fairly straightforward: take the cache, which had originally mapped parameters to objects, and instead cache just the objects' primary ids. So the project lead coded up the fix, and we pushed it out.

Here's the fix, roughly the shape of the sketch below. Notice anything wrong? I didn't. We're big fans of Guava and use list transformers all over the place in our code base. So we load tested it again, and it looked OK for what our load tests are minimally worth, then pushed it onto one of our prod boxes and gave it a spin.
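This is a reconstruction from the description above, with hypothetical class and variable names rather than the actual diff, but the shape is the same:

    // A rough sketch of the change, reconstructed from the description above.
    // Class and variable names are hypothetical; this is not the actual diff.
    import com.google.common.base.Function;
    import com.google.common.collect.Lists;
    import net.sf.ehcache.Cache;
    import net.sf.ehcache.Element;

    import java.util.List;

    class ProductFilterCacher {
        // Map a product down to its primary id.
        static final Function<Product, String> TO_ID = new Function<Product, String>() {
            @Override public String apply(Product product) { return product.getId(); }
        };

        static void cacheFilterResults(Cache cache, String filterKey, List<Product> products) {
            // Cache just the ids instead of the full objects.
            List<String> ids = Lists.transform(products, TO_ID);
            cache.put(new Element(filterKey, ids));
        }

        // Minimal stand-in for the real domain object.
        static class Product {
            private final String id;
            Product(String id) { this.id = id; }
            String getId() { return id; }
        }
    }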

At first, it seemed just fine. It hummed along, seemingly taking less memory, but slowly and surely the heap grew and grew, and garbage collection ran more and more often. We took it out of the load balancer, forced a full GC, and it still had over 600MB of live heap. What was going on?

I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise; we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function" that pointed to our base class.

As it turns out, Lists.transform is lazy. Instead of applying the transformer to the list immediately and returning the results, you get back a view that acts like a list but only applies the transform to each element as you retrieve it, holding references to the original list and the transform function all the while. That's great for saving a bit of work up front, but absolutely terrible if you're caching the result to save yourself memory. Now, to be fair, Guava does tell you in the javadoc that the transform is lazy, but not until you get to the third part of the doc, and we were even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we now call Lists.newArrayList on the result and cache that. Finally, problem solved.
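In code, the difference is one wrapping call. Continuing the hypothetical sketch from above:

    // Continuing the hypothetical sketch: the lazy view versus the materialized copy.
    import com.google.common.collect.Lists;
    import net.sf.ehcache.Cache;
    import net.sf.ehcache.Element;

    import java.util.List;

    class ProductFilterCacherFixed {
        static void cacheFilterResults(Cache cache, String filterKey,
                                       List<ProductFilterCacher.Product> products) {
            // Before: Lists.transform returns a view that hangs on to the original product list
            // and the transform Function, so the cache was pinning every Product, not just ids.
            //   List<String> ids = Lists.transform(products, ProductFilterCacher.TO_ID);

            // After: Lists.newArrayList walks the view once, applies the transform eagerly, and
            // copies the results into a plain ArrayList, so only the id strings stay in the cache.
            List<String> ids = Lists.newArrayList(Lists.transform(products, ProductFilterCacher.TO_ID));
            cache.put(new Element(filterKey, ids));
        }
    }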

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems running around the globe, and I enjoyed the process of finding the best tools for that job, no matter how esoteric. I prided myself on looking beyond good enough, and I was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start new projects in Play instead of having to use one of the other Java frameworks we've moved forward with, even though the new frameworks add relatively little complexity. I love Dropwizard to death and find it a good replacement for Play, but it doesn't support JPA out of the box yet, and it requires just a bit more thought to get a new project up and running; even that minimal additional complexity is enough to slow everyone down a noticeable amount on new projects. Every bit of thought we have to put out there is mental overhead that takes away from delivery velocity.

Another painful example of unexpected complexity happened this past week. We are moving our stack towards a service-oriented (SOA) model, and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor and have a hard limit of 30s per request: any request that runs longer than that is killed by the load balancer, which sends an unspecified 500 back to our storefront. We have a best practice that our staging environment should be run the same way as our production environment, but when it comes to setting up load balancers we often forget about this policy, and we released a major service migration that put some very heavyweight db queries behind a layer of load balancers for the first time. So we did the release and immediately tested the most intensive requests that could be run. Some of those requests failed with "unexpected" 500s. Surprise, surprise: we had forgotten about the load balancer, and not only that, but everyone doing the release had forgotten that it snipes long-running requests and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made in what to use. I have chosen, in this case, to minimize upfront complexity and go with MongoDB. Moreover, I believe that, of all the NoSQL stores I have personal experience with (HBase, Cassandra, MongoDB), MongoDB is the most likely to be widely successful over the long term. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and from small startups to big companies I think that is generally true. Why do most companies use SQL-based RDBMS systems? They often have serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can simply try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that balance the power gained from complexity against the overhead it imposes on your developers. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying into hype or speeding towards a brick wall. Of course, not every problem is a nail. But, as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Framework Developers, Application Developers

I was chatting over drinks with a buddy of mine (All Things All Things, aka Joe Stein) the other day, and we both agreed that we were annoyed with open source frameworks that seem to have been built by people who never wrote applications using said frameworks, and sometimes by people who seem never to have developed applications at all. I've been both an application developer and a framework developer, and I can say without question that the worst job I've ever done with a codebase was on a framework that I never used myself and didn't originate. Why does this happen? I'm a good developer, but I'm not immune to the common pitfalls of framework/library development.

Pitfall 1: Never running a feature in a real application
I think this is a very common problem with frameworks developed by people who aren't actively using them. You think of a cool feature, or maybe some user asks you for one, and you spec it out and implement it. You hopefully write some good unit and integration tests, and everything seems to work. But of course, you neglected to test things like what happens when the whole system is rebooted and the state of this feature changes. With certain kinds of features, you can build something half right and have it silently fail for a long time before anyone notices. Quotas in ZooKeeper are an excellent example of this: a monitoring feature that worked until the quota was written to a snapshot, and that didn't seem to be used by any of the maintainers of the project. (cf this not very descriptive jira)

Pitfall 2: Never having to test application code that uses this framework
I'm hitting this a bit in my use of the Play framework. It's a framework that did have a lot of testing features built into it, but... they neglected to implement Filterable in their JUnit runner, so you can't run a single test out of a class in your IDE. I submitted a fix for this a few weeks ago that has been withering on the vine, despite the fact that this is an incredibly annoying thing to overlook and a trivial thing to fix. The framework also doesn't support changing the http port on the command line when running tests automatically. Why would you need to? Well, perhaps you have a code base with several active branches in development that are all being automatically tested, as I do right now. The framework developers may never get bitten by this, but it's definitely an annoyance for an application developer using the framework.
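For context, Filterable is a one-method interface in JUnit 4, and a runner that wraps another runner only needs to delegate the call. A minimal sketch of the missing piece, with a hypothetical runner name rather than Play's actual code:

    // A sketch of a wrapping JUnit 4 runner that supports "run just this one test" by
    // implementing Filterable and delegating to the runner it wraps. WrappingAppRunner
    // is a hypothetical name; the JUnit types are real.
    import org.junit.runner.Description;
    import org.junit.runner.Runner;
    import org.junit.runner.manipulation.Filter;
    import org.junit.runner.manipulation.Filterable;
    import org.junit.runner.manipulation.NoTestsRemainException;
    import org.junit.runner.notification.RunNotifier;
    import org.junit.runners.BlockJUnit4ClassRunner;
    import org.junit.runners.model.InitializationError;

    public class WrappingAppRunner extends Runner implements Filterable {
        private final BlockJUnit4ClassRunner delegate;

        public WrappingAppRunner(Class<?> testClass) throws InitializationError {
            // A real framework runner would also boot the application context here.
            this.delegate = new BlockJUnit4ClassRunner(testClass);
        }

        @Override
        public Description getDescription() {
            return delegate.getDescription();
        }

        @Override
        public void run(RunNotifier notifier) {
            delegate.run(notifier);
        }

        // This is the piece that lets an IDE narrow the run down to a single test method;
        // without it, the whole class runs every time.
        @Override
        public void filter(Filter filter) throws NoTestsRemainException {
            delegate.filter(filter);
        }
    }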

Pitfall 3: Throwing in everything and the kitchen sink
I recently saw a retweet asking why the hell Guava would add an Event Bus feature. Does that really belong in a collections framework? When your whole life is the framework you're developing, sometimes no feature seems too small or too unrelated. Unfortunately, putting in too much for the sake of completeness can make your code harder for application developers to fully grasp. If your library has several subtle variations of a method with slightly different argument lists, and I have to constantly check the javadoc and stop to think every time I use it, I'm likely to use it less, or to just find one way to do things and always do it that way. I will, and have, rejected libraries in the past for being overly feature-laden. I don't always want or need complexity, and I'd frequently rather work around a small missing element than spend my life searching for exactly the method I want to call.

Pitfall 4: Making your library difficult to read and debug through
When you coat everything in layers upon layers of indirection, reflection, deeply nested interface hierarchies, and painful call graphs, it's hard for your users to figure out what the hell is actually going to happen, and painful to debug through the code when something goes wrong. I can't possibly be the only developer who learns libraries half by reading the documentation and half by just calling the method that seems right and reading through the code when it doesn't work. This is largely why I absolutely despise fluent-style development. When it is done perfectly and just works (as is perhaps the case with something like Mockito), it's verbose but acceptable. When there are lots of links in the chain where something could go wrong, it is an absolute nightmare to read and debug. I'm already keeping the call stack of my own application in my head; please make your library as easy as possible to add to that mental load.
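To make the complaint concrete, here is a toy illustration with an entirely made-up API (nothing to do with Mockito or any real library):

    // A hypothetical example of why long fluent chains hurt: when a chain blows up, the
    // stack trace points at one line and gives no hint about which link went wrong.
    public class FluentChainDemo {
        static class Client {
            Account forAccount(String id) { return null; }  // simulate a lookup miss
        }
        static class Account {
            Orders fetchOrders() { return new Orders(); }
        }
        static class Orders {
            String summarize() { return "0 orders"; }
        }

        public static void main(String[] args) {
            Client client = new Client();

            // Chained: the NullPointerException lands on this single line, and it is not
            // obvious which call in the chain returned null.
            //   String summary = client.forAccount("missing").fetchOrders().summarize();

            // Unchained: each step gets its own line in the stack trace, its own breakpoint,
            // and its own place to log or null-check.
            Account account = client.forAccount("missing");
            if (account == null) {
                System.err.println("no such account");
                return;
            }
            System.out.println(account.fetchOrders().summarize());
        }
    }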

The best way to get over most of these pitfalls is to have at least one person on your framework team who actually uses the framework you're developing for something else. Barring that, listen to your users carefully. When they are confused about figuring out what to call, frustrated by the difficulty of debugging, or complaining about the difficulty of testing your framework, these aren't problems to treat lightly. Remember, your framework succeeds or fails not on its own internal merits, but on how many people actually use it to develop other code. Application developers are a framework developer's best friend.