Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems running around the globe, and I enjoyed the process of finding the best tools for that job no matter how esoteric. I prided myself on looking beyond good enough, and was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start up new projects in Play instead of having to use one of the other Java frameworks that we've moved forward with, even though the new frameworks are only slightly more complex. I love Dropwizard to death, and find it to be a good replacement for Play, but it doesn't support JPA out of the box yet and it requires just a bit more thought to get a new project up and running. Even this minimal additional complexity is enough to slow everyone down a noticeable amount on new projects. Every bit of thought we have to put out there is mental overhead that takes away from delivery velocity.

Another painful example of unexpected complexity happened this past week. We are moving our stack towards a service-oriented architecture (SOA), and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor, and have a hard limit of 30s per request. Any request that runs longer than 30s is killed by the load balancer, which then sends an unspecified 500 to our storefront. We have a best practice that our staging environment should be run in the same way as our production environment, but when it comes to setting up load balancers we often forget about this policy. This time we released a major service migration that called some very heavyweight db queries, which now sat behind a layer of load balancers. We did the release, and immediately tested the most intensive requests that could be run. Some of these requests failed with "unexpected" 500s. Surprise surprise, we forgot about the load balancer, and not only that, but everyone who was doing the release forgot about the sniping and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.
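
One way to make a limit like this harder to forget is to encode it in the service itself. What follows is only a sketch, not what we actually run: RequestBudget and runWithin are made-up names, and the 5s of headroom under the 30s cutoff is an arbitrary assumption. The idea is that slow work fails inside the service with a descriptive error before the load balancer can snipe it.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs work under a deadline shorter than the load balancer's.
class RequestBudget {
    // The 30s hard limit comes from the load balancer; the 5s headroom is an assumption.
    static final Duration LOAD_BALANCER_LIMIT = Duration.ofSeconds(30);
    static final Duration DEFAULT_BUDGET = LOAD_BALANCER_LIMIT.minusSeconds(5);

    static <T> T runWithin(Duration budget, Callable<T> work) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(work);
        try {
            // Wait for the result, but never longer than the budget.
            return future.get(budget.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow work
            throw new IllegalStateException(
                "Request exceeded its " + budget.toMillis() + "ms budget", e);
        } catch (ExecutionException e) {
            // Surface the original failure rather than the executor's wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String result = runWithin(DEFAULT_BUDGET, () -> "fast query result");
        System.out.println(result);
    }
}
```

A timeout raised this way at least names its own cause in the logs, which is more than the load balancer's 500 does.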

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made in what to use. I have chosen, in this case, to minimize upfront complexity, and I've chosen to go with MongoDB. Moreover, I believe that, of all the NoSQL stores that I have personal experience with (HBase, Cassandra, MongoDB), MongoDB is the most likely to be widely successful over the long term. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10Gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and I think from small startups to big companies that is generally true. Why do most companies use SQL-based relational databases? They often have some serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can simply try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that weigh the power a complex tool buys you against the overhead that complexity imposes on your developers. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying into hype or speeding towards a brick wall. Of course, not every problem is a nail. But, as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Framework Developers, Application Developers

I was chatting over drinks with a buddy of mine (All Things All Things, aka Joe Stein) the other day, and we both agreed that we were annoyed with open source frameworks that seemed like they were built by people who had never written applications using said frameworks, and sometimes by people who seemed to have never developed applications at all. I've been both an application developer and a framework developer, and I can say without question that the worst job I've ever done with a codebase was on a framework that I never used and didn't originate myself. Why does this happen? I'm a good developer, but I'm not immune to the common pitfalls of framework/library development.

Pitfall 1: Never running a feature in a real application
I think this is a very common problem of frameworks developed by people that aren't actively using them. You think of a cool feature, or maybe some user asks you for one, and you spec it out and implement it. You hopefully write some good unit and integration tests, and everything seems to work. But of course, you neglected to test things like what happens when the whole system is rebooted and the state of this feature changes. With certain kinds of features especially, you can build them half right and have them silently fail for a long time before anyone notices. Quotas in ZooKeeper are an excellent example of this: a monitoring feature that worked until the quota was written to snapshot, and didn't seem to be used by any of the maintainers of the project. (cf. this not very descriptive jira)
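
The kind of test that tends to get skipped looks something like the sketch below. To be clear, this is not ZooKeeper's quota code: QuotaTracker, its snapshot format, and its method names are all invented for illustration. The point is only that the check runs after the state has gone through a save and restore, which is where a half-right feature usually falls over.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical feature with persistent state; NOT ZooKeeper's quota implementation.
class QuotaTracker {
    private final int limit;
    private int used;

    QuotaTracker(int limit, int used) {
        this.limit = limit;
        this.used = used;
    }

    void record(int amount) {
        used += amount;
    }

    boolean overQuota() {
        return used > limit;
    }

    // Writes the feature's state the way a server might before shutting down.
    byte[] snapshot() throws IOException {
        Properties state = new Properties();
        state.setProperty("limit", Integer.toString(limit));
        state.setProperty("used", Integer.toString(used));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        state.store(out, "quota snapshot");
        return out.toByteArray();
    }

    // Rebuilds the feature from a snapshot, standing in for a reboot.
    static QuotaTracker restore(byte[] snapshot) throws IOException {
        Properties state = new Properties();
        state.load(new ByteArrayInputStream(snapshot));
        return new QuotaTracker(
            Integer.parseInt(state.getProperty("limit")),
            Integer.parseInt(state.getProperty("used")));
    }

    public static void main(String[] args) throws IOException {
        QuotaTracker before = new QuotaTracker(10, 0);
        before.record(12);
        QuotaTracker after = restore(before.snapshot());
        // The behavior must be the same on both sides of the "reboot".
        if (before.overQuota() != after.overQuota()) {
            throw new AssertionError("feature state lost across restart");
        }
        System.out.println("over quota after restart: " + after.overQuota());
    }
}
```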

Pitfall 2: Never having to test application code that uses this framework
I'm hitting this a bit in my usage of the Play framework. It's a framework that did have a lot of testing features built into it but... they neglected to implement Filterable in their JUnit runner, so you can't run a single test out of a class in your IDE. I submitted a fix for this feature a few weeks ago that has been withering on the vine, despite the fact that this is an incredibly annoying thing to overlook and a trivial thing to fix. The framework also doesn't support changing the http port on the command line when running tests automatically. Why would you need to? You wouldn't, unless you happen to have a code base with several active branches in development that are also being automatically tested, as I do right now. The framework developers may never get bitten by this, but it's definitely an annoyance as an application developer using the framework.
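
For the port problem, what I'd want from a test harness is roughly the sketch below. It is not a Play option: the test.http.port property name and the TestPort class are made up for illustration. It takes the port from the command line when one is given (for example -Dtest.http.port=9123) and otherwise asks the OS for a free one, so test runs on parallel branches don't fight over the same port.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper for a test harness; not part of the Play framework.
class TestPort {
    // Assumed property name, chosen only for this example.
    static final String PROPERTY = "test.http.port";

    static int resolve() throws IOException {
        // Integer.getInteger reads a system property and returns null if it is unset.
        Integer configured = Integer.getInteger(PROPERTY);
        if (configured != null) {
            return configured;
        }
        // Binding to port 0 asks the OS to pick any free port.
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("tests will use http port " + resolve());
    }
}
```

The fallback has a small race, since another process could grab the port between the probe socket closing and the test server binding to it, so a build that needs certainty should still pass the property explicitly.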

Pitfall 3: Throwing in everything and the kitchen sink
I recently saw a retweet asking why the hell Guava would add an Event Bus feature. Does that really belong in a collections framework? When your whole life is the framework you're developing, sometimes no feature seems too small or too unrelated. Unfortunately, putting in too much for the sake of completeness can make your code harder for application developers to fully grasp. If your library has several subtle variations of a method with slightly different argument lists, and I have to constantly check the javadoc and stop to think every time I try to use it, I'm likely to use it less, or just find one way to do it and always do it that way. I will reject, and have rejected, libraries on the basis of being overly feature-laden. I don't always want or need complexity, and I'd frequently rather work around a small missing element than spend my life searching for exactly the method I want to call.

Pitfall 4: Making your library difficult to read and debug through
When you coat everything in layers upon layers of indirection, reflection, deeply nested interface hierarchies, and painful call graphs, it's hard for your users to figure out what the hell is actually going to happen, and painful to debug through the code when something goes wrong. I can't possibly be the only developer that learns libraries half by reading the documentation, and half by just calling the method that seems right and reading through the code when it doesn't work. This is largely why I absolutely despise Fluent-style development. When it is done perfectly and just works (as in perhaps the case of something like Mockito), it's verbose but acceptable. When it's in a place where there are lots of links in the chain where something could go wrong, it is an absolute nightmare to read and debug. I'm keeping the call stack of my own application in my head, please make your library as easy as possible for me to add to that mental complexity.

The best way to get over most of these pitfalls is to have at least one person on your framework team that actually uses the framework you're developing for something else. Barring that, listen to your users carefully. When they are confused about what to call, frustrated by the difficulty of debugging, or complaining about the difficulty of testing code that uses your framework, these aren't problems to treat lightly. Remember, your framework succeeds or fails based not on its own internal merits, but on how many people actually use it to develop other code. Application developers are a framework developer's best friend.