Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems running around the globe, and I enjoyed the process of finding the best tools for that job no matter how esoteric. I prided myself in looking beyond good enough, and was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start up new projects in Play instead of having to to use one of the other Java frameworks that we've moved forward with, even though the new frameworks have relatively little more complexity. I love Dropwizard to death, and find it to be a good replacement for Play, but it doesn't support JPA out of the box yet, it requires just a bit more thought to get a new project up and running, and even this minimal additional complexity is enough to slow everyone down a noticeable amount on new projects. Every bit of thought we have to put out there is mental overhead that takes away from delivery velocity.

Another painful example of unexpected complexity happened this past week. We are moving our stack towards a service-oriented (SOA) model, and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor, and have a hard limit of 30s per request. Any request that runs longer than 30s will be killed by the load balancer and send an unspecified 500 to our storefront. We have a best practice that our staging environment should be run in the same way as our production environment, but when it comes to setting up load balancers we often forget about this policy, and we released a new major service migration that called some very heavyweight db queries now behind a layer of load balancers. So, we do the release, and immediately test the most intensive requests that could be run. Some of these requests fail, with "unexpected" 500s. Surprise surprise, we forgot about the load balancer, and not only that, but everyone that was doing the release forgot about the sniping and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made in what to use. I have chosen, in this case, to minimize upfront complexity, and I've chosen to go with MongoDB. Moreover, I believe that, of all the NoSQL stores that I have personal experience with out there (HBase, Cassandra, MongoDB), MongoDB is the mostly likely to be long-term, widely successful. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10Gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and I think from small startups to big companies that is generally true. Why do most companies use SQL-based RDBMS systems? They often have some serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can simply try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that balance complexity and power tradeoffs with developer complexity overhead. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying in to hype or speeding towards a brick wall. Of course, not every problem is a nail. But, as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Keep it Simple, Dingus

Earlier this week, I took a few devs from my company over to visit dB had offered to talk front-end technology choices and techniques with us, and it ended up being a very useful overview of lessons that they learned in building out their incredibly beautiful UI.

Much of the content went over my head. As you may be able to tell from the ultra-boilerplate layout of this blog, I'm not much of a front-end developer. But one thing that dB said in passing really hit home. When asked why he had chosen to use (or not use) some technology, he made the point that he wanted to keep the set of different systems he was using very limited to begin with because he thought that the architectural complexity overhead would be bad for a small startup.

I've been thinking a lot about that idea this week as I build out a new back end system to host our product data. We've already decided to go with MongoDB for the storage layer. It's easy to set up, supported by Play modules, and our ops people know how to administer it. Most importantly, it works very well for the type of data that we are sticking into it. Everything about the project has gone smoothly, and so over the past couple of days I've been trying to figure out how to do generic text search over the data. I want to use something that has all the goodies of lucene built-in, and I don't think that the MongoDB text search is going to quite do it (although using Mongo to provide fast filtering has been a dream).

I quickly found myself down a rabbit hole. Solr couldn't immediately parse the bson that Mongo spit out for our documents (not surprising), and I didn't find any easy translators online. I started looking at what modules existed in Play to enable integration with a search system and came across elasticsearch, then fell further down the rabbit hole trying to get the module to support not only JPA objects but morphia-generated Mongo objects. In the process I re-read articles and notes on the idea of using solr as the storage layer with no other backing store, was pointed at SenseiDB, and generally began to feel a sense of despair at the complexity of it all.

I like systems. I like to learn systems, to understand their strengths and weaknesses, to read their code. In the 6 or so weeks I've been on this new job, I've had to resist the urge to push immediately moving to a Hadoop-based analytics platform, our own distributed file system for image stores, possibly a different nosql for our user data. I realize now that I need to fight that urge even more. Using the perfect tool for every job incurs an administrative overhead and attention thrashing that an infrastructure team of 3 cannot possibly hope to manage well.

So tomorrow I'm going to take a step back and think about how I can simplify my text search problem. Maybe the answer is to just do it in Mongo for now, and save the feature/complexity tradeoff for another day. One thing is certain: right now, it's more important that I become an expert in the systems I have than pick the perfect technology for each tiny problem.