Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems running around the globe, and I enjoyed the process of finding the best tools for that job no matter how esoteric. I prided myself on looking beyond good enough, and was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start up new projects in Play instead of having to use one of the other Java frameworks that we've moved forward with, even though the new frameworks are only slightly more complex. I love Dropwizard to death, and find it to be a good replacement for Play, but it doesn't support JPA out of the box yet and it requires just a bit more thought to get a new project up and running. Even this minimal additional complexity is enough to slow everyone down noticeably on new projects. Every bit of thought we have to spend on setup is mental overhead that takes away from delivery velocity.

Another painful example of unexpected complexity happened this past week. We are moving our stack towards a service-oriented (SOA) model, and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor, and have a hard limit of 30s per request. Any request that runs longer than 30s is killed by the load balancer, which then sends an unspecified 500 to our storefront. We have a best practice that our staging environment should be run in the same way as our production environment, but when it comes to setting up load balancers we often forget about this policy. So we released a major service migration that put some very heavyweight database queries behind a layer of load balancers without ever having tested them that way. We do the release, and immediately test the most intensive requests that could be run. Some of these requests fail, with "unexpected" 500s. Surprise surprise, we forgot about the load balancer, and not only that, but everyone who was doing the release forgot about the sniping and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.
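
One cheap guard against this kind of surprise is to enforce a similar budget inside the service, so a slow query fails with a clear message instead of an anonymous 500 from the load balancer. What follows is only a minimal sketch: the TimeBudget class, its method names, and the idea of picking a budget a little under 30s are assumptions for illustration, not code from our services.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a unit of work with a deadline shorter than the
// load balancer's hard limit, and fails loudly when the deadline is missed.
public class TimeBudget {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final long budgetMillis;

    public TimeBudget(long budgetMillis) {
        this.budgetMillis = budgetMillis;
    }

    public <T> T call(String label, Callable<T> work) throws Exception {
        Future<T> future = pool.submit(work);
        try {
            return future.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the query may need its own timeout too
            throw new IllegalStateException(
                    label + " exceeded its " + budgetMillis + " ms budget", e);
        } catch (ExecutionException e) {
            // Surface the original failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

With something like new TimeBudget(25000) wrapped around the heavy queries, the failure at least names the request that blew the limit, which is more than the load balancer will tell you.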

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made in what to use. I have chosen, in this case, to minimize upfront complexity, and I've chosen to go with MongoDB. Moreover, I believe that, of all the NoSQL stores that I have personal experience with (HBase, Cassandra, MongoDB), MongoDB is the most likely to be widely successful over the long term. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10Gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and I think from small startups to big companies that is generally true. Why do most companies use SQL-based relational databases? They often have some serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can simply try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that balance complexity and power tradeoffs with developer complexity overhead. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying in to hype or speeding towards a brick wall. Of course, not every problem is a nail. But, as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Why I'm Moving Away from the Play Framework

I've been using the Play framework since I started at RtR 3 months ago. Last week, I made the decision that no new services will be written in Play from that point forward. It started out as a great little framework that was pretty quick to learn and easy to use, but it's turned into something that I would not recommend anyone use for serious production applications. What happened?

First, I lost faith in the developers.

One of the first things that annoyed me about Play was the inability to run a single test from within a Play test class inside your IDE. I suppose the thinking was that you will always run the Play test app, or something, but I prefer to leave my IDE as little as possible when I'm working, and running an entire test class worked fine. So, being the good little open source programmer that I am, instead of bitching I rolled up my sleeves and fixed the bug. It was a pretty trivial fix. I even wrote a test case. Then I put in a pull request and waited.

After submitting the pull request, I commented on the pull request, commented on the ticket, and finally sent an email to the mailing list. And the response I got was basically that the team is too busy working on the next generation of the product to absorb fixes for the older generation. Having worked on open source projects myself, I understand what it's like to have limited bandwidth to look at changes. But if the project team's bandwidth is so limited that they can't even afford to look at small fixes like this, it seems like the project is basically abandoned. At that point I lost faith that I could rely on the community to support the 1.X branch of this product. Not necessarily a dealbreaker, but definitely a bad sign.

Then, I lost faith in the platform.

We started to hit some serious bugs in the platform during a big push on a complex service. First, our developers who used Mac and Windows hit a bug similar to this, where they just simply couldn't get the app to work no matter what they did. It worked fine in Linux, but even a clean checkout would fail to run for them. It was inexplicable, irritating, and we lost a couple of days of development work trying to get around it (rolling back checkins, pulling out modules, poring through stack traces). By this point, I had lost faith in the community, so I didn't see the point in going to them for help. Fortunately, we did finally get around it (it seemed to be a bug in the CRUD module), but we were all really frustrated and annoyed with the framework after that experience.

Finally, I lost faith in my own ability to debug the framework.

The issues above were enough for me to want to move off of Play for new projects. The thing that caused me to move off of Play for projects that are already in development (but not in production) was this: At some point, we had written a migration job in Play for a major data migration. We discovered the strangest thing would happen. The job would run across several job threads, and at some point, one of the threads would hang. But it would not hang in a way that I have ever seen a JVM thread hang. The thread was in RUNNABLE state, and it was in a HashMap method (either get or put) and it was just sitting there. Not doing anything. No locks, no database or other IO, plenty of memory, plenty of resources, just sitting in that HashMap.get method, hanging out.

Now, maybe you've seen that before (and if you have, please leave a comment!). But I have seen a lot of JVM issues in my day, and this is a new one. There was no reason for this thread to be hung. And yet it was. I can debug just about anything you can throw at me in Java, but I had absolutely nowhere to start looking to debug this issue, except a vague suspicion that it was related to the way the framework was rewriting the classes under the covers. That is a dealbreaker, ladies and gents. I could've probably debugged why the module was causing the app to crap out for my developers, if given enough time. But I cannot say with any certainty that I could debug whatever the hell was causing that thread to hang.
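
If you want to catch a thread in this state without staring at jstack output, the JDK's thread management bean can report it. The following is a rough sketch using only standard JDK calls; the StuckThreadFinder class and its method name are made up for illustration, and it only finds the symptom, not the cause.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists threads that are RUNNABLE with their topmost
// stack frame inside a given class (for example "java.util.HashMap").
public class StuckThreadFinder {

    public static List<String> runnableIn(String className) {
        List<String> names = new ArrayList<>();
        // false, false: skip monitor and synchronizer details, we only need stacks
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread may have exited while the dump was taken
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (info.getThreadState() == Thread.State.RUNNABLE
                    && stack.length > 0
                    && stack[0].getClassName().equals(className)) {
                names.add(info.getThreadName());
            }
        }
        return names;
    }
}
```

A single sample proves nothing, since busy threads pass through HashMap all the time. Calling StuckThreadFinder.runnableIn("java.util.HashMap") a few times, several seconds apart, and seeing the same thread name every time is what would match the hang described above.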

If I felt the developers supporting Play were committed to building a real community of support around the 1.X version, I might have stayed with it longer. It's a giant pain to find something else that is easy, lightweight, supports JPA and doesn't force me to write XML. But I can't use a product that I know has issues even I can't debug, or depend on a team I don't trust to maintain it to the standards my team needs to confidently use it in production.

Networking woes in Java

The only major CS subject I never took a class in was networking. It's kind of ridiculous, looking back, that I took as many systems classes as I did but always eschewed networking. I do own a copy of UNIX Network Programming: Networking APIs: Sockets and XTI; Volume 1, bought at some point in the past when I knew I was going to be doing some distributed systems work and figured it would be a useful reference. But I can't say it's been my constant companion. For I have learned one thing in my years of Java systems coding:

Networking code is HARD.

Here's exhibit A: ZooKeeper monitoring misuses sockets. I spent a good chunk of time desperately trying to figure out why my monitoring commands were crapping out halfway through when run from NY to LN. Turns out, you can't safely expect to just close half a socket, leave the other half open, push some data to it and then close it while seeing all the data through to the other side. Not without a final handshake indicating the client has gotten all the data. Or at least, I think that's the case. The thing is, this will work well enough over a very fast network connection or with very little data. The guarantees around SO_LINGER and related options change from kernel to kernel, and my reading at the time led me to think that the standard Linux kernel behavior in this case may well have changed over the years that ZooKeeper has been around. So we need to completely rip out and redo the monitoring code if we want to have any hope of this working right for other big, global deployments in the future.
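
For contrast, here is a sketch of one conventional request/response shape over a plain socket: the client half-closes its output to say "that's the whole request", the server reads until it sees that end-of-stream before replying, and the client reads until end-of-stream in turn. This is not ZooKeeper's code, the HalfCloseDemo class and its methods are invented for illustration, and whether this shape would have fixed our NY-to-LN problem is an assumption I haven't tested.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HalfCloseDemo {

    // Reads until the peer signals end-of-stream (read returns -1).
    static byte[] readToEof(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Serves one connection: drain the request, write the reply, then close.
    static Thread serveOnce(ServerSocket server, String reply) {
        Thread t = new Thread(() -> {
            try (Socket s = server.accept()) {
                readToEof(s.getInputStream()); // wait for the client's half-close
                OutputStream out = s.getOutputStream();
                out.write(reply.getBytes(StandardCharsets.UTF_8));
                out.flush();
                s.shutdownOutput();            // orderly end of the reply
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Client side: send the command, half-close, then read the whole reply.
    static String request(int port, String command) throws IOException {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
            s.getOutputStream().write(command.getBytes(StandardCharsets.UTF_8));
            s.shutdownOutput();
            return new String(readToEof(s.getInputStream()), StandardCharsets.UTF_8);
        }
    }

    // Runs both ends over loopback on an ephemeral port.
    public static String roundTrip(String command, String reply) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            serveOnce(server, reply);
            return request(server.getLocalPort(), command);
        }
    }
}
```

Over loopback this tells you nothing about a transatlantic link, of course. The point is only the ordering: neither side closes until it has seen the other side's end-of-stream.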

Exhibit B is my current debugging nightmare. Part of our release last week involved a new backend Play service that itself connects to a different backend Play service to prepare results for our storefront. We noticed, several hours after launch, that the service started to throw exceptions that were ultimately caused by Too many open files. I know enough about Java to know that running out of file descriptors is often a Bad Thing. 

So we're leaking sockets. Why? To date, we don't know. The underlying libraries are async-http-client and Netty, but there's very little to indicate what is going on. [1] The sockets show up in netstat/lsof as ESTABLISHED TCP connections to the various storefront servers. But the storefront servers do not have most of these sockets open on their end. How are they ESTABLISHED with no partner? It's an ongoing mystery, one that we haven't been able to reproduce on any other machine (the current theory is bad network hardware/software at the lowest levels, but honestly that's just a shot in the dark and one that we can't verify without taking down a production service).

So, while I keep debugging, what are the takeaways here?

1) You shouldn't write your own socket handling code in Java. Really, no. Don't do it. Use Netty. It's very good. Of all the things not to reinvent yourself, I would put networking at the top of the list with a bullet. It's hard, and requires the kind of deep expertise that you can't fake. And, when you fake it, you may end up with something like our ZooKeeper monitoring, that seems to work for years while hiding small but significant bugs.

2) If you're a system architect designing any kind of web service or distributed system, you should know your Unix socket monitoring commands. lsof is arcane but powerful. netstat is simpler and still quite useful. This article has a few others, like ss and iftop. Know how to up the ulimits for your processes in case you find yourself with a slow socket leak that you need time to debug.
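
You can also watch descriptor usage from inside the JVM, which is handy for logging a warning before "Too many open files" shows up. This is a small sketch, assuming a HotSpot/OpenJDK-style JVM on a Unix-like OS where the platform bean is a com.sun.management.UnixOperatingSystemMXBean; the FdUsage class and the 80% threshold in the usage note are my own inventions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper around the JDK's Unix-specific OS bean.
public class FdUsage {

    // Open descriptor count for this JVM, or -1 if the platform does not expose it.
    public static long open() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1;
    }

    // Descriptor limit for this JVM (the ulimit -n in effect), or -1 if unknown.
    public static long max() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1;
    }

    // True when usage has reached the given fraction of the limit.
    public static boolean nearLimit(long open, long max, double fraction) {
        return open >= 0 && max > 0 && open >= max * fraction;
    }
}
```

Something like FdUsage.nearLimit(FdUsage.open(), FdUsage.max(), 0.8) on a timer would have told us about our leak hours before the exceptions did. It does not replace lsof, which is still what tells you which sockets are leaking.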

Have an idea what my bug is? I'd love to hear it! Leave me a comment or hit me up on twitter!

Edit 2/27: Looks like our bug was indeed on the cloud vendor side; possibly a misconfigured firewall. Moving to a new box and rebuilding the box we were on solved it.

[1] Thank God Play is at least using good networking libraries, because the last time I tested ZooKeeper, when it ran out of sockets the service hard-failed with almost no indication of what had happened.

Quick Wins: Monitoring Request Times in Play with Coda Metrics

My twitter feed has been abuzz about coda metrics for a while now. I decided to finally bite the bullet and try it out, and the result was a very nice quick win for our code base.

We're still using Play at work, and we have a service about to go into production that we've been monitoring through the oh-so-elegant method of "writing log messages". This is fine, but it doesn't tell you how long various request types are taking on average without doing a bit of log parsing, and I'm not much of a scripter.

Today, I promised that I would provide something slightly better to measure how our various endpoints are doing. Cue coda. I've been looking at it on and off for a couple of days, but kept getting hung up on wanting to do things like use the EhCache metrics gathering (not trivial in Play at first glance). Going back to basics, I decided after some thinking that the histograms would be the best thing to use. We were already grabbing method execution time for logging purposes, so all I had to do was insert that into a histogram and it would track the running times. Simple enough. But I want to create these histograms for each method type, and ideally, I just want to put it into our superclass controller that is already set up to capture the method timings and log.

Fortunately, Play has lots of nice information floating around in the "request" object of its controllers. Using that object, I can see what Controller subclass this request is destined for, as well as the method that will be called on that class. So I have enough information to create the histogram for each method, like so:

Histogram histo = Metrics.newHistogram(request.controllerClass, request.actionMethod, "requests");

Great. But I was a little tired, and thought that I needed to keep these around, so I stuck them in a ConcurrentHashMap associated with a unique key based on the controller class and the action method. Turns out though, if you look in the MetricsRegistry source code, you'll see that in fact you don't really need to do this at all. As long as the "MetricName" that would be generated for your metric is the same, the same metric will be used for the monitoring. Now THAT is the kind of clever code I like to see.

I decided to keep my ConcurrentHashMap around anyway, to save myself the (utterly trivial) overhead of creating the various objects passed in to the registry by newHistogram. The resulting code is embarrassingly simple. So simple, in fact, I wanted to make it more complicated and it took me 3 revisions to realize how little code I actually needed.
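
The look-up-or-create idea itself is easy to sketch with plain JDK classes. To be clear, this is not the Metrics library's code and not our BaseController: TimingRegistry is a made-up class, and LongSummaryStatistics is only standing in for a real histogram so the example runs without any dependencies.

```java
import java.util.LongSummaryStatistics;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in: one statistics object per (controller, action) pair,
// created on first use and reused on every later request.
public class TimingRegistry {
    private final ConcurrentMap<String, LongSummaryStatistics> stats =
            new ConcurrentHashMap<>();

    // Same controller class and action name always produce the same key.
    static String key(Class<?> controller, String action) {
        return controller.getName() + "." + action;
    }

    public LongSummaryStatistics forAction(Class<?> controller, String action) {
        // computeIfAbsent creates the entry at most once, even under contention.
        return stats.computeIfAbsent(key(controller, action),
                k -> new LongSummaryStatistics());
    }

    public void record(Class<?> controller, String action, long millis) {
        LongSummaryStatistics s = forAction(controller, action);
        synchronized (s) { // LongSummaryStatistics is not thread-safe on its own
            s.accept(millis);
        }
    }
}
```

In a controller superclass, a single call along the lines of registry.record(request.controllerClass, request.actionMethod, elapsedMillis) after each request would be the whole integration, which is about how little code the real thing needed too.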

Here is the resulting BaseController, on GitHub, in a skeleton Play application.

I'm a bit sleep-deprived, so if I missed something, be sure to leave a comment or hit me up on twitter!