release management

War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend to Java backend services, and we took another big step down that road this week by moving all of the logic for filtering of our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good, we've saved almost 3s average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow FGCs.

The fix seemed fairly straightforward; take the cache, which was originally caching parameters mapped to objects, and instead just cache the object primary ids. So the project lead coded up the fix, and we pushed it out. 

Here's the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load test that again, and it looks ok for what our load tests are minimally worth, so we push it onto one of our prod boxes and give it a spin.

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly but surely the heap grew and grew, and garbage collected more and more. We took it out of the load balancer, forced a full GC, and it still had over 600m of active heap memory. What was going on?

I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class. 

As it turns out, Lists.transform is a lazy function. Instead of applying the transformer to the list immediately and returning the results, you get back an object that acts like a list but only applies the transform on the objects as you retrieve them the first time. Which is great for saving a bit of time up front, but absolutely terrible if you're caching the result to save yourself memory. Now, to be fair, Guava tells you that this is lazy in the javadoc:
But not until you get to the third part of the doc, and we are even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result and cache that. Finally, problem solved.

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

ZooKeeper 3.4: Lessons Learned

After several months on the planning block, it looks like ZooKeeper 3.4 is finally almost ready to be released. (Edit: Hooray! As of 11/22, release 3.4 is available!) I can say with confidence that all of the committers for the project have learned a lot from the course of this release. And most of it is in the form of "ouch, lessons learned".

First lesson: Solidify your new feature set early.
Going through the Jira, the earliest new feature for the 3.4 release is the uplift of the ZAB protocol to ZAB1.0. No small feature, to be sure, we were still debugging minor issues with it through the very end stages of our 3.4 work. We also added multi transactions, kerberos support, a read-only zookeeper, netty, windows support for C, and certainly others I'm forgetting. Some of these features were pretty simple uplifts, but some of them caused us build instability for months and a great deal of distraction. Many of these were added as "just one more feature". But many other features were neglected because "we're almost ready for 3.4" (as it turned out, often not actually the case). If we had decided early what new major features we were pushing for with 3.4, we could have concentrated our efforts more effectively and delivered much sooner.

Second lesson: When it's time to push, push.
Giving birth requires a period of concentrated pushing. If you think you can push a little now, then put it off for a few days, then a bit now, then a few weeks off... the baby will never come, and neither will the release. It took several attempts before the community finally rallied behind the efforts to get a release out, and we ended up losing a lot of momentum in the process. We didn't have a solid and pre-agreed-upon features to know when we were done, so things just kept getting in the way. When the attention on the release was off, a minor bug or feature request would come in and it just seemed so small, what was the harm?

Third lesson: Prioritize as a community, and stick to those priorities
This falls in with setting up a feature list early, but it goes beyond that. Our community was split between those who were very interested in seeing 3.4 released, and those who were working on major new changes or refactorings against trunk. As a result we all ended up feeling shortchanged. Contributors with new features did not get the attention their features needed, and many still sit in unreviewed patch form. Users that were hungry for the 3.4 release were frustrated with our lack of attention to getting it out. We had some massive new refactoring efforts that continued to happen on trunk during the course of the release process, which resulted in a frustrated committer base stuck backporting or forwardporting patches between increasingly divergent branches. These efforts found bugs, but not without some cost. Having unclear priorities divided the community, caused some tension, and ultimately slowed the whole release process down.

Fourth lesson: You can always do more releases, it doesn't all have to happen now
This is perhaps my own biggest takeaway from this process. I wish we had done much less, done it much faster, and been willing to release a 3.4 that was quickly followed by 3.4.1, 3.5, etc, as needed. Proponents of agile development and release practices have a good point; the more often you release, the less there is to go wrong and the easier it will be to fix if and when it does. It becomes a self-fulfilling prophecy. We don't release frequently so people want to cram as many new features in as possible, which slows down the releases, which results in pushes for more new features, which results in more bugs and slowed down releases, and on and on.

These lessons may seem obvious in retrospect, but they came at the price of many people's time and effort. I'm proud of our community for pulling together in the end, but I also hope that 3.5 will be a different and less arduous journey.