JVM

War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend to Java backend services, and we took another big step down that road this week by moving all of the logic for filtering of our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good, we've saved almost 3s average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow FGCs.

The fix seemed fairly straightforward; take the cache, which was originally caching parameters mapped to objects, and instead just cache the object primary ids. So the project lead coded up the fix, and we pushed it out. 

Here's the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load test that again, and it looks ok for what our load tests are minimally worth, so we push it onto one of our prod boxes and give it a spin.

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly but surely the heap grew and grew, and garbage collected more and more. We took it out of the load balancer, forced a full GC, and it still had over 600m of active heap memory. What was going on?

I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class. 

As it turns out, Lists.transform is a lazy function. Instead of applying the transformer to the list immediately and returning the results, you get back an object that acts like a list but only applies the transform on the objects as you retrieve them the first time. Which is great for saving a bit of time up front, but absolutely terrible if you're caching the result to save yourself memory. Now, to be fair, Guava tells you that this is lazy in the javadoc:
But not until you get to the third part of the doc, and we are even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result and cache that. Finally, problem solved.

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Parallelism and the Limits of Languages

While we make heavy use of the core of Clojure, we don't use its concurrency primitives (atoms, refs, STM, etc.) because a function like pmap doesn't have enough fine grained control for our needs. We opt instead to build our own concurrency abstractions in Clojure on top of the outstanding java.util.concurrent package.

 Once upon a time, in the days when RAM was not quite as cheap as it is now, I worked on a team that had a giant in-memory cache of data that it used to do various financial calculations. This data had, with the growth of the financial markets it represented, grown to be too big to fit into the memory of a standard server without having GC absolutely wreck the process. Having exhausted all minor efforts to tune the data, and not wanting to buy an actual supercomputer, the team turned to Azul, who at that time sold boxes with enough memory to fit our process and a fairly nifty no-pause GC. Great! Except there was one catch: instead of having the super fast CPUs we were used to, it had hundreds of tiny little slow cores (with rather crappy fpus). And on these hundreds of tiny little cpus our code was very, very slow. So we set out to parallelize.

When you're forced into parallelism at that level of detail, you quickly realize how difficult it is to have a generic way of doing it. The fact that languages like scala and clojure will allow you to just say "apply this logic over the collection in parallel" is very cool, and a great way for developers to dip their toes in parallel processing of their collection data. But it breaks down quickly when you are talking about vastly different types of data running in vastly different types of computation. If you need to eke out every ounce of performance, you want every computation to be tuned to the right batch size and the right number of threads, and the size of the thread pool for task X may depend on whether it tends to run at the same time as task Y or task Z. Don't forget that task Z, if requested, must be prioritized above all else. The size of the batches themselves may depend on the type of object inside them, and how that affects the cache lines of the processor you're running on. Oh, and of course, you're not just tuning this for prod, it has to at least run to completion on your integration boxes and they are of course slightly different processors. And don't forget, your developers aren't writing code on Azul boxes, they're writing code on desktops with 2 cores each; threading even the test data to a pool of 50 doesn't make any sense, but they still need to run things in parallel or they'll never see the stupid threading bugs they added when they forgot to use concurrent data structures.

I've always thought that the next truly big leap in VMs and languages will be those that can do smart things with threading without making developers sweat too much. If we're really entering a world of more and more data over more but slower cores, we need a language and a VM that revolutionize our management of threads the way Java and the JVM have revolutionized our notion of memory management. The functional languages out there help, but you can't forget the underlying VM magic that will really make this possible. Look to the example of memory management: it's not Java itself that enables you to write code without worrying about deleting references only when they are truly not being used, but rather the VM that watches all objects and knows how to find the live ones and delete those that are no longer referenced, all without the developer needing to indicate a thing. For us to conquer the thread problem, we need a VM to observe the behavior of the system beyond just the number of cores and help us to appropriately occupy CPUs over collections of varying size and composition. Languages can force developers to think about applying logic over their data (and thus help with parallelism as a mindset), but they will never solve the problem of automated parallelism without the VM to tune it in the background.

There will always be people that need to tune their code down to exactly what instructions run on what core, just as there are still people that need to do their own memory management, but it shouldn't be a requirement for getting good mileage out of those cores. I can't wait for the day when calling submit feels as archaic as calling malloc.