War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend to Java backend services, and we took another big step down that road this week by moving all of the logic for filtering our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good: we've saved almost 3s of average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow FGCs.

The fix seemed fairly straightforward: take the cache, which was originally caching parameters mapped to objects, and instead just cache the objects' primary ids. So the project lead coded up the fix, and we pushed it out.

Here's the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load tested it again, and it looked ok for what little our load tests are worth, so we pushed it onto one of our prod boxes and gave it a spin.

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly and surely the heap grew and grew, and it garbage collected more and more. We took it out of the load balancer, forced a full GC, and it still had over 600 MB of active heap memory. What was going on?

I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class. 

As it turns out, Lists.transform is lazy. Instead of applying the transformer to the list immediately and returning the results, it gives you back a view that acts like a list but applies the function each time you retrieve an element, and that holds on to both the original list and the function for as long as the view is alive. That is great for saving a bit of time up front, but absolutely terrible if you're caching the result to save yourself memory: every cached entry was still pinning the full objects we meant to stop caching. Now, to be fair, Guava tells you in the javadoc that this is lazy, but not until you get to the third part of the doc, and we are even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result and cache that. Finally, problem solved.
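
The retention problem is easy to reproduce without Guava. What follows is a minimal sketch, not Guava's actual code: LazyView is a hypothetical stand-in for the kind of list view a lazy transform hands back, and eagerCopy plays the role that copying into a fresh list played in our fix. The product strings and the id function are made up for illustration.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class LazyViewDemo {

    /** Hypothetical stand-in for the view a lazy list transform returns. */
    static final class LazyView<F, T> extends AbstractList<T> {
        final List<F> source;           // keeps every full source object reachable
        final Function<F, T> function;  // retained for as long as the view lives

        LazyView(List<F> source, Function<F, T> function) {
            this.source = source;
            this.function = function;
        }

        @Override
        public T get(int index) {
            // Applied on every access; nothing is memoized.
            return function.apply(source.get(index));
        }

        @Override
        public int size() {
            return source.size();
        }
    }

    /** Runs the function once per element and keeps only the results. */
    static <F, T> List<T> eagerCopy(List<F> source, Function<F, T> function) {
        List<T> view = new LazyView<>(source, function);
        return new ArrayList<>(view); // the copy has no reference back to source
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        List<String> products = Arrays.asList("product-1", "product-2", "product-3");
        Function<String, Integer> idOf = p -> {
            calls.incrementAndGet();
            return Integer.parseInt(p.substring(p.indexOf('-') + 1));
        };

        List<Integer> view = new LazyView<>(products, idOf);
        view.get(0);
        view.get(0);
        System.out.println("function calls after two reads of the view: " + calls.get());

        calls.set(0);
        List<Integer> copy = eagerCopy(products, idOf);
        copy.get(0);
        copy.get(0);
        System.out.println("function calls after copying, then two reads: " + calls.get());
    }
}
```

Putting the view into a cache keeps source, and everything it points to, alive; putting the copy into the cache keeps only the ids.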

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Java console monitoring basics: The "j" series

Think fast: It's 10pm, you have a production Java application on a box you can only ssh into, and it's in distress. This is the third time this month it's happened. You didn't write the code, and the joker who did didn't bother to put in any metrics for you to grab. What do you do? After cursing, but before giving up, restarting the process, and promising to debug it in the morning, you might want to go a round with the Java command-line monitoring tools.

Perhaps you are already familiar with these tools, but I've found that despite their incredible usefulness many seasoned Java developers have never heard of this tool stack. The tools hide away in the JDK bin directory, but they can be your best friends when you are stuck with nothing but a console and a prayer.

First we have the lowly jps. It does what you would expect: shows you java processes running as that user (or all users, if you're root).

Moving up the stack, we have jinfo. Show me the vm version, jars, and all the flags this process is running with, or the value of a particular flag. You may have this information elsewhere, but it's nice to have a shortcut.

More useful is jstack. Yeah, I'm sure you know how to kill -QUIT but this is a nice way to teach the newbie how to get a stack trace without the risk that they'll accidentally kill the process. If you should be so lucky as to have an obvious deadlock, jstack will kindly point that out to you so you can go fix it. Stack traces are the bread and butter of figuring out what's wrong with a process. Take several, see how they change, or don't, and you'll come closer to finding your problem code.
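
If you do get the chance to add code to the troubled application later, the same kind of information is available from inside the JVM through the standard java.lang.management API. The sketch below is an in-process approximation, not a replacement for jstack; the class and method names are my own, and it assumes a JVM that supports monitoring of ownable synchronizers, which findDeadlockedThreads needs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class ThreadDumpSketch {

    /** Names of threads the JVM reports as deadlocked; empty when there are none. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when nothing is deadlocked
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have exited in the meantime
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    /** A plain-text dump of every live thread with its state and stack frames. */
    static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadNames());
        System.out.print(dumpAllThreads());
    }
}
```

As with jstack, one dump tells you little; take several a few seconds apart and compare them.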

My personal all-time favorite is jstat. More specifically, jstat -gc <pid> 3s. That is, jstat of the garbage collector for the given process, printing new results every 3 seconds. Back in the days when I wrangled gigantic VMs, this tool was invaluable for spot-checking garbage collection. The output is admittedly hideous. For example:

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT
37568.0 41984.0 6592.0  0.0   74816.0  38651.6   70272.0    39290.5   63616.0 55956.0     38    1.608   4      1.441    3.049

Yes, that unformatted barf is what you will see on the console from jstat -gc. But that barf tells you a lot. First things first, the most useful stuff is at the very end. See that "4" followed by "1.441"? Those are the FGC and FGCT columns: the number of stop-the-world full collections, and the total time they have taken, in seconds. If your application is running particularly slowly, or is frequently unresponsive, the FGC count will quite possibly be high, and the time related to it will be very high. Remember, your app is essentially dead while a full GC is running, so a high number of FGCs is a bad sign.
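
If you end up capturing this output to a file, it is easy to pull the interesting columns back out. Here is a minimal sketch that pairs a header line with a value line and reads FGC and FGCT from the result; the class name is my own, and it assumes every column is a plain number, as in the sample above. The set of columns changes between JVM versions (Java 8 and later report metaspace columns in place of PC and PU), so the code looks columns up by name rather than by position.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JstatGcLine {

    /** Pairs each header column with the value in the same position. */
    static Map<String, Double> parse(String headerLine, String valueLine) {
        String[] names = headerLine.trim().split("\\s+");
        String[] values = valueLine.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                "expected " + names.length + " values but found " + values.length);
        }
        Map<String, Double> row = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            row.put(names[i], Double.parseDouble(values[i]));
        }
        return row;
    }

    public static void main(String[] args) {
        // The sample output from the post.
        String header = " S0C    S1C    S0U    S1U      EC       EU        OC         OU"
            + "       PC     PU    YGC     YGCT    FGC    FGCT     GCT";
        String values = "37568.0 41984.0 6592.0  0.0   74816.0  38651.6   70272.0    39290.5"
            + "   63616.0 55956.0     38    1.608   4      1.441    3.049";

        Map<String, Double> row = parse(header, values);
        double fullGcs = row.get("FGC");
        double fullGcSeconds = row.get("FGCT");
        double averagePause = fullGcs == 0 ? 0 : fullGcSeconds / fullGcs;

        System.out.printf("full GCs: %.0f, total: %.3fs, average: %.3fs%n",
            fullGcs, fullGcSeconds, averagePause);
        System.out.printf("old generation in use: %.1f%%%n",
            100.0 * row.get("OU") / row.get("OC"));
    }
}
```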

This also shows you the various used and total sizes of the generations. I'm not going to go into details of Java Garbage Collection, but it is useful to be able to see them broken out and growing or shrinking. One pattern of GC to watch out for is the case when you don't have enough survivor/eden space to handle all of the transient data you need to do a big chunk of work, and not enough old capacity to take it all, but just enough freeable state to keep slowly moving forward with your computation. The result will be never-ending full GCs while your process moves slightly forward, does a full GC, moves forward again, does a full GC, and moves forward again. In jstat this looks like an ever-increasing FGC count where each collection finishes but frees only a small amount of space in OU, while EU and the survivor spaces are constantly near-full. The details of the actual GC behavior may change with versions of the JVM (and the flags that you are using), but the ability to monitor the behavior easily in real time is always useful.
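
That pattern can be written down as a rough check over successive jstat samples. The sketch below is only an illustration of the idea: the Sample class, the method name, and the "near full" ratio are all my own assumptions, and for brevity it looks only at FGC and old-generation occupancy, not at eden or the survivor spaces.

```java
import java.util.Arrays;
import java.util.List;

class FullGcThrashCheck {

    /** One jstat -gc reading, reduced to the columns this check needs. */
    static final class Sample {
        final long fullGcCount;      // FGC
        final double oldUsedKb;      // OU
        final double oldCapacityKb;  // OC

        Sample(long fullGcCount, double oldUsedKb, double oldCapacityKb) {
            this.fullGcCount = fullGcCount;
            this.oldUsedKb = oldUsedKb;
            this.oldCapacityKb = oldCapacityKb;
        }
    }

    /**
     * True when every interval saw at least one more full GC and the old
     * generation was still at or above nearFullRatio afterwards.
     */
    static boolean looksLikeThrashing(List<Sample> samples, double nearFullRatio) {
        if (samples.size() < 2) {
            return false; // nothing to compare yet
        }
        for (int i = 1; i < samples.size(); i++) {
            Sample previous = samples.get(i - 1);
            Sample current = samples.get(i);
            boolean anotherFullGc = current.fullGcCount > previous.fullGcCount;
            boolean stillNearlyFull =
                current.oldUsedKb / current.oldCapacityKb >= nearFullRatio;
            if (!anotherFullGc || !stillNearlyFull) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up readings for illustration, not measurements.
        List<Sample> samples = Arrays.asList(
            new Sample(10, 69000, 70272),
            new Sample(11, 68500, 70272),
            new Sample(12, 69100, 70272));
        System.out.println("thrashing? " + looksLikeThrashing(samples, 0.9));
    }
}
```

The 0.9 ratio is arbitrary; what counts as "barely freed anything" depends on your heap size and collector.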

Finally, among the tools I usually use, there's jmap. jmap -heap is a nice way to get a pretty-print of the heap info. jmap -histo and jmap -dump are heavier commands that you might want to hold off on until you're ready to restart your process anyway, because running them can pause the process and sometimes leaves it in a bad state. If you're producing a ton of garbage and you don't know why, jmap can show you where the memory is going. jmap -dump will produce a file that you can push into something like MAT for analysis.

None of these tools is a substitute for proper monitoring of your JVM, but they're great for a quick and dirty debug before you restart a troubled process, and something to make sure your whole team has in their toolbox.