War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend to Java backend services, and we took another big step down that road this week by moving all of the logic for filtering of our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good, we've saved almost 3s average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow FGCs.

The fix seemed fairly straightforward; take the cache, which was originally caching parameters mapped to objects, and instead just cache the object primary ids. So the project lead coded up the fix, and we pushed it out. 

Here's the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load test that again, and it looks ok for what our load tests are minimally worth, so we push it onto one of our prod boxes and give it a spin.

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly but surely the heap grew and grew, and garbage collected more and more. We took it out of the load balancer, forced a full GC, and it still had over 600m of active heap memory. What was going on?

I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class. 

As it turns out, Lists.transform is a lazy function. Instead of applying the transformer to the list immediately and returning the results, you get back an object that acts like a list but only applies the transform on the objects as you retrieve them the first time. Which is great for saving a bit of time up front, but absolutely terrible if you're caching the result to save yourself memory. Now, to be fair, Guava tells you that this is lazy in the javadoc:
But not until you get to the third part of the doc, and we are even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result and cache that. Finally, problem solved.

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Intuition, Effort, and Debugging Distributed Systems

I recently watched this great talk by Coda Hale, "The Programming Ape". It's heavily influenced by Thinking Fast and Slow, a book about cognitive processes and biases. One of the major points of the book, and the talk, is that we have two types of thinking: intuitive thinking, which is fast, easy, creative, and sloppy, and attention-based thinking, which is harder, but more accurate.

One of the great points that Coda makes in his talk is that most of the ways we do things in software development are very attention-heavy. At the most basic level, writing correct code requires a level of sustained attention that none of us possesses 100% of the time, which is why testing (particularly automated testing) is such an essential part of quality software development. Attention doesn't stop when you get the code into production, you still have the problem of monitoring, which often comes in the form of inscrutable charts or messages that take a lot of thought to parse. Automation helps here, but as anyone that has ever silenced a Nagios alert like a too-early alarm clock knows, the current state of automation has limits when it hits up against our attention.

By far the most attention-straining thing I do on a regular basis is debugging distributed systems. Debugging anything is a very attention-heavy process; even if you have good intuition about where the problem may lie, you still have to read the code, possibly step through it in a debugger or read through a log output and try to find the error. Debugging errors in the interaction between distributed systems is several times more difficult. A debugger is often of no help, at least not initially, because you have to get a series of events to happen in a particular way to trigger the bug. Identifying that series of events in most cases requires staring hard at a series of log files and/or system state dumps, and trying to piece together the ordering based on timestamps that may slightly differ between systems. I consider myself to be a very good debugger and it still took me a solid 4 hours of deep concentration, searching through and replaying transaction logs before I was able to crack through this particular bug. I would never hold the ZooKeeper code base up as a paragon of debugability, but what can we do to make this easier?

When you're writing a distributed system, think hard about what you log. This may be impossible to always get right, but so often the only way you have to find that bug is log files from around the time it happened. If you're going to reconstruct a series of events, you need evidence that those events happened, and you need to know when they happened. Should you rely on the clocks of the machines to line up enough to put the time series together, and should you fail the system if the clocks are too far apart? Since it's a distributed system, is there a way for all of the members of the system to agree on a clock that you could use for logging? As for the events themselves, it is important to be able to easily identify them, their particular behavior, and the state they are associated with (the session that made this request, for example).

One of the problems with ZooKeeper logs, for example, is that they don't do a great job of highlighting important events and state changes. Look at this, does it make your eyes glaze over immediately?

Events are hidden towards the ends of lines, in the middle of output (type:setData, type:create). Important identifiers are held in long hex strings like 0x773516a5076a0000, and it's hard to remember which server/connection they are associated with. To debug problems I have to rely on pen/paper or notepad records of what session id goes with what machine and what the actual series of events was on each of the quorum peers. Very little is scannable and it makes debugging errors a very tedious and attention-heavy process.

Ideally, we want to partially automate debugging. To do this, the logs have to be written in a form that an automated system could parse and reason about. Perhaps we should log everything as JSON. There's a tradeoff though, now a human debugger probably needs another tool to parse the log files at all. This might not be a bad thing. Insisting on basic text for logging leaves out the huge potential win of formatting that can draw the eye to important information in ways other than just text.

Are there tools out there now to aid in distributed debugging? A quick google search shows several scholarly papers and not much else, but I would guess that given the ever-increasing growth of distributed systems we'll see some real products in this area soon. In the meantime, we're stuck with our eyes and our attention, so we might as well think ahead about how we can work with our intuitive systems instead of against them.

Networking woes in Java

The only major CS subject I never took a class in was networking. It's kind of ridiculous, looking back, that I took as many systems classes as I did but always eschewed networking. I do own a copy of UNIX Network Programming: Networking APIs: Sockets and XTI; Volume 1, bought at some point in the past when I knew I was going to be doing some distributed systems work and figured it would be a useful reference. But I can't say it's been my constant companion. For I have learned one thing in my years of Java systems coding:

Networking code is HARD.

Here's exhibit A: ZooKeeper monitoring misuses sockets. I spent a good chunk of time desperately trying to figure out why my monitoring commands were crapping out halfway through when run from NY to LN. Turns out, you can't safely expect to just close half a socket, leave the other half open, push some data to it and then close it while seeing all the data through to the other side. Not without a final handshake indicating the client has gotten all the data. Or at least, I think that's the case. The thing is, this will work well enough over a very fast network connection or with very little data. The guarantees around so_linger etc change kernel to kernel and my reading at the time led me to think that in fact the standard linux kernel behavior in this case may well have changed over the years that ZooKeeper has been around. So we need to completely rip out and redo the monitoring code if we want to have any hope of this working right for other big, global deployments in the future.

Exhibit B is my current debugging nightmare. Part of our release last week involved a new backend Play service that itself connects to a different backend Play service to prepare results for our storefront. We noticed, several hours after launch, that the service started to throw exceptions that were ultimately caused by Too many open files. I know enough about Java to know that running out of file descriptors is often a Bad Thing. 

So we're leaking sockets. Why? To date, we don't know. The underlying libraries are async-http-client and netty, but there's very little to indicate what is going on.1 The sockets show up in netstat/lsof as ESTABLISHED TCP connections to the various storefront servers. But the storefront servers do not have most of these sockets open on their end. How are they ESTABLISHED with no partner? It's an ongoing mystery, one that we haven't been able to reproduce on any other machine (the current theory is bad network hardware/software at the lowest levels, but honestly that's just a shot in the dark and one that we can't verify without taking down a production service).

So, while I keep debugging, what are the takeaways here?

1) You shouldn't write your own socket handling code in Java. Really, no. Don't do it. Use Netty. It's very good. Of all the things not to reinvent yourself, I would put networking at the top of the list with a bullet. It's hard, and requires the kind of deep expertise that you can't fake. And, when you fake it, you may end up with something like our ZooKeeper monitoring, that seems to work for years while hiding small but significant bugs.

2) If you're a system architect writing any kind of web services/distributed system architecture, you should know your unix socket monitoring commands. lsof is obtuse but powerful. netstat is simpler and still quite useful. This article has a few others, like ss and iftop. Know how to up the ulimits for your processes in case you find yourself with a slow socket leak that you need time to debug.

Have an idea what my bug is? I'd love to hear it! Leave me a comment or hit me up on twitter!

Edit 2/27: Looks like our bug was indeed on the cloud vendor side; possibly a misconfigured firewall. Moving to a new box and rebuilding the box we were on solved it.

1 Thank God Play is at least using good networking libraries, because the last time I tested ZooKeeper, when it runs out of sockets the service hard fails with almost no indication of what happened.