Intuition, Effort, and Debugging Distributed Systems

I recently watched this great talk by Coda Hale, "The Programming Ape". It's heavily influenced by Thinking, Fast and Slow, a book about cognitive processes and biases. One of the major points of the book, and the talk, is that we have two types of thinking: intuitive thinking, which is fast, easy, creative, and sloppy, and attention-based thinking, which is harder and slower but more accurate.

One of the great points that Coda makes in his talk is that most of the ways we do things in software development are very attention-heavy. At the most basic level, writing correct code requires a level of sustained attention that none of us possesses 100% of the time, which is why testing (particularly automated testing) is such an essential part of quality software development. The demand for attention doesn't stop when you get the code into production. You still have the problem of monitoring, which often comes in the form of inscrutable charts or messages that take a lot of thought to parse. Automation helps here, but as anyone who has ever silenced a Nagios alert like a too-early alarm clock knows, the current state of automation has limits when it runs up against our attention.

By far the most attention-straining thing I do on a regular basis is debugging distributed systems. Debugging anything is a very attention-heavy process; even if you have good intuition about where the problem may lie, you still have to read the code, possibly step through it in a debugger, or read through log output to find the error. Debugging errors in the interaction between distributed systems is several times more difficult. A debugger is often of no help, at least not initially, because you have to get a series of events to happen in a particular order to trigger the bug. Identifying that series of events usually requires staring hard at log files and/or system state dumps, and trying to piece together the ordering based on timestamps that may differ slightly between systems. I consider myself to be a very good debugger, and it still took me a solid four hours of deep concentration, searching through and replaying transaction logs, before I was able to crack one particular ZooKeeper bug. I would never hold the ZooKeeper code base up as a paragon of debuggability, but what can we do to make this kind of debugging easier?

When you're writing a distributed system, think hard about what you log. This may be impossible to always get right, but often the only evidence you have for finding a bug is the log files from around the time it happened. If you're going to reconstruct a series of events, you need evidence that those events happened, and you need to know when they happened. Should you rely on the clocks of the machines to line up well enough to put the time series together, and should you fail the system if the clocks are too far apart? Since it's a distributed system, is there a way for all of the members of the system to agree on a clock that you could use for logging? As for the events themselves, it is important to be able to easily identify them, their particular behavior, and the state they are associated with (the session that made this request, for example).
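One possible answer to the "agree on a clock" question is a logical clock rather than a wall clock. What follows is a minimal sketch of a Lamport-style counter used to stamp log lines. It is not something ZooKeeper does in its logs as far as this post is concerned, and the class name, node names, event names, and line format are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: a counter that orders events without wall-clock time."""
    node: str          # hypothetical node name, used only for labeling lines
    counter: int = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter: int) -> int:
        """Merge the counter carried on an incoming message, then advance."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter

    def log(self, event: str, **state) -> str:
        """Format a log line stamped with the logical time and node name."""
        stamp = self.tick()
        details = " ".join(f"{k}={v}" for k, v in sorted(state.items()))
        return f"[{stamp:06d} {self.node}] {event} {details}".rstrip()


a = LamportClock("node-a")
b = LamportClock("node-b")

line1 = a.log("REQUEST_SENT", session="s1")     # a's counter becomes 1
b.receive(a.counter)                            # b's counter becomes 2
line2 = b.log("REQUEST_APPLIED", session="s1")  # b's counter becomes 3
```

Because the receiver always jumps past the sender's counter, an event that was caused by another event gets a larger stamp, so sorting the combined logs by stamp never puts an effect before its cause. Unrelated events on different machines can still tie or interleave arbitrarily, which is the price of not trusting the wall clocks.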

One of the problems with ZooKeeper logs, for example, is that they don't do a great job of highlighting important events and state changes. Look at this. Does it make your eyes glaze over immediately?

Events are hidden towards the ends of lines, in the middle of output (type:setData, type:create). Important identifiers are held in long hex strings like 0x773516a5076a0000, and it's hard to remember which server or connection they are associated with. To debug problems I have to rely on pen and paper or notepad records of which session ID goes with which machine and what the actual series of events was on each of the quorum peers. Very little is scannable, which makes debugging a tedious and attention-heavy process.
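The pen-and-paper bookkeeping is the kind of thing a small script could take over. Here is a rough sketch that swaps each distinct long hex identifier for a short alias, so the same session is easy to spot across lines. The regular expression, the function name, and the sample lines are assumptions for illustration; the sample lines are made up and do not reproduce ZooKeeper's real log format.

```python
import re

# Assumption: identifiers worth aliasing look like 0x followed by 8+ hex digits.
HEX_ID = re.compile(r"0x[0-9a-fA-F]{8,}")


def alias_ids(lines, prefix="S"):
    """Replace each distinct long hex id with a short alias such as S1, S2."""
    aliases = {}

    def substitute(match):
        hex_id = match.group(0)
        if hex_id not in aliases:
            # Aliases are numbered in order of first appearance.
            aliases[hex_id] = f"{prefix}{len(aliases) + 1}"
        return aliases[hex_id]

    rewritten = [HEX_ID.sub(substitute, line) for line in lines]
    return rewritten, aliases


# Made-up lines for illustration only; not ZooKeeper's actual output.
sample = [
    "session 0x773516a5076a0000 type:create",
    "session 0x773516a5076a0001 type:setData",
    "session 0x773516a5076a0000 type:setData",
]
rewritten, aliases = alias_ids(sample)
```

The returned `aliases` dictionary is the notepad page: it records which short name stands for which hex string, and it could be extended by hand with the machine each session belongs to.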

Ideally, we want to partially automate debugging. To do this, the logs have to be written in a form that an automated system can parse and reason about. Perhaps we should log everything as JSON. There's a tradeoff, though: a human debugger now probably needs another tool to read the log files at all. This might not be a bad thing. Insisting on basic text for logging gives up the huge potential win of formatting that can draw the eye to important information in ways other than just text.
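To make the JSON idea concrete, here is a small sketch of what both halves might look like: one function that writes an event as a JSON line, and a tiny reader that merges lines from several machines and renders them for a human. The field names (`node`, `seq`, `event`), the node names, and the paths are all hypothetical, and `seq` is assumed to be some ordering value the system already agrees on, such as a logical clock.

```python
import json


def log_event(node, seq, event, **state):
    """Serialize one event as a single JSON line."""
    record = {"node": node, "seq": seq, "event": event, **state}
    return json.dumps(record, sort_keys=True)


def merge_logs(*logs):
    """Parse JSON lines from several nodes and order them by (seq, node)."""
    records = [json.loads(line) for log in logs for line in log]
    return sorted(records, key=lambda r: (r["seq"], r["node"]))


def render(record):
    """Render a record for a human, with the event name up front in caps."""
    extras = {k: v for k, v in record.items()
              if k not in ("node", "seq", "event")}
    details = " ".join(f"{k}={v}" for k, v in sorted(extras.items()))
    head = f"{record['seq']:>4} {record['node']:<8} {record['event'].upper():<10}"
    return f"{head} {details}".rstrip()


# Made-up events from two hypothetical nodes.
node_a = [log_event("node-a", 1, "create", session="s1", path="/lock"),
          log_event("node-a", 4, "setData", session="s1", path="/lock")]
node_b = [log_event("node-b", 2, "create", session="s2", path="/lock")]

timeline = [render(r) for r in merge_logs(node_a, node_b)]
```

The point of the split is that the file on disk is written for machines and the rendering is written for eyes. Once the events are structured, the viewer is free to put the event type first, color it, or filter by session, none of which requires changing what the servers log.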

Are there tools out there now to aid in distributed debugging? A quick Google search shows several scholarly papers and not much else, but I would guess that, given the ever-increasing growth of distributed systems, we'll see some real products in this area soon. In the meantime, we're stuck with our eyes and our attention, so we might as well think ahead about how we can work with our intuitive systems instead of against them.