ZooKeeper and the Distributed Operating System

From the draft folder, this sat moldering for a few months. Since I think the topic of "distributed coordination as a library/OS fundamental" has flared up a bit in recent conversations, I present this without further editing.

I'm preparing to give a talk at Ricon East about ZooKeeper, and have been thinking a lot of what to cover in this talk. The conference is focused on distributed systems at a somewhat advanced level, so I'm thinking about topics that expand beyond the basics of "what is ZooKeeper" and "how do you use it." After polling twitter and getting some great feedback I've decided to focus on the question that many architects face: When should I use ZooKeeper, and when is it overkill?

This topic is interesting to me in many ways. In my current job as VP of Architecture at Rent the Runway, we do not yet use ZooKeeper. There are things that we could use it for, but in our world most of the distributed computing we do is pure horizontally scalable web services. We're not yet building out complex networks of servers with different roles that need to be centrally configured, managed, or monitored beyond what you can easily do with simple load balancers and nagios. And many of the questions I answer on the ZooKeeper mailing list are those that start with "can ZK do this?" The answer that I prefer to give is almost always "yes, but keep these things in mind before you roll it out for this purpose." So that is what I want to dig into more in my talk.

I've been digging into a lot of the details of ZAB, Paxos, and distributed coordination in general as part of the talk prep, and hit on an interesting thought: What is the role of ZooKeeper in the world of distributed computing? You can see a very clear breakdown right now in major distributed systems out there. There are those that are full platforms for certain types of distributed computing: the Hadoop ecosystem, Storm, Solr, Kafka, that all use ZooKeeper as a service to provide key points of correctness and coordination that must have higher transactional guarantees than these systems want to build intrinsically into their own key logic. Then there are the systems, mostly distributed databases, that implement their own coordination logic: MongoDB, Riak, Cassandra, to name a few. This coordination logic often makes different compromises than a true independent Paxos/ZAB implementation would make; for an interesting conversation check out a Cassandra ticket on the topic.

In thinking about why you would want to use a standard service-type system vs implementing your own internal logic, it reminds me very much of the difference between modern SQL databases and the rest of the application world. The best RDBMSs are highly tuned beasts. They cut out the middleman as much as possible, taking over functionality from the OS and filesystem as it suits them to get absolutely the best performance for their workload. This makes sense. The competitive edge to the product they are selling is its performance under a very well-defined standard of operation (SQL with ACID guarantees), as well as ease of operation. And in the new world of distributed databases, owning exactly the logic for distributed coordination (and understanding where that logic falls apart in the specific use cases for that system) will very likely be a competitive edge for a distributed database looking to gain a larger customer base. After all, installing and administering one type of thing (the database itself) is by definition simpler than installing and administering 2 things (the database plus something like ZooKeeper). It makes sense to prefer to burn your own developer dollars to engineer around the edge cases, so as to make a simpler product for your customers.

But ignoring the highly tuned commercial case of distributed databases, I think that ZooKeeper, or a service like it, is a necessary core component of the "operating system" for distributed computing. It does not make sense for most systems to implement their own distributed coordination, any more than it makes sense to implement your own file system to run your RESTful web app. Remember, to do distributed coordination successfully requires more than just, say, a client library that perfectly implements Paxos. Even with such a library, you would need to design your application up-front to think about high availability. You need to deploy it from the beginning with enough servers to make a sane quorum. You need to think about how the rest of the functioning of your application (say, garbage collection, startup/shutdown conditions, misbehavior) will affect the functioning of your coordination layer. And for most of us, it doesn't make sense to do that up-front. Even the developers at Google didn't always think in such terms, the original Chubby paper from 2006 mentions most of these reasons as driving the decision to create a service rather than a client library.

Love it or hate it, ZooKeeper or a service like it is probably going to be a core component of most complex distributed system deployments for the foreseeable future. Which is all the more reason to get involved and help us make it better.

Building a Global, Highly Available Service Discovery Infrastructure with ZooKeeper

This is the written version of a presentation I made at the ZooKeeper Users Meetup at Strata/Hadoop World in October, 2012 (slides available here). This writeup expects some knowledge of ZooKeeper.

The Problem:
Create a "dynamic discovery" service for a global company. This allows servers to be found by clients until they are shut down, remove their advertisement, or lose their network connectivity, at which point they are automatically de-registered and can no longer be discovered by clients. ZooKeeper ephemeral nodes are used to hold these service advertisements, because they will automatically be removed when the ZooKeeper client that made the node is closed or stops responding.

This service should be available globally, with expected "service advertisers" (servers advertising their availability, aka, writers) able to scale to the thousands, and "service clients" (servers looking for available services, aka, readers) able to scale to the tens of thousands. Both readers and writers may exist in any of three global regions: New York, London, or Asia. Each region has two datacenters with a fat pipe between them, and each region is connected to each other region, but these connections are much slower and less tolerant for piping large quantities of data.

This service should be able to withstand the loss of any one entire data center.

As creators of the infrastructure, we control the client that connects to this service. While this client wraps the ZooKeeper client, it does not have to support all of the ZooKeeper functionality.

Implications and Discoveries:
ZooKeeper requires a majority (n/2 + 1) of servers to be available and able to communicate with each other in order to form a quorum, and thus you cannot split a quorum across two data centers and guarantee that the quorum will be available with the loss of any one data center (because at least one data center will fail to have a pure majority of servers). To sustain the loss of a datacenter therefore you must split your cluster across 3 data centers.

Write speed dramatically decreases when the quorum must wait for votes to travel over the WAN. We also want to limit the number of heartbeats that must travel across the WAN. This means that both a ZooKeeper cluster with nodes spread across the globe is undesirable (due to write speed), and a ZooKeeper cluster with members only in one region is also undesirable (because writing clients outside of that region would have to continue to heartbeat over the WAN). Even if we decided to have a cluster in only one region, we would have to solve the problem that no region has more than 2 data centers, and we need 3 data centers to handle the loss/network partition of an entire data center.

Create 3 regional clusters to support discovery for each region. Each cluster has N-1 nodes split across the 2 local data centers, with the final node in the nearest remote data center.

By splitting the nodes this way, we guarantee that there is always availability if any one data center is lost or partitioned from the rest of the data centers. We also minimize the affects of the WAN on write speed by ensuring that the remote quorum member is never made into the leader node, and the general effect of the majority of nodes being local means that voting can complete (thus allowing writes to finish) without waiting for the vote from the WAN node in normal operating conditions.

3 Separate Global Clusters, One Global Service:
Having 3 separate global clusters works well for infrastructural reasons mentioned above, but it has the potential to be a headache for the users of the service. They want to be able to easily advertise their availability, and discover available servers preferably by those servers available first in their local region, and secondly in other remote regions if no local servers are available.

To do this, we wrapped our ZooKeeper client in such a way as to support the following paradigm:
Advertise Locally
Lookup Globally

Operations requiring a continuous connection to the ZooKeeper, such as advertise (which writes an ephemeral node) or watch are only allowed on the local discovery cluster. Using a virtual IP address we automatically route connections to the discovery service address of the local ZooKeeper cluster and write our ephemeral node advertisement here.

Lookups do not require a continuous connection to the ZooKeeper, and so we can support global lookups. Using the same virtual IP address we can connect to the local cluster to find local servers, and failing that use a deterministic fallback to remote ZooKeeper clusters to discover remote servers. The wrapped ZooKeeper client will automatically close its connection to the remote clusters after a period of client inactivity, so as to limit WAN heartbeat activity.

Lessons learned:
ZooKeeper as a Service (a shared ZooKeeper cluster maintained by a centralized infrastructure team to support many different clients) is a risky proposition. It is easy for a misbehaving client to take down an entire cluster by flooding it with requests or making too many connections and without a working hard quota enforcement system clients can easily push too much data into ZooKeeper. Since ZooKeeper keeps all of its nodes in memory, a client writing huge numbers of nodes with a lot of data in each can cause ZooKeeper to garbage collect or run out of memory, bringing down the entire cluster.

ZooKeeper has a few hard limits. Memory is a well-known limit, but another limit is the number of sockets for a server process (configured via the ulimit in *nix). If a node runs out of sockets due to too many client connections, it will basically cease to function without necessarily crashing. This is not surprising for anyone that has experienced this problem in other Java servers, but it is worth noting when scaling your cluster.

Folks using ZooKeeper to do this sort of dynamic discovery platform should note that if the services you are advertising are Java services, a long full GC pause can cause their session to the ZooKeeper cluster to time out and thus their advertisement will be deleted. This is generally probably a good thing, because a server that is doing a long-running full GC won't respond to client requests to connect, but it can be surprising if you are not expecting it.

Finally, I often get the question of how to set the heartbeats, timeouts, etc, to optimize a ZooKeeper cluster, and the answer is really that it depends on your network. I really recommend playing with Patrick Hunt's zk-smoketest in your data centers to figure out sensible limits for your cluster.

Intuition, Effort, and Debugging Distributed Systems

I recently watched this great talk by Coda Hale, "The Programming Ape". It's heavily influenced by Thinking Fast and Slow, a book about cognitive processes and biases. One of the major points of the book, and the talk, is that we have two types of thinking: intuitive thinking, which is fast, easy, creative, and sloppy, and attention-based thinking, which is harder, but more accurate.

One of the great points that Coda makes in his talk is that most of the ways we do things in software development are very attention-heavy. At the most basic level, writing correct code requires a level of sustained attention that none of us possesses 100% of the time, which is why testing (particularly automated testing) is such an essential part of quality software development. Attention doesn't stop when you get the code into production, you still have the problem of monitoring, which often comes in the form of inscrutable charts or messages that take a lot of thought to parse. Automation helps here, but as anyone that has ever silenced a Nagios alert like a too-early alarm clock knows, the current state of automation has limits when it hits up against our attention.

By far the most attention-straining thing I do on a regular basis is debugging distributed systems. Debugging anything is a very attention-heavy process; even if you have good intuition about where the problem may lie, you still have to read the code, possibly step through it in a debugger or read through a log output and try to find the error. Debugging errors in the interaction between distributed systems is several times more difficult. A debugger is often of no help, at least not initially, because you have to get a series of events to happen in a particular way to trigger the bug. Identifying that series of events in most cases requires staring hard at a series of log files and/or system state dumps, and trying to piece together the ordering based on timestamps that may slightly differ between systems. I consider myself to be a very good debugger and it still took me a solid 4 hours of deep concentration, searching through and replaying transaction logs before I was able to crack through this particular bug. I would never hold the ZooKeeper code base up as a paragon of debugability, but what can we do to make this easier?

When you're writing a distributed system, think hard about what you log. This may be impossible to always get right, but so often the only way you have to find that bug is log files from around the time it happened. If you're going to reconstruct a series of events, you need evidence that those events happened, and you need to know when they happened. Should you rely on the clocks of the machines to line up enough to put the time series together, and should you fail the system if the clocks are too far apart? Since it's a distributed system, is there a way for all of the members of the system to agree on a clock that you could use for logging? As for the events themselves, it is important to be able to easily identify them, their particular behavior, and the state they are associated with (the session that made this request, for example).

One of the problems with ZooKeeper logs, for example, is that they don't do a great job of highlighting important events and state changes. Look at this, does it make your eyes glaze over immediately?

Events are hidden towards the ends of lines, in the middle of output (type:setData, type:create). Important identifiers are held in long hex strings like 0x773516a5076a0000, and it's hard to remember which server/connection they are associated with. To debug problems I have to rely on pen/paper or notepad records of what session id goes with what machine and what the actual series of events was on each of the quorum peers. Very little is scannable and it makes debugging errors a very tedious and attention-heavy process.

Ideally, we want to partially automate debugging. To do this, the logs have to be written in a form that an automated system could parse and reason about. Perhaps we should log everything as JSON. There's a tradeoff though, now a human debugger probably needs another tool to parse the log files at all. This might not be a bad thing. Insisting on basic text for logging leaves out the huge potential win of formatting that can draw the eye to important information in ways other than just text.

Are there tools out there now to aid in distributed debugging? A quick google search shows several scholarly papers and not much else, but I would guess that given the ever-increasing growth of distributed systems we'll see some real products in this area soon. In the meantime, we're stuck with our eyes and our attention, so we might as well think ahead about how we can work with our intuitive systems instead of against them.

Networking woes in Java

The only major CS subject I never took a class in was networking. It's kind of ridiculous, looking back, that I took as many systems classes as I did but always eschewed networking. I do own a copy of UNIX Network Programming: Networking APIs: Sockets and XTI; Volume 1, bought at some point in the past when I knew I was going to be doing some distributed systems work and figured it would be a useful reference. But I can't say it's been my constant companion. For I have learned one thing in my years of Java systems coding:

Networking code is HARD.

Here's exhibit A: ZooKeeper monitoring misuses sockets. I spent a good chunk of time desperately trying to figure out why my monitoring commands were crapping out halfway through when run from NY to LN. Turns out, you can't safely expect to just close half a socket, leave the other half open, push some data to it and then close it while seeing all the data through to the other side. Not without a final handshake indicating the client has gotten all the data. Or at least, I think that's the case. The thing is, this will work well enough over a very fast network connection or with very little data. The guarantees around so_linger etc change kernel to kernel and my reading at the time led me to think that in fact the standard linux kernel behavior in this case may well have changed over the years that ZooKeeper has been around. So we need to completely rip out and redo the monitoring code if we want to have any hope of this working right for other big, global deployments in the future.

Exhibit B is my current debugging nightmare. Part of our release last week involved a new backend Play service that itself connects to a different backend Play service to prepare results for our storefront. We noticed, several hours after launch, that the service started to throw exceptions that were ultimately caused by java.io.IOException: Too many open files. I know enough about Java to know that running out of file descriptors is often a Bad Thing. 

So we're leaking sockets. Why? To date, we don't know. The underlying libraries are async-http-client and netty, but there's very little to indicate what is going on.1 The sockets show up in netstat/lsof as ESTABLISHED TCP connections to the various storefront servers. But the storefront servers do not have most of these sockets open on their end. How are they ESTABLISHED with no partner? It's an ongoing mystery, one that we haven't been able to reproduce on any other machine (the current theory is bad network hardware/software at the lowest levels, but honestly that's just a shot in the dark and one that we can't verify without taking down a production service).

So, while I keep debugging, what are the takeaways here?

1) You shouldn't write your own socket handling code in Java. Really, no. Don't do it. Use Netty. It's very good. Of all the things not to reinvent yourself, I would put networking at the top of the list with a bullet. It's hard, and requires the kind of deep expertise that you can't fake. And, when you fake it, you may end up with something like our ZooKeeper monitoring, that seems to work for years while hiding small but significant bugs.

2) If you're a system architect writing any kind of web services/distributed system architecture, you should know your unix socket monitoring commands. lsof is obtuse but powerful. netstat is simpler and still quite useful. This article has a few others, like ss and iftop. Know how to up the ulimits for your processes in case you find yourself with a slow socket leak that you need time to debug.

Have an idea what my bug is? I'd love to hear it! Leave me a comment or hit me up on twitter!

Edit 2/27: Looks like our bug was indeed on the cloud vendor side; possibly a misconfigured firewall. Moving to a new box and rebuilding the box we were on solved it.

1 Thank God Play is at least using good networking libraries, because the last time I tested ZooKeeper, when it runs out of sockets the service hard fails with almost no indication of what happened. 

2011: My Year of Open Source

2011 saw a lot of big events in my life. I got hip surgery early in the year. I found myself thinking of leaving my job of six plus years in the summer. I actually left that job in the fall, and took the big leap into startup land in November. But when I think of 2011, I think the biggest changing influence for me was my entry into the world of open source.

I would call my evolution as a developer a three phase project. First, getting all the fundamentals of computing beaten into me in undergrad and graduate school. Second, learning how to be productive in the working world, the gritty details of actually producing production code and solving problems that sometimes are purely technical but often are a matter of orchestration, attention to detail, and engineering. Finally, combining these two aspects, and putting these talents to use in something that touches developers all over the world.

I fell into the ZooKeeper community by happy chance. I had been given a project to implement a company-wide dynamic discovery service. The developers that had come before me had found ZooKeeper, but had the luxury of implementing a solution that didn't have to scale to the volume and geographic diversity of the whole firm. I had requirements for global scaling and entitlements that didn't seem to be common in the ZooKeeper community at that point, and so I was forced to do more than just comb over the documentation to design my system. I cracked open the source code and got to work learning how it really worked.

My first bug was a simple fix to the way failed authentication was communicated to the Java client library. I had to get approvals almost up to the CTO level to be able to participate, but it was worth the effort. Quickly, I started feeling more responsibility to the community. I was, after all, relying on this piece of software to give me a globally available 24/7 system, and I wanted to be able to support clients where downtime could mean trading losses in the millions. I owed it to my own infrastructure to help fix bugs, and really, it was fun. I love writing distributed systems, and the ZooKeeper code base is a pleasure to work with; a little baroque, but just enough to be a fun challenge, and most of the complexity is in the fundamental problem. 

Working in the community has not just been about fixing hard bugs. It's also been about those engineering and teamwork considerations that are beguiling on a co-located team, and working on a team that I communicated with entirely though email and Jira was intense. Lucky for me, the ZooKeeper community is some of the most mature engineers I've ever had the pleasure of working with. We pull together to solve hard bugs, we all participate in answering questions and we try to make everyone feel welcome to participate. I consider the community to be the textbook example of open source done right.

Working with this community, working in public after being sequestered in the tightly controlled environment of finance for so many years, flipped something in my brain. I realized that I wanted to be able to live out loud, as it were. I value openness, the ability to work with people all over the world, the ability to work in public, getting feedback and appreciation from the wider community of developers. It also gave me confidence that I could be productive outside of the comfort zone of the place I had worked for years, and that I could show a degree of leadership even without an official title.

In the end, this experience freed me from feeling tied to the corporate life I had been living. I feel open to choose the path I want as a developer. The startup world of today has embraced this open source mentality, which I think is one of the most exciting developments of the last five years. So, I chose to go to a new job that I knew would let me live out loud. 

If you're not already in the open source community, why not crack open your favorite open source project and make 2012 your year of open source? 

ZooKeeper 3.4: Lessons Learned

After several months on the planning block, it looks like ZooKeeper 3.4 is finally almost ready to be released. (Edit: Hooray! As of 11/22, release 3.4 is available!) I can say with confidence that all of the committers for the project have learned a lot from the course of this release. And most of it is in the form of "ouch, lessons learned".

First lesson: Solidify your new feature set early.
Going through the Jira, the earliest new feature for the 3.4 release is the uplift of the ZAB protocol to ZAB1.0. No small feature, to be sure, we were still debugging minor issues with it through the very end stages of our 3.4 work. We also added multi transactions, kerberos support, a read-only zookeeper, netty, windows support for C, and certainly others I'm forgetting. Some of these features were pretty simple uplifts, but some of them caused us build instability for months and a great deal of distraction. Many of these were added as "just one more feature". But many other features were neglected because "we're almost ready for 3.4" (as it turned out, often not actually the case). If we had decided early what new major features we were pushing for with 3.4, we could have concentrated our efforts more effectively and delivered much sooner.

Second lesson: When it's time to push, push.
Giving birth requires a period of concentrated pushing. If you think you can push a little now, then put it off for a few days, then a bit now, then a few weeks off... the baby will never come, and neither will the release. It took several attempts before the community finally rallied behind the efforts to get a release out, and we ended up losing a lot of momentum in the process. We didn't have a solid and pre-agreed-upon features to know when we were done, so things just kept getting in the way. When the attention on the release was off, a minor bug or feature request would come in and it just seemed so small, what was the harm?

Third lesson: Prioritize as a community, and stick to those priorities
This falls in with setting up a feature list early, but it goes beyond that. Our community was split between those who were very interested in seeing 3.4 released, and those who were working on major new changes or refactorings against trunk. As a result we all ended up feeling shortchanged. Contributors with new features did not get the attention their features needed, and many still sit in unreviewed patch form. Users that were hungry for the 3.4 release were frustrated with our lack of attention to getting it out. We had some massive new refactoring efforts that continued to happen on trunk during the course of the release process, which resulted in a frustrated committer base stuck backporting or forwardporting patches between increasingly divergent branches. These efforts found bugs, but not without some cost. Having unclear priorities divided the community, caused some tension, and ultimately slowed the whole release process down.

Fourth lesson: You can always do more releases, it doesn't all have to happen now
This is perhaps my own biggest takeaway from this process. I wish we had done much less, done it much faster, and been willing to release a 3.4 that was quickly followed by 3.4.1, 3.5, etc, as needed. Proponents of agile development and release practices have a good point; the more often you release, the less there is to go wrong and the easier it will be to fix if and when it does. It becomes a self-fulfilling prophecy. We don't release frequently so people want to cram as many new features in as possible, which slows down the releases, which results in pushes for more new features, which results in more bugs and slowed down releases, and on and on.

These lessons may seem obvious in retrospect, but they came at the price of many people's time and effort. I'm proud of our community for pulling together in the end, but I also hope that 3.5 will be a different and less arduous journey.