
Branching Is Easy. So? Git-flow Is Not Agile.

I've had roughly the same conversation four times now. It starts with a question about our deployment/development strategy and some way in which it could be tweaked. Inevitably, someone brings up the well-known git branching model blog post. They ask, why not use this git-flow workflow? It's very well laid out and relatively easy to understand, and git makes branching easy, after all. The original blog post in fact contends that because branching and merging are extremely cheap and simple, they should be embraced:
"As a consequence of its simplicity and repetitive nature, branching and merging are no longer something to be afraid of. Version control tools are supposed to assist in branching/merging more than anything else."
But here's the thing: there are reasons beyond tool support to encourage or discourage branching and merging, and tool support alone is not reason enough to embrace a branch-driven workflow.

Let's take a moment to remember the history of git. It was developed by Linus Torvalds for use on the Linux kernel. He wanted something that could apply patches very quickly and that supported the kind of distributed workflow you really need when coordinating a huge distributed team. And he made something very, very, very fast, great for branching and distributed work, and difficult to corrupt.

As a result, git has many virtues that align perfectly with the needs of a large distributed team. Such a team has potentially long cycles between an idea being discussed, being developed, being reviewed, and being adopted. Easy and fast branching means that I can go off and work on my feature for a few weeks, pulling from master all the while, without having a huge headache when it comes time to finally merge that branch back into the core code base. In my work on ZooKeeper, I often wish I had bothered to keep a git-svn sync going, because reviewing patches is tedious and slow in svn. Git was made to solve my version control problems as an open source software provider.

But at my day job, things are different. I use git because a) git is FAST and b) GitHub. Fast makes so much of a difference that I'm willing to use a tool with a tortured command line syntax and some inherent complexity. GitHub just makes my life easier; I like the interface, and even through production outages I still enjoy using it. But branching is another story. My team is not a distributed team. We all sit in the same office, working on shared repositories. If you need a code review you can tap the shoulder of the person next to you and get one in 5 minutes. We release frequently; I'm trying to move us into a continuous delivery model that may eventually become continuous deployment if we can get the automation in place. And it is for all of these reasons that I do not want to encourage branching or have it as a major part of my workflow.

Feature branching can cause a lot of problems. A developer working on a branch is working alone. They might be frequently pulling in from master, but if everyone is working on their own feature branch, merge conflicts can still hit hard. Maybe they have set things up so that an automated build runs on every push they make to that branch, but it's just as likely that tests are only being run locally, and the minute this goes into master you'll see random failures due to the various gremlins of all software development. Worst of all, it's easy for them to work in the dark, shielded from the eyes of other developers. The burden of doing the right thing is entirely on the developer, and good developers are lazy (or busy, or both). It's too easy to let things go too long without code review, without integration, and without detecting small problems. From a workflow perspective, I want something that brings small problems to light early and visibly to the whole team, enabling inherent communication. Branching doesn't fit that bill.

Feature branching also encourages thinking about code and features as all or nothing. That makes sense when you are delivering a packaged, versioned product that others will have to download and install (say, Linux, or ZooKeeper, or maybe your iOS app). But if you are deploying code to a website, there is no need to think of the code in this binary way. It's reasonable to release incomplete code behind feature flags, flagged off in production, so that the new code still gets integrated and can be tested in other environments. Learning how to write code in such a way as to be chunkable, flaggable, and almost always safe to go into production is a necessary skill set for frequent releases of any sort, and it's essential if you ever want to reach continuous deployment.
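To make that concrete, here's a minimal sketch of flag-gating, assuming a hypothetical FeatureFlags lookup and flag name (an illustration, not any particular flag library):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical flag store; real systems back this with config or a database.
class FeatureFlags {
    private final Set<String> enabled;
    FeatureFlags(Set<String> enabled) { this.enabled = enabled; }
    boolean isEnabled(String flag) { return enabled.contains(flag); }
}

class ProductRanking {
    private final FeatureFlags flags;
    ProductRanking(FeatureFlags flags) { this.flags = flags; }

    // The new ranking code is merged, deployed, and exercised in staging,
    // but stays dark in production until the flag is flipped.
    List<String> rank(List<String> ranked, Set<String> hearted) {
        if (!flags.isEnabled("promote-hearted-items")) {
            return ranked;
        }
        List<String> promoted = new ArrayList<>(ranked);
        // Stable sort: hearted items float to the top, relative order kept.
        promoted.sort((a, b) -> Boolean.compare(hearted.contains(b), hearted.contains(a)));
        return promoted;
    }
}
```

The mechanism matters less than the habit: incomplete code reaches master and every environment early, and flipping the flag, not merging a long-lived branch, is what launches the feature.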

Release branching may still be a necessary part of your workflow, as it is in some of our systems, but even the release branching parts of the git-flow process seem a bit overly complex. I don't see the point in having a develop branch, nor do I see why you would care about keeping master pristine, since you can tag the points in the master timeline where you cut the release branch. (As an aside, the fact that the original post refers to "nightly builds" as the purpose of the develop branch should raise the eyebrows of anyone doing continuous integration.) If you're not doing full continuous deployment, you need some sort of branch that indicates where you cut the code for testing and release, and hotfixes may need to go into up to two places (that release branch and master), but git-flow doesn't solve the problem of pushing fixes to multiple places. So why not just have master and release branches? You can keep your release branches around for as long as you need them to get live fixes out, and even longer for historical record if you so desire.

Git is great for branching. So what? Just because a tool offers a feature, and does it well, does not mean that feature is actually important for your team. Building a whole workflow around a feature just because you can is rarely a good idea. Use the workflow your team needs; don't cargo-cult an important element of your development process.

The Siren Songs of Hack Day Projects

I love hack days. I think spending a day every now and then to just work on whatever inspires you is one of the best trends of the last five years. Whether you use it to try out a new idea, fix a problem that's been bothering you for months, or play with a new tool, it's a great use of engineering time.

However, we can't forget the seductive illusion of the hack. When you have one day to make something, and you decide to make something new, you make it however you can. You don't write tests (unless perhaps you are the most enlightened TDDer). You don't worry about edge cases. You don't plan for scale. You're playing! You're having fun! You're creating something beautiful and impressive that can bring new frontiers to your business!

What's the downside? A good hack day produces ideas and projects that you want to put into production. But when the time comes to actually take those projects and make them production-worthy, it's often quite a letdown. At Rent the Runway we've seen this happen twice over our two hack days. On the first, the whole team worked on a project to rebuild Rent the Runway on a completely new platform. The story goes that they got 80% of the way through in one 24-hour binge. 80%! This incredible feat actually led us to say that we would move the entire site off of our existing platform within the next 3-4 months. Heck, if we could get 80% of the site rewritten by the whole team working for a day, surely we could get the site rewritten completely in a few months.

I'm a bit embarrassed to admit that we all bought into the hype of the 80% completion for longer than we should've. In reality, the 80% completion was more like "80% of basic functionality completed", which is still very impressive but ignores the 50% of required functionality that goes well beyond basic. A 3-4 month rewrite timeline might be realistic if you could take the entire engineering team, build no new features, and eventually ship something similar but not quite the same. Of course, none of these things are ever true.

The most recent hack day had some of our engineers build a very cool project utilizing our reviews data. So cool, in fact, that our product team latched on to it and decided to make it into a full-fledged major launch. Success! And now... we're just finishing a couple of intense weeks of project planning for what is quickly turning into a man-year of engineering work. It will be cool when it launches, but I don't think any of us expected this thing that took a day to hack together to turn into four engineers working for three full months just to get a polished v1.

The hack day siren song can also make you fall so in love with your hack that you don't see the cost of making it production-ready. I've known engineers who went off on their own and neglected their other duties for projects that would never see the light of production due to their complexity and questionable business value. It's hard to complain when people are working on something they're passionate about, but it's important for a team to feel like they are all working toward the same goal, and that can be difficult when some members spend most of their time tinkering with side projects.

I'm jaded and boring, so my most recent hack day project took some images that were stored in the database and put them on the CDN. It took me longer than anticipated, but I wrote tests, made it work in both production and staging, and released it that day. (My expected 2-hour hack still took closer to 6 hours end to end.) It also enabled other hack day projects, which was really the point. At the end of the day, I'm glad most of the engineers I work with use their hack days to shoot for the moon, even if the final results always take longer to achieve than we hope and maybe spill over past that one day. After all, what's the point in fast-loading images if no one ever views them?

On Yaks and Hacks

A Yak Story


Today I released a nice little feature into production that tests promoting a customer's "hearted" items to the top of their search results. I decided that it would be a good test after seeing some analytics data on users of this particular feature, and sold it to the head of analytics and myself as a quick feature to spin up: half a day's dev work for me. We already had the data; I just needed to play with the ordering of the products to display.
Of course, it didn't turn out to be quite that simple. When I opened the code and prepared to add the new feature, I discovered a few things. First, my clean checkout of the code base wouldn't even compile. Apparently, in the time I had been away from it, the other devs had renamed a Maven dependency, rtr_infra_common, to rtr_services_common, only to discover that they needed both. Instead of starting the new rtr_infra_common at, say, version 2, they started it back at version 1.0. My local Maven was very, very confused.
Having figured that out, I finally had a clean, compiling workspace, so time to code. I have all the data I need to do this right here, right? Except this "score" field that claims to be in the DTO for our products is not in fact being serialized by the providing service. Oh, and in my absence the package that holds that DTO has also been refactored into its own project. And rtr_services_common depends on that. And rtr_infra_common depends on THAT. And I depend on rtr_infra_common. So just to get the new data, I have to update the product catalog service, the DTO, and the whole chain of dependent projects.
As an aside, I think Maven is kind of nice, but dependency management systems have caused the biggest yaks I've ever shaved, and this couple of hours of annoyance was a small drop in that bucket.
So, got my data. But now I find another problem. The framework I'm working in was designed to take all of this data and apply experiments to it before returning it to the front end. But we hadn't actually built out a notion of the "user" beyond a userId for logging and bucketing purposes. There was nowhere to put the personalized data I needed to use to enhance the results. I could have just saved the list of items I needed and added an additional argument to the experiment processor with that data, but I knew that we would want more and different signals in the near future. So I took another couple of hours, refactored all of the code to properly model a user, changed and enhanced all the tests, and finally finished the job... Except it didn't work. I had forgotten to update the clone method in our DTO to copy the new data I had added. My colleague stepped in for that last bit of belly shaving, as I had now blown my estimate by a good day and my other work was piling up. After going through the irritation of changing 4 projects for 1 line of code, he understood my earlier cursing, and our feature was finally ready.
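That forgotten clone method is a classic shape of bug, so it's worth a minimal sketch (hypothetical names, not our actual DTO):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical DTO: a hand-rolled copy method silently drops a new field.
class ProductDTO {
    long id;
    double score;                                  // the newly added field
    List<String> tags = new ArrayList<>();

    ProductDTO copy() {
        ProductDTO c = new ProductDTO();
        c.id = this.id;
        c.tags = new ArrayList<>(this.tags);
        // Bug: score is never copied, so every clone silently reverts
        // to 0.0. The one-line fix: c.score = this.score;
        return c;
    }
}
```

Nothing fails to compile and no test breaks unless one asserts on the copied field; the feature just quietly doesn't work.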

A Hack Story


On Wednesday, I get a report that there's a snag in a release that is supposed to go out this week. We're taking some work that used to be done manually by our email marketing staff and automating it. The developer on the project has hit a bug. Apparently, our email service provider does not support UTF-8 encoding, and of course, being a fashion company, we have designer names with the occasional accent that we need to throw around. He's been struggling with this, trying to figure out why escaping isn't working, and finally discovering that only Latin-1 will suffice. Before he sits down to write this re-encoding of everything, we ask our email marketers how they did this in the past. "Oh, we just change them to plain text characters". How many characters are we talking about? "Just an é and a ë". Would it be easier to just convert them to plain text than to Latin-1? Yes. And so we did.
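The eventual change amounted to something like this sketch (names invented; the real code lived in the email-sending pipeline):

```java
// More or less the whole hack: map the two known characters to plain
// ASCII instead of re-encoding the pipeline for the ESP.
public class AccentHack {
    static String toPlainText(String designerName) {
        return designerName.replace('é', 'e').replace('ë', 'e');
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("Chloé"));  // prints "Chloe"
    }
}
```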

Yaks vs Hacks


I don't like hacks, but they can be expedient. Instead of spending time going back and forth with our ESP, we will get this manual email automated and out the door, and worry about our text-encoding problems later. Shaving this yak buys us nothing at this point.
On the other hand, I thought my new feature would be a quick hack that could give us a bunch of interesting data, and it turned into a minor yak (or perhaps just a shaggy water buffalo) that took a bit too long but was not quite a full-blown project. I couldn't hack the dependency problems, and hacking the code restructuring would have resulted in threading an arbitrary parameter through a bunch of unrelated code just to avoid thinking about the right solution. The upside of the yak shaving was more data, a better code base for making such changes in the future, and a recognition of a serious dependency structure problem that we need to address. So today I celebrate Yak Shaving Day. I hope it only comes once a year!

War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend to Java backend services, and we took another big step down that road this week by moving all of the logic for filtering of our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good, we've saved almost 3s average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow full GCs.

The fix seemed fairly straightforward: take the cache, which was originally caching parameters mapped to objects, and instead just cache the objects' primary ids. So the project lead coded up the fix, and we pushed it out.

Here's the fix, or at least its shape; the original change was embedded as a gist, so the names here are illustrative. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base:
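```java
import java.util.List;

import com.google.common.base.Function;
import com.google.common.collect.Lists;

import net.sf.ehcache.Cache;
import net.sf.ehcache.Element;

class ProductIdCache {
    // Hypothetical stand-in for our real product type.
    interface Product { Long getId(); }

    private static final Function<Product, Long> GET_ID =
            new Function<Product, Long>() {
                public Long apply(Product p) { return p.getId(); }
            };

    private final Cache cache;
    ProductIdCache(Cache cache) { this.cache = cache; }

    void cacheResult(String key, List<Product> products) {
        // Cache the primary ids instead of the heavyweight objects.
        List<Long> ids = Lists.transform(products, GET_ID);
        cache.put(new Element(key, ids));  // looks harmless...
    }
}
```

So we load test that again, and it looks ok for what our load tests are minimally worth, so we push it onto one of our prod boxes and give it a spin.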

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly and surely the heap grew and grew, and garbage collection ran more and more often. We took it out of the load balancer and forced a full GC, and it still had over 600MB of live heap. What was going on?

I finally took a heap dump and put the damned thing into MAT (the Eclipse Memory Analyzer). Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise; we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element and saw something strange. Instead of containing a string key and a list of strings as the value, the cache element contained some other object. And inside that object was another list, and a reference to something called "function" that pointed to our base class.

As it turns out, Lists.transform is lazy. Instead of applying the transformer to the list immediately and returning the results, it gives you back an object that acts like a list but applies the transform to each element as you retrieve it. That's great for saving a bit of time up front, but absolutely terrible if you're caching the result to save memory: the "transformed" list you cache retains references to the source list and the transforming function. Now, to be fair, Guava's javadoc does tell you that the result is lazy, but not until the third part of the doc, and we were even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result to force the evaluation, and cache that copy. Finally, problem solved.
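In code, the entire fix is one wrapping call (same illustrative names as the sketch above):

```java
// Force the lazy view to evaluate into a plain ArrayList, so the cached
// value holds just the ids rather than the source list plus the function.
List<Long> ids = Lists.newArrayList(Lists.transform(products, GET_ID));
cache.put(new Element(key, ids));
```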

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems running around the globe, and I enjoyed the process of finding the best tools for that job, no matter how esoteric. I prided myself on looking beyond "good enough," and was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start new projects in Play instead of having to use one of the other Java frameworks we've moved forward with, even though the new frameworks add relatively little complexity. I love Dropwizard to death, and find it to be a good replacement for Play, but it doesn't support JPA out of the box yet, and it requires just a bit more thought to get a new project up and running. Even this minimal additional complexity is enough to slow everyone down a noticeable amount on new projects. Every extra bit of required thought is mental overhead that takes away from delivery velocity.

Another painful example of unexpected complexity happened this past week. We are moving our stack toward a service-oriented architecture (SOA), and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor and have a hard limit of 30s per request: any request that runs longer than 30s is killed by the load balancer, and an unspecified 500 is sent to our storefront. We have a best practice that our staging environment should be run the same way as our production environment, but when it comes to setting up load balancers we often forget this policy, and we released a major new service migration that put some very heavyweight db queries behind a layer of load balancers. So, we do the release and immediately test the most intensive requests that could be run. Some of these requests fail with "unexpected" 500s. Surprise surprise: we forgot about the load balancer, and not only that, but everyone doing the release forgot about the request sniping and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made in what to use. I have chosen, in this case, to minimize upfront complexity and go with MongoDB. Moreover, I believe that of all the NoSQL stores I have personal experience with (HBase, Cassandra, MongoDB), MongoDB is the most likely to be successful long-term and widely adopted. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and from small startups to big companies that is generally true. Why do most companies use SQL-based relational databases? They often have serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can simply try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that balance the power of complex tools against the overhead they impose on developers. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying in to hype or speeding towards a brick wall. Of course, not every problem is a nail. But, as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Budgeting for Error

What's your uptime SLA over the last month, six months, year? Do you know off the top of your head? Is your kneejerk response, "as close to 100% as possible"? Consider this: by not knowing your true current SLA, you not only turn a blind eye to a critical success metric for your systems, you also remove the ability to budget within the margins of that metric.

There are about 8,766 hours in a year (365.25 days). How many of those hours do you believe your code absolutely needs to be up to keep your business successful? 99% uptime buys you almost 88 hours of downtime over the course of a year. 99.9% still buys you almost 9 hours. Even 99.99% gives you about 52 minutes a year that your systems can be down. Think of what you can do with these minutes. Note that the much-vaunted "five 9s" reliability (99.999%) breaks down to about 5 minutes over the course of a year, which is great if you're Google or the phone company, but probably not a smart goal for your average startup.
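The arithmetic is simple enough to keep in a scratch file; a quick sketch:

```java
// Prints the yearly downtime budget for a few availability targets,
// using 365.25 * 24 = 8766 hours in an average year.
public class ErrorBudget {
    public static void main(String[] args) {
        double hoursPerYear = 365.25 * 24;
        double[] targets = {0.99, 0.999, 0.9999, 0.99999};
        for (double sla : targets) {
            double downMinutes = (1 - sla) * hoursPerYear * 60;
            System.out.printf("%.3f%% -> %6.1f hours (%7.1f minutes) per year%n",
                    sla * 100, downMinutes / 60, downMinutes);
        }
    }
}
```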

Let's say you know that your deployment process is rock-solid and never causes outages, and that you will never need planned hardware downtime thanks to the way you've architected your systems. But you also know that you have some risky features that you want to push now, before you announce a critical partnership that should result in a big membership bump. If you're sure that the bump won't cause downtime, you might choose to push the features and risk some downtime in smoothing out rough edges on the code so that you have a really compelling site for those new members.

On the other hand, if you know that you're pretty solid under your current load but a 50% increase in usage has the potential for some degree of system failure, your error budget might not accommodate both the risky new features and the membership growth. And if your business pushes you to do both the risky new features and the growth risk? Make sure they know that your SLA may suffer as a consequence. When you know your goal SLA, and you know something is likely to reduce or violate it, that's a strong signal that you should think carefully about the risks of the project. This can also be a useful negotiation tool when being pushed to implement a feature you don't think is ready for prime time. When they say we need to release this new feature today, which means at least two hours of downtime that pushes you out of SLA, it becomes their job to get authorization from the CTO instead of your job to convince them why it is a bad idea.

I will admit that I do not currently have an uptime SLA for my services. Up until recently, it never occurred to me that there would be any value in trying to pin down a number and measure to it. As a result, while liveness and stability are always considerations, I haven't taken the time to think through the rest of the year when it comes to hardware upgrades, new features, or deployment risk as measured by likely downtime impact. I'm missing out on a key success metric for my infrastructure.

Once I've managed to nail down a coarse-grained uptime SLA for my systems, the next phase of this work is to nail down a more fine-grained response-time SLA: of 100 requests, what is the 95th percentile response time from my infrastructure services? This is much trickier than a simple uptime SLA due to the interaction of multiple systems, each with their own SLAs. For now, though, I need to focus on the big picture.

Intuition, Effort, and Debugging Distributed Systems

I recently watched this great talk by Coda Hale, "The Programming Ape". It's heavily influenced by Thinking, Fast and Slow, a book about cognitive processes and biases. One of the major points of the book, and the talk, is that we have two types of thinking: intuitive thinking, which is fast, easy, creative, and sloppy; and attention-based thinking, which is harder but more accurate.

One of the great points Coda makes in his talk is that most of the ways we do things in software development are very attention-heavy. At the most basic level, writing correct code requires a level of sustained attention that none of us possesses 100% of the time, which is why testing (particularly automated testing) is such an essential part of quality software development. Attention doesn't stop when you get the code into production; you still have the problem of monitoring, which often comes in the form of inscrutable charts or messages that take a lot of thought to parse. Automation helps here, but as anyone that has ever silenced a Nagios alert like a too-early alarm clock knows, the current state of automation has limits when it hits up against our attention.

By far the most attention-straining thing I do on a regular basis is debugging distributed systems. Debugging anything is a very attention-heavy process; even if you have good intuition about where the problem may lie, you still have to read the code, possibly step through it in a debugger or read through log output, and try to find the error. Debugging errors in the interaction between distributed systems is several times more difficult. A debugger is often of no help, at least not initially, because you have to get a series of events to happen in a particular way to trigger the bug. Identifying that series of events in most cases requires staring hard at a series of log files and/or system state dumps, and trying to piece together the ordering based on timestamps that may differ slightly between systems. I consider myself to be a very good debugger, and it still took me a solid 4 hours of deep concentration, searching through and replaying transaction logs, before I was able to crack this particular bug. I would never hold the ZooKeeper code base up as a paragon of debuggability, but what can we do to make this easier?

When you're writing a distributed system, think hard about what you log. This may be impossible to always get right, but very often the only way to find a bug is through log files from around the time it happened. If you're going to reconstruct a series of events, you need evidence that those events happened, and you need to know when they happened. Should you rely on the clocks of the machines to line up well enough to put the time series together, and should you fail the system if the clocks are too far apart? Since it's a distributed system, is there a way for all of the members of the system to agree on a clock that you could use for logging? As for the events themselves, it is important to be able to easily identify them, their particular behavior, and the state they are associated with (the session that made the request, for example).

One of the problems with ZooKeeper logs, for example, is that they don't do a great job of highlighting important events and state changes. Take a typical ZooKeeper server log; does it make your eyes glaze over immediately?


Events are hidden toward the ends of lines, in the middle of other output (type:setData, type:create). Important identifiers are buried in long hex strings like 0x773516a5076a0000, and it's hard to remember which server or connection they are associated with. To debug problems I have to rely on pen-and-paper or notepad records of which session id goes with which machine, and what the actual series of events was on each of the quorum peers. Very little is scannable, and it makes debugging errors a very tedious and attention-heavy process.

Ideally, we want to partially automate debugging. To do this, the logs have to be written in a form that an automated system can parse and reason about. Perhaps we should log everything as JSON. There's a tradeoff, though: now a human debugger probably needs another tool just to read the log files at all. That might not be a bad thing. Insisting on plain text for logging gives up the huge potential win of formatting that can draw the eye to important information in ways other than just text.
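As a sketch of the idea, each event could come out as one machine-parseable line, with the session and ordering information pulled out as first-class fields (the fields here are invented for illustration; this is not ZooKeeper's actual log format):

```java
// One JSON object per line: easy for tools to parse, still greppable.
public class EventLog {
    static String toJson(String event, String sessionId, long zxid,
                         long timestampMs, String path) {
        // zxid stands in for a cluster-agreed ordering counter, so the
        // timeline doesn't depend solely on each machine's wall clock.
        return String.format(
            "{\"event\":\"%s\",\"session\":\"%s\",\"zxid\":%d,\"ts\":%d,\"path\":\"%s\"}",
            event, sessionId, zxid, timestampMs, path);
    }

    public static void main(String[] args) {
        System.out.println(toJson("setData", "0x773516a5076a0000", 42L,
                System.currentTimeMillis(), "/config/node"));
    }
}
```

A log like that can feed an automated timeline builder, and a pretty-printer can still render it for human eyes.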

Are there tools out there now to aid in distributed debugging? A quick Google search turns up several scholarly papers and not much else, but given the ever-increasing growth of distributed systems, I would guess we'll see some real products in this area soon. In the meantime, we're stuck with our eyes and our attention, so we might as well think ahead about how we can work with our intuitive systems instead of against them.

Developer Joy, Part 2

I got some good comments on my last post, including a few submissions. One idea I intentionally left out was working on a project you're passionate about. In fact, the point is that developer joy has to transcend love for a particular project. It is the stuff that keeps us going even when the purpose of the code bores us. It is the stuff that makes writing dull-but-important software worthwhile. It's the stuff that, in short, every good tech lead should keep an eye on.

Sadly, most developers are not working on problems they find intrinsically fascinating. When I came to computing as a career, I told myself that one reason for doing it was that it had applications in so many areas. And applying computing to the fashion industry is a problem I find rather fascinating (and never in a million years expected to be solving). But I spent many years in industries that bored me to death to get where I am today, and I still enjoyed myself immensely, so long as I had enough elements of developer joy.

As an illustration of this joy, one of the more enjoyable projects I ever worked on was distributing the test runner for a code base I worked on. The code base had thousands of tests, most of which were slow integration tests, and we expected this test suite to run and pass both on every checkin and before developers checked in code. We'd gotten to a point where tests were taking 4 hours and most of a developer's local CPU cycles. We weren't willing to just get rid of tests, and we weren't able to refactor the external dependencies completely to make the individual tests themselves faster (ah, Sybase stored procedures!), so we set about figuring out how to run this build process across a farm of machines. I like distributed computing, but this problem was more about dealing with the distribution system we used, debugging strange transient errors when we pulled formerly-sequential tests apart, and setting up a bunch of infrastructure to get things working.

Why did this project end up being so fulfilling? Well, for one, the team working on the project was incredibly smart. The main developer taught me a ton about the underlying distributed system, and in turn we had some great team brainstorming sessions. The whole project was around testing, and of course we wrote tests for the project components themselves, so it was a well-tested system. We had plenty of time to actually work on the problem. And the solution we produced was pretty awesome. In short, it was a fun problem to solve not because I love build systems and tools (a common misconception about me, actually), but because I love solving hard problems, solving them well, and solving them with smart people.

Some developers are always going to need projects they are passionate about. Those developers tend to form startups, go into academia, or find a spot in a big company with the resources to splash on their particular niche. But most of us work on problems that we solve to pay the bills. That doesn't mean there isn't joy in a day's work. When that work involves learning new things, perfecting our expertise, and solving problems and seeing those solutions delivered, it gives us joy. Tech leads, CTOs, and founders sometimes forget that the people working for them may not share their love for the project. Keeping up the elements of developer joy helps bridge that inspiration gap.