Development Pitfalls

The Trials and Temptations of the New Leader: "Cool Factor"

Being a new leader at a startup is hard for many reasons. You think you're good at a lot of things, only to discover that you're not. You were fooled by success gained inside a company whose hidden structures did much of the work for you: the invisible backpack of big(ger) company privilege.

One very common area of weakness is recruiting. When I started to lead engineering at Rent the Runway, I fancied myself pretty good at recruiting. After all, I had done it well in prior gigs, and I was friendly and engaging in interviews, so I'd be fine! Of course, I quickly realized that startup recruiting is enormously different from recruiting at a company with a whole recruiting department sourcing candidates, making sure the process goes relatively smoothly, and, of course, paying fat industry++ salaries. Outside of that structure, you experience a world where you reach out to person after person and get nothing but silence, blank stares, or polite dismissals. Your CEO tells you that you've gotta sell, and asks for your sales pitch. And often, faced with a string of failures and pressure to grow, you land on the need to do something "cool" with technology to up your "cool factor."

You know what I mean. Chase the buzzwords: microservices, Go, big data, event-driven, reactive, functional, etc etc etc. The only way engineers will want to come work for me on my relatively straightforward application development is if I give them the carrot of Cool New Technology!

I have learned a few things over the years, and one of them is that engineers who are only interested in Cool New Technology usually won't stick with your Boring Business Problem long enough to be worthwhile. Or worse, they'll stick around just long enough to leave a trail of one-off solutions that no one else on the team understands before they walk away and leave you holding the bag.

If you're tempted to reach for Cool Tech, then I'm going to guess that you're not at a company where the primary challenge is purely technical or scaling. Instead, the interesting problem that your company is solving is almost certainly a combination of a) figuring out how your business, possibly the first of its kind, is going to survive, and b) growing, changing, and evolving to create a functioning organization. Once your company is successful, many of the problems that seemed trivial become surprisingly challenging to solve at scale, but in the early days oversolving with cool tech only leads to distraction from tackling the real challenges.

So resist the urge to adopt any technology for the cool factor of recruiting. Instead, look for people who want to own big parts of the system, who are interested in the business, who are really passionate about the customers you're serving, who are looking for leadership opportunities. Don't undersell the opportunity you have just because it isn't Cool New Technology. Be honest with yourself about the real problems that make you excited about the business that you're in, and that's where you'll find your best sales pitch.

Branching Is Easy. So? Git-flow Is Not Agile.

I've had roughly the same conversation four times now. It starts with a question about our deployment/development strategy and some way in which it could be tweaked. Inevitably, someone brings up the well-known git branching model blog post. Why not use this git-flow workflow? It's very well laid out, and relatively easy to understand. Git makes branching easy, after all. The original blog post in fact contends that because branching and merging are extremely cheap and simple, they should be embraced:
"As a consequence of its simplicity and repetitive nature, branching and merging are no longer something to be afraid of. Version control tools are supposed to assist in branching/merging more than anything else."
But here's the thing: There are reasons beyond tool support that would lead one to want to encourage or discourage branching and merging, and mere tool support is not reason enough to embrace a branch-driven workflow.

Let's take a moment to remember the history of git. It was developed by Linus Torvalds for use on the Linux project. He wanted something that could apply patches very fast and that supported the kind of distributed workflow you really need when you are supporting a huge, distributed team. And he made something very, very, very fast, great for branching and distributed work, and difficult to corrupt.

As a result, git has many virtues that align perfectly with the needs of a large distributed team. Such a team has potentially long cycles between an idea being discussed, developed, reviewed, and adopted. Easy and fast branching means that I can go off and work on my feature for a few weeks, pulling from master all the while, without a huge headache when it finally comes time to merge that branch back into the core code base. In my work on ZooKeeper, I often wish I had bothered to keep a git-svn sync going, because reviewing patches in svn is tedious and slow. Git was made to solve my version control problems as an open source software provider.

But at my day job, things are different. I use git because a) git is FAST and b) Github. Fast makes so much of a difference that I'm willing to use a tool with a tortured command line syntax and some inherent complexity. Github just makes my life easier, I like the interface, and even through production outages I still enjoy using it. But branching is another story. My team is not a distributed team. We all sit in the same office, working on shared repositories. If you need a code review you can tap the shoulder of the person next to you and get one in 5 minutes. We release frequently; I'm trying to move us into a continuous delivery model that may eventually become continuous deployment if we can get the automation in place. And it is for all of these reasons that I do not want to encourage branching or have it as a major part of my workflow.

Feature branching can cause a lot of problems. A developer working on a branch is working alone. They might be frequently pulling in from master, but if everyone is working on their own feature branch, merge conflicts can still hit hard. Maybe they have set things up so that an automated build runs on every push to that branch, but it's just as likely that tests are only being run locally, and the minute the branch goes into master you'll see random failures due to the various gremlins of all software development. Worst of all, it's easy for them to work in the dark, shielded from the eyes of other developers. The burden of doing the right thing is entirely on the developer, and good developers are lazy (or busy, or both). It's too easy to let things go too long without code review, without integration, and without detecting small problems. From a workflow perspective, I want something that surfaces small problems early and visibly to the whole team, enabling inherent communication. Branching doesn't fit this bill.

Feature branching also encourages thinking about code and features as all or none. That makes sense when you are delivering a packaged, versioned product that others will have to download and install (say, Linux, or ZooKeeper, or maybe your iOS app). But if you are deploying code to a website, there is no need to think of the code in this binary way. It's reasonable to release incomplete code behind feature flags, flagged off, so that the new code stays integrated and can be tested in other environments. Learning how to write code in such a way that it is chunkable, flaggable, and almost always safe to go into production is a necessary skill set for frequent releases of any sort, and it's essential if you ever want to reach continuous deployment.
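To make that concrete, here is a minimal sketch of the pattern, with hypothetical names; a real system would back the flags with a config service or database so they can be flipped without a deploy:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal feature-flag sketch (hypothetical names). Unfinished code ships
// merged and deployed, but dark, because flags default to off.
public final class FeatureFlags {
    private static final Map<String, Boolean> FLAGS = new ConcurrentHashMap<>();

    public static boolean isEnabled(String name) {
        // Unknown flags are off, so half-built features stay hidden.
        return Boolean.TRUE.equals(FLAGS.get(name));
    }

    public static void enable(String name) {
        FLAGS.put(name, Boolean.TRUE);
    }
}
```

At the call site, the incomplete path sits behind a check like if (FeatureFlags.isEnabled("new-product-grid")), so it is built, integrated, and testable in every environment without ever being exposed to customers.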

Release branching may still be a necessary part of your workflow, as it is in some of our systems, but even the release branching parts of the git-flow process seem a bit overly complex. I don't see the point in having a develop branch, nor do I see why you would care about keeping master pristine, since you can tag the points in the master timeline where you cut the release branch. (As an aside, the fact that the original post refers to "nightly builds" as the purpose of the develop branch should raise the eyebrows of anyone doing continuous integration.) If you're not doing full continuous deployment, you need some sort of branch that marks where you cut the code for testing and release, and hotfixes may need to go into up to two places, that release branch and master; git-flow doesn't solve the problem of pushing fixes to multiple places. So why not just have master and release branches? You can keep your release branches around for as long as you need them to get live fixes out, and even longer for historical record if you so desire.

Git is great for branching. So what? Just because a tool offers a feature, and does it well, does not mean that feature is actually important for your team. Building a whole workflow around a feature just because you can is rarely a good idea. Use the workflow that your team needs; don't cargo cult an important element of your development process.

The Siren Songs of Hack Day Projects

I love hack days. I think spending a day every now and then to just work on whatever inspires you is one of the best trends of the last five years. Whether you use it to try out a new idea, fix a problem that's been bothering you for months, or play with a new tool, it's a great use of engineering time.

However, we can't forget the seductive illusion of the hack. When you have one day to make something, and you decide to make something new, you make it however you can. You don't write tests (unless perhaps you are the most enlightened TDDer). You don't worry about edge cases. You don't plan for scale. You're playing! You're having fun! You're creating something beautiful and impressive that could open new frontiers for your business!

What's the downside? A good hack day produces ideas and projects that you want to put into production. But when the time comes to actually take those projects and make them production-worthy, it's often quite a letdown. At Rent the Runway we've seen this happen twice over our two hack days. On the first, the whole team worked on a project to rebuild "Rent the Runway" on a completely new platform. The story goes that they got 80% of the way there in one 24-hour binge. 80%! This incredible feat actually led us to say that we would move the entire site off of our existing platform within the next 3-4 months. Heck, if we could get 80% of the site rewritten by the whole team working for a day, surely we could get the site rewritten completely in a few months.

I'm a bit embarrassed to admit that we all bought into the hype of the 80% completion for longer than we should've. In reality, the 80% completion was more like "80% of basic functionality completed", which is still very impressive but ignores the 50% of required functionality that goes well beyond basic. A 3-4 month rewrite timeline might be realistic if you could take the entire engineering team, build no new features, and eventually ship something similar but not quite the same. Of course, none of these things are ever true.

The most recent hack day had some of our engineers build a very cool project using our reviews data. So cool, in fact, that our product team latched onto it and decided to make it into a full-fledged major launch. Success! And now... we're just finishing a couple of intense weeks of project planning for what is quickly turning into a man-year of engineering work. It will be cool when it launches, but I don't think any of us expected this thing that took a day to hack to turn into four engineers working for three full months just to get a polished v1.

The hack day siren song can also make you fall so in love with your hack that you don't see the cost of making it production-ready. I've known engineers who went off on their own and neglected their other duties for projects that would never see the light of production due to their complexity and questionable business value. It's hard to complain when people are working on something they're passionate about, but it's important for a team to feel like they are all working toward the same goal, and that's difficult when some members spend most of their time tinkering with side projects.

I'm jaded and boring, so my most recent hack day project took some images that were stored in the database and put them on the CDN. It took me longer than anticipated, but I wrote tests, made it work in both production and staging, and released it that day. (My expected 2-hour hack still took closer to 6 hours end to end.) It also enabled other hack day projects, which was really the point. At the end of the day, I'm glad most of the engineers I work with use their hack days to shoot for the moon, even if the final results always take longer to achieve than we hope and maybe spill over past that one day. After all, what's the point in fast-loading images if no one ever views them?

War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend and into Java backend services, and we took another big step down that road this week by moving all of the logic for filtering our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good: we've saved almost 3s of average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released, we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow full GCs.

The fix seemed fairly straightforward: instead of caching request parameters mapped to full objects, just cache the objects' primary ids. So the project lead coded up the fix, and we pushed it out.
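The change looked roughly like the sketch below. This is a reconstruction with hypothetical names (Product, cacheKey, the cache wiring); only the shape matters:

```java
import java.util.List;

import com.google.common.base.Function;
import com.google.common.collect.Lists;

import net.sf.ehcache.Cache;
import net.sf.ehcache.Element;

// Reconstructed sketch of the fix: cache the list of product ids instead
// of the full objects. Product is a stand-in domain class.
public class ProductGridCache {
    static class Product {
        private final String id;
        Product(String id) { this.id = id; }
        String getId() { return id; }
    }

    private final Cache cache;

    public ProductGridCache(Cache cache) {
        this.cache = cache;
    }

    public List<String> cacheIds(String cacheKey, List<Product> products) {
        List<String> ids = Lists.transform(products, new Function<Product, String>() {
            @Override
            public String apply(Product product) {
                return product.getId();
            }
        });
        // Looks like we're caching a small list of strings...
        cache.put(new Element(cacheKey, ids));
        return ids;
    }
}
```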

That was the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load tested it again, and it looked OK for what our load tests are minimally worth, so we pushed it onto one of our prod boxes and gave it a spin.

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly and surely the heap grew and grew, and garbage collection ran more and more. We took it out of the load balancer, forced a full GC, and it still had over 600MB of live heap. What was going on?

I finally took a heap dump and put the damned thing into MAT, the Eclipse Memory Analyzer. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise; we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function" that pointed to our base class.

As it turns out, Lists.transform returns a lazy view. Instead of applying the transformer to the list immediately and returning the results, it hands you back an object that acts like a list but applies the transform to elements as you retrieve them, and to do so the view keeps references to the original list and the transforming function. That's great for saving a bit of time up front, but absolutely terrible if you're caching the result to save memory. Now, to be fair, Guava tells you that this is lazy in the javadoc, but not until you get to the third part of the doc, and we were even lazier than Guava in our evaluation. So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result and cache that. Finally, problem solved.
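In the same sketch as before, the working version is a one-line change: materialize the lazy view into a real ArrayList before caching, so the cache holds plain strings instead of a view that pins the source list and the Function. (An anonymous inner Function also carries a reference to its enclosing instance, which is consistent with that "function" reference pointing at our base class in MAT.)

```java
public List<String> cacheIds(String cacheKey, List<Product> products) {
    // Copy the lazy view into a concrete list; only the id strings are
    // retained, and the source objects become eligible for collection.
    List<String> ids = Lists.newArrayList(Lists.transform(products,
            new Function<Product, String>() {
                @Override
                public String apply(Product product) {
                    return product.getId();
                }
            }));
    cache.put(new Element(cacheKey, ids));
    return ids;
}
```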

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems around the globe, and I enjoyed the process of finding the best tools for that job, no matter how esoteric. I prided myself on looking beyond good enough, and I was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start new projects in Play instead of having to use one of the other Java frameworks we've moved forward with, even though the new frameworks add relatively little complexity. I love Dropwizard to death and find it to be a good replacement for Play, but it doesn't support JPA out of the box yet, and it requires just a bit more thought to get a new project up and running; even this minimal additional complexity is enough to slow everyone down a noticeable amount on new projects. Every bit of thought we have to put out there is mental overhead that takes away from delivery velocity.
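For a sense of scale, here is roughly the minimum a new Dropwizard service involves; this is a sketch against a later io.dropwizard-era API (package names have varied across versions), not the code we ran. None of it is hard, but every class here, plus the YAML config file the server command expects, is a decision Play made for us:

```java
import io.dropwizard.Application;
import io.dropwizard.Configuration;
import io.dropwizard.setup.Environment;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// A minimal Dropwizard service: an Application subclass, a Configuration,
// and a Jersey resource, launched as `java -jar service.jar server config.yml`.
public class HelloApplication extends Application<Configuration> {
    public static void main(String[] args) throws Exception {
        new HelloApplication().run(args);
    }

    @Override
    public void run(Configuration config, Environment environment) {
        environment.jersey().register(new HelloResource());
    }

    @Path("/hello")
    @Produces(MediaType.TEXT_PLAIN)
    public static class HelloResource {
        @GET
        public String hello() {
            return "hello";
        }
    }
}
```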

Another painful example of unexpected complexity happened this past week. We are moving our stack toward a service-oriented (SOA) model, and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor, and they have a hard limit of 30s per request: any request that runs longer than 30s is killed by the load balancer, which sends an unspecified 500 to our storefront. We have a best practice that our staging environment should be run the same way as our production environment, but when it comes to setting up load balancers we often forget about this policy, and we released a major service migration that ran some very heavyweight db queries behind a new layer of load balancers. So we did the release and immediately tested the most intensive requests that could be run. Some of those requests failed with "unexpected" 500s. Surprise surprise: we had forgotten about the load balancer, and not only that, but everyone doing the release had forgotten about the request sniping and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made about what to use. I have chosen, in this case, to minimize upfront complexity and go with MongoDB. Moreover, I believe that, of all the NoSQL stores I have personal experience with (HBase, Cassandra, MongoDB), MongoDB is the most likely to be widely successful over the long term. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and from small startups to big companies that is generally true. Why do most companies use SQL-based RDBMS systems? They often have serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that balance the tradeoff between complexity and power against developer overhead. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying into hype or speeding toward a brick wall. Of course, not every problem is a nail. But as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Process Debt and Team Scalability

Most of us in tech spend time thinking about technical debt. Whether it's a short-term loan or a massive mortgage, we've all seen the benefits and costs of managing and planning a code base with technical debt, and we are generally familiar with ways to identify, measure and eliminate this debt.

What about process debt? Process debt is a lot like technical debt, in that it can be a hack or a lack. The lack of process as debt is pretty easy to see. For example, never bothering to create an automated continuous build for your project. That's a fine piece of process to avoid when you have only one or two developers working on a code base, but as your team grows, your lack of process will start to take its toll in broken code and wasted developer time.

There are also plenty of ugly process hacks. Take, for example, the problem of a team of varying skill levels and a code base that is increasingly polluted by poorly formatted, totally unreadable code. Faced with this problem, you may be tempted to institute a rule that every line of code must be reviewed by a teammate. Later you discover that the two people with the worst style are reviewing each other's code, and things still aren't improving enough. So you declare that everyone must have their code reviewed by one of the senior developers you have appointed. Now you have better style, but at the cost of everyone's productivity, especially that of your senior developers.

Adding code reviews was the easiest path to take. You just hacked in some process instead of paying up front to think about what the best process would be. But process, once added, is hard to remove. This is especially true for process intended to alleviate risk. Maybe you realize after a few months that the folks with questionable style have learned the ropes, and you remove the rule. But what if you added code reviews not to fix a style problem, but to act as a risk mitigant? I've done this myself in the case of a code base with very little automated testing and a string of bad releases. I'm now paying for my technical debt (a lack of automated testing) by taking out a loan on process.

An easy way to identify process debt is to ask yourself the following question: Will this solution still work when I have twice as many developers? What about ten times as many? A hundred? Process debt, like technical debt, makes scaling very difficult. If every decision requires several meetings and a sign-off, if every git pull is a roll of the dice for your continued productivity, or if you have to hire three project managers for every five developers just to keep track of the task list, you're drowning yourself in process debt.

Developers are wary of process because we too often experience process debt of the hack sort. I believe that both a lack of process and an excess of process can cause problems, but if we begin to look at process debt with the same awareness and calculations that we apply to technical debt, we can refactor our playbook the way we refactor our code and end up with something functional, simple, and maybe even elegant.

Framework Developers, Application Developers

I was chatting over drinks with a buddy of mine (Joe Stein of All Things Hadoop) the other day, and we agreed that we were both annoyed with open source frameworks that seem to have been built by people who never wrote applications using said frameworks, and sometimes by people who seem never to have developed applications at all. I've been both an application developer and a framework developer, and I can say without question that the worst job I've ever done with a codebase was on a framework that I never used and didn't originate myself. Why does this happen? I'm a good developer, but I'm not immune to the common pitfalls of framework/library development.

Pitfall 1: Never running a feature in a real application
I think this is a very common problem with frameworks developed by people who aren't actively using them. You think of a cool feature, or maybe some user asks you for one, and you spec it out and implement it. You hopefully write some good unit and integration tests, and everything seems to work. But of course, you neglected to test what happens when the whole system is rebooted and the state of this feature changes. With certain kinds of features, you can build something half right and have it silently fail for a long time before anyone notices. Quotas in ZooKeeper are an excellent example: a monitoring feature that worked until the quota was written to snapshot, and that didn't seem to be used by any of the maintainers of the project. (cf this not very descriptive jira)
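A generic illustration of the failure mode, in hypothetical code that has nothing to do with ZooKeeper's actual implementation: state that lives only in memory passes every unit test and then silently vanishes on the first restore.

```java
import java.io.Serializable;

// Hypothetical sketch: a quota feature that works until the system is
// rebooted, because usage is never restored from the persisted snapshot.
public class QuotaTracker implements Serializable {
    private final long byteLimit;
    // Marked transient here (or simply omitted from a hand-rolled snapshot
    // format), so it silently resets to 0 on deserialization.
    private transient long bytesUsed;

    public QuotaTracker(long byteLimit) {
        this.byteLimit = byteLimit;
    }

    public void record(long bytes) {
        bytesUsed += bytes;
    }

    public boolean overQuota() {
        // Unit tests that build a fresh tracker, record usage, and check
        // this method all pass; only a snapshot/restore cycle exposes the bug.
        return bytesUsed > byteLimit;
    }
}
```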

Pitfall 2: Never having to test application code that uses this framework
I'm hitting this a bit in my usage of the Play framework. It's a framework that did have a lot of testing features built into it, but... they neglected to implement Filterable in their JUnit runner, so you can't run a single test out of a class in your IDE. I submitted a fix for this a few weeks ago that has been withering on the vine, despite the fact that this is an incredibly annoying thing to overlook and a trivial thing to fix. The framework also doesn't support changing the http port on the command line when running tests automatically. Why would you ever need to? Well, you might if you happen to have a code base with several active branches in development that are all being automatically tested, as I do right now. The framework developers may never get bitten by this, but it's definitely an annoyance for an application developer using the framework.

Pitfall 3: Throwing in everything and the kitchen sink
I recently saw a retweet asking why the hell Guava would add an EventBus feature. Does that really belong in a collections framework? When your whole life is the framework you're developing, sometimes no feature seems too small or too unrelated. Unfortunately, putting in too much for the sake of completeness can make your code harder for application developers to fully grasp. If a library has several subtle variations of a method with slightly different argument lists, and I have to constantly check the javadoc and stop to think every time I try to use it, I'm likely to use it less, or to just find one way to do things and always do them that way. I will, and have, rejected libraries in the past for being overly feature-laden. I don't always want or need complexity, and I'd frequently rather work around a small missing element than spend my life searching for exactly the method I want to call.
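A hypothetical illustration of the kind of API surface I mean, where every call requires a trip back to the javadoc:

```java
import java.util.List;

// Hypothetical: five subtly different variants of the same operation.
// Which one did you want? Better go check the javadoc again.
interface CatalogClient {
    List<String> fetchItems(String categoryId);
    List<String> fetchItems(String categoryId, boolean includeArchived);
    List<String> fetchItems(String categoryId, int limit);
    List<String> fetchItems(String categoryId, int offset, int limit);
    List<String> fetchItems(String categoryId, int limit, boolean includeArchived);
}
```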

Pitfall 4: Making your library difficult to read and debug through
When you coat everything in layers upon layers of indirection, reflection, deeply nested interface hierarchies, and painful call graphs, it's hard for your users to figure out what the hell is actually going to happen, and painful to debug through the code when something goes wrong. I can't possibly be the only developer who learns libraries half by reading the documentation and half by just calling the method that seems right and reading through the code when it doesn't work. This is largely why I absolutely despise fluent-style development. When it is done perfectly and just works (as in, perhaps, the case of something like Mockito), it's verbose but acceptable. When there are lots of links in the chain where something could go wrong, it is an absolute nightmare to read and debug. I'm already keeping the call stack of my own application in my head; please make your library as easy as possible to add to that mental load.
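A contrived, hypothetical example of that debugging pain: one link in the chain quietly returns null, and the resulting stack trace blames the whole statement rather than the failing step.

```java
// Hypothetical fluent API; all names are invented for illustration.
public class FluentPain {
    static class Report {}

    static class ReportBuilder {
        ReportBuilder forAccount(String id) { return this; }
        ReportBuilder withDateRange(String start, String end) { return null; } // hidden failure
        ReportBuilder groupBy(String field) { return this; }
        Report build() { return new Report(); }
    }

    public static void main(String[] args) {
        // Throws NullPointerException, and the trace points at this one
        // statement; nothing tells you which of the four calls returned
        // null without stepping through the library.
        Report r = new ReportBuilder().forAccount("a1").withDateRange("2012-01-01", "2012-02-01").groupBy("region").build();
    }
}
```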

The best way to get over most of these pitfalls is to have at least one person on your framework team who actually uses the framework you're developing for something else. Barring that, listen to your users carefully. When they are confused about how to figure out what to call, frustrated by the difficulty of debugging, or complaining about how hard it is to test against your framework, these aren't problems to treat lightly. Remember, your framework succeeds or fails not on its own internal merits, but on how many people actually use it to develop other code. Application developers are a framework developer's best friend.