The Best Decision I Made in 2014

Many people think that the role of leadership is decision-making. The desire to be the one who makes the calls drives some to climb the ladder so that they can become "The Decider."

I'm sorry to disappoint those who want this to be true, but in my experience the role of leadership is in fact to make as few decisions as possible, and to make the decisions that you are forced to make utterly mundane. Here are some of the mundane decisions I've made this year:
The first pass of a seating chart
Perfunctory approvals around uncontroversial hiring
Rubber stamping of well-thought-out architectural decisions
Signoff on budgets for vendor products we clearly need

I set up a lot of policies over the past few years. Some of them I've blogged about, for example, promotion committees. But I've also created policies around how new languages and frameworks are introduced, on-call rotations, even how we buy office equipment (oh the glamorous job of a CTO!). The goal of pretty much all of these policies is to make future decisions easier, and to empower various people on my team to make decisions without me.

So, I don't believe that good leadership is heavy on decision-making. That being said, you can't be a leader without occasionally setting a direction, which leads me to The Best Decision I Made in 2014. It started with a twitter conversation about continuous delivery, on January 3. I have been thinking about continuous delivery forever, and trying to move Rent the Runway in that direction for as long as I've been running engineering. At the end of 2013, we created a task force to make our deployments, then-weekly and taking up to 6 hours to do, faster and less painful. The team was well on their way to success by January 3. And so, inspired by my conversation, I sent this email:

Starting in Feb

Camille Fournier Sat, Jan 4, 2014 at 9:40 AM
To: tuskforce
I want the one ring to release every day. Even if there's no user visible changes. Even if there's traffic. Even if it means we break things
That's it. I held my breath to see the responses. And as they rolled in, one by one, all of the engineers agreed. They were excited, even! So I started to tell others. Our Head of Product. My CEO. I told them "this might cause some pain, the first couple of weeks, as we figure out how to do this safely, but it's important." 

And so, come February, we began releasing every work day. And it was glorious. 

What made this decision so successful? What did I learn from this?

A great decision is often not a revolutionary move. We were doing the technical work to enable this already. The team wanted to be deploying more frequently. All I did was provide the push, to raise the bar just a little bit higher and express my confidence that we would easily clear it. 

The thing I learned from making this call wasn't anything about the decision itself. It was about the process that got me there. You see, this came about right after New Year's, when things were quiet at work and I had some time to sit alone and think (or, in this case, tweet). All of a sudden, I had ideas again! And it hit me: I need regular time alone, away from meetings and people, in a quiet room with a whiteboard.

So I started blocking my calendar every Wednesday afternoon, and things started to change for me. In those Wednesday afternoons, I thought through the next evolution of our architecture. I thought through our engineering ladder, and our promotions process. I thought about problems I was having with people and how I could make those relationships better. I did some of the foundational work to create the 7 completely new talks that I wrote and delivered in 2014. I made time for the important but not urgent. In short, I grew from a head of engineering focused on the day-to-day into a CTO who thought a lot about the future.

The best decision I made in 2014 wasn't, actually, to tell my team to release every day. That's just the story I've been telling myself all year. The best decision, really, was to make time to think.

"Meritocracy" and the Tyranny of Structurelessness

Engineers like to believe in the idea of meritocracy. We want to believe that the best idea wins, that the person who produces the most value is rewarded, that we judge only on what a person brings to the table and nothing more.

Nothing could be further from the truth, of course, because we are ultimately human, and humans are biased in many ways both subtle and not. This post is not going to attempt to educate you on human bias. If you're unfamiliar with this concept, I'd welcome you to watch this talk put together by Google on the realities of bias in the workplace and efforts you can take to combat this.

Now, all that being said, I love the idea of meritocracy. After all, I am a CTO, surely I am here mostly due to my merit! OK, even ignoring my own position, I would really like to create an organization that does behave in a meritocratic fashion. I don't want to say that someone has to have X years of experience to do something, or Y arbitrary title. I want to reward people who show up, take on big tasks, and produce great results.

The most common way that people at startups attempt to create meritocratic environments, to avoid this title-driven fake hierarchy, is to eschew titles entirely. Eschew titles, have "flat" organizations. Removing the trappings of hierarchy will mean that we are all equals, and will create a place where the best idea wins, right?

Removing titles and pretending that the hierarchy doesn't exist does exactly the opposite of creating a meritocracy. It most often creates a self-reinforcing system where shadow hierarchies rule the day and those outside the in-group have even less opportunity to see their ideas come to life. It frustrates newcomers, and alienates diverse opinions. It is, in short, the enemy of meritocracy.

What is a poor meritocratic-seeking engineering leader to do?

The only answer I have found is echoed in the video I linked above. Far from eschewing titles and pretending no hierarchy exists, you must acknowledge this reality. And, furthermore, you need to be really, really explicit about what it means to actually be working at the level indicated by these titles. You need to first make it really clear to yourself what is required at every level, and then make it really clear to your team what it means to be at every level.

This is not easy to do. My greatest fear in implementing this has been the fear that people will come to me and try to "lawyerball" me into promoting them to a level that I don't feel they are working at. Or that people will become obsessed with their title and constantly be trying to get promoted and treating each other differently due to titles.

To point 1, though, if a person truly is meeting everything I have laid out as being necessary for working at a level, why would I not want to promote them to that level? It either means that I haven't really articulated the level clearly enough to really encompass the responsibilities, or in fact, they really deserve to be promoted and I am letting my bias get in the way of evaluating merit.

To point 2, if I lay out levels that I believe are genuinely increasing in impact and responsibility and have high bars to clear, why would I be upset if people strive and work hard to grow into them? "Meritocracy" doesn't mean "Only reward people who are naturally gifted at what I value." That's the thing I'm trying to stop doing!

On treating each other differently due to titles, well, that's a two-part problem. The first part is this: creating a culture where ideas are welcome from anywhere requires cultivation with or without titles. The second is that people generally get promoted because they have shown bigger impact and influence on the organization, and so it's not that surprising that they will have bigger voices. I'm not sure that is a terrible thing, if those people are living up to the high standards that come with that influence.

Finally, of course, to get a group to embrace this is tough. So why not take the decision-making power to promote people out of the hands of managers? That is what we have done in my team. We now use promotion committees, composed of engineers at a level or two above the person trying to get promoted. Now the whole team is bought into the idea that promoting someone is not a gift bestowed by management but an acknowledgement by a group of peers that one should join their ranks. 

This is not going to be perfect, and it is a lot of work for me to implement. But in my experience taking the time to establish clarity in anything is a worthwhile exercise, and creates a better and ultimately more efficient organization. 

When Defining Reality, Don't Forget To Deliver Hope

I had a great 1-1 with one of my tech leads today, who came by my office hours and asked me for advice on becoming a better manager. I gave my usual rambling reply to broad inquiries; we talked about making personal connections, reading a lot of blogs and books, experimenting with different ways of asking questions to get people to reveal their true interests to you, so that you can better help to nurture and serve those interests.

But then, at the end, I had a thought. I asked him, how often do you talk to your team about what the future is about? How well do you know what the future of your team should be like? Not the product roadmap, which is in the capable hands of an amazing product lead, but the technical future. How to think about building systems that are future-thinking, becoming a better team, writing better code.

"Not that often" he admitted.

So, I persisted. How often do you spend some time away from your keyboard, away from the internet, away from meetings, and think about what you think the future of your team should be, the areas that you could focus on, the big opportunities for growth?

Again, the answer was "Not that often."

This is not at all surprising, of course. When you work in an industry where you focus on building out technical skills and getting more things done for the first many years of your career, making that shift into management (really a career change) can lead to the temptation to focus on solving today's problems now. Solving today's problems well is probably how you ended up rising into a tech lead role, and we all know that there is never a shortage of problems.

But when you focus on nothing but today's problems, even if you are a great manager who does everything right, you are unlikely to motivate your team to greatness, or inspire the level of loyalty and passion that makes a team gel and prosper. You are missing the one thing that you cannot overcome with great management skills. You're missing leadership.

You can be the greatest manager in the world, but without leadership and vision, your team will not be truly sticky to you.

Fortunately, leadership is not a skill you have to be born with. It just requires that you identify the future and articulate it. "Define reality, give hope." Too often first-time tech managers focus on reality. We're comfortable with reality, reality is our bread and butter. But the future? Painting a vision for the future, even the future 6 months out, is a risk. You may not be able to deliver on that future vision. It may not be the right thing to do when the time comes. Reacting to today is so easy, and trying to predict the future seems really hard.

But there's a simple (note: not easy) secret to breaking that habit and creating a vision: practice. Get away from your keyboard. Force yourself to sit in an empty room with a whiteboard or a pen and paper and write some ideas down. Grab a colleague or two to brainstorm with if you need to, but do some of the work by yourself. Then start painting that future to your team. This is the most important thing you can do to become a truly well-rounded manager, and if you aren't doing it, block your calendar tomorrow and start.

On Charm, Skills and Management

I had a great twitter conversation today with Julia Grace on the topic of hiring managers, which you can see in pieces here. The outcome was a hard look at the past couple of years of my own management experience.

TL;DR: Being a manager is nothing at all like being a tech lead, and hiring managers on the basis of their strength as individual contributors is not the guaranteed way to great technical leadership.

I got into engineering leadership because, first off, I wanted to have more impact and responsibility, but secondly, because I was the most senior engineer on the team, and a good communicator to boot. This actually worked out fine for quite a while. I could inspire people to work for me largely because they believed that I had things to teach them, and when I had the time to sit individually with engineers I did teach them many things. When I was hands-on, I could easily identify problems with our process, bottlenecks in our systems, and features that we should push back on, and I did all of that and more. In short, I was a really great tech lead.

But tech lead stopped being my job long before I learned how to truly manage well (nb: I may not yet know how to manage well). When you're not actually able to spend lots of 1-1 time with engineers, and you aren't so deep in the code that it's easy for you to see how the process is frustrating, suddenly those bits of management that you could get away with being bad at become more important. The bits like "making time for 1-1s", and "identifying systemic issues via second-hand reports" and "talking about the status of projects that you haven't actually written any of the code for yourself."

If you wonder why you see so few great managers in startup land, I think the answer is obvious: most of us got to our position by being great tech leads, and we haven't all figured out how to make the leap from tech lead to manager. For me, it's taken a ton of coaching from both a general executive coach and a CTO coach provided by my company. On top of that, I am fortunate enough to have gotten some training while I was still at my finance job, and to have seen effective managers do their thing, so I have a general idea of what good management looks like. And even with all that, I've made pretty much every mistake in the book.

Truthfully, if your company doesn't provide you with a ton of structure and guidance or experienced managers to learn from, you've got a long road ahead of you if you want to become a great manager. Because it is not just about charm and engineering skills, and that is probably how you got here. When you want to hire and retain great engineers, they need someone they can learn from, and what do you do when you don't have the time to be that person? How do you hire the people that can teach and inspire in your place? How do you grow the engineers you have now into the great tech lead you once were? You thought you were good at recruiting when you had someone else sourcing and closing all the candidates, but now that person is you. Oh, and have you ever justified hiring plans? For a team whose code you've never actually written?

Beyond recruiting: You can identify process bottlenecks, but can you identify them when you are not personally impacted by them? You've heard that clear goal setting is the key to strong leadership, but do you know how to do it? No, really, are you prepared to spend your Sunday evening writing quarterly goals, which is what I'm supposed to be doing right now? Have you ever measured engineering efficiency? Do you know what that is? Made a strategic multi-year roadmap? While worrying about preparing a deck for a team meeting where you'll be explaining why your team should care about these goals at all, and writing mid-year reviews to boot?

I've hired and promoted various engineering managers since becoming the head of a growing team. Some have been experienced managers, some have been great tech leads looking to make the jump to becoming managers. They've all had ups and downs, but make no mistake, I've had to spend lots of time creating structures for all of them to be successful. The great tech leads may have an easier time winning over the engineers, but they still need to be taught the basics of structured management, and they won't all be successful or happy in that role. There's no silver bullet to creating a great leadership team except for putting in a structure that makes their responsibilities clear, for their sake and yours.

So, if you're a great tech lead who's moving into management, congrats! Just be aware, the only useful things that experience gives you to take into future leadership are a general sense of best practices and, hopefully, the ability to communicate with other engineers and non-engineers. Engineering management is not just being the tech lead of bigger and bigger teams, and the faster you realize that, the better off your team will be.

Accountability and authoritarians

A not-so-secret aspect of my personality is that I have no problem with the policing aspects of leadership. I have a law and order side to me: if I believe in a rule/practice, and I see people breaking it, I have no problem being the one to call them to task.

This habit actually works out ok in a room full of peers; it doesn't exactly endear me to people but it means that, eg, the build doesn't stay broken for long. Unfortunately as I have gone up in the management chain, it has become a problem. Why does having a peer that holds you day-to-day accountable work, but having a boss that does so fail?

First, it seems that having me hold individuals accountable ends up causing some individuals to stop holding each other accountable. It is viewed as the thing I do, the boss' responsibility. If it's important, the boss will care, and she will come down and call us out. This is obviously far from ideal. A high performing team will hold each other accountable; after all, you want feedback to come as quickly and naturally as possible, and that can only happen when individual team members call each other out.

Second, I have come to discover that it is rather demoralizing for team members to be called out by me. I know, I know, why is this a surprise? It is still hard to get used to the idea that I am not just one of the team, and that everything I do is amplified and taken in ways I don't always anticipate. In fact, I think that it is important that most negative/corrective feedback from me go to my direct reports instead of farther down the chain. That doesn't mean that I can't offer suggestions on architectural improvements or process tweaks, but it is demoralizing for me to ask a developer why the build is broken. In fact, it is important that I'm seen as an inspirational figure to my team, someone they look up to and look forward to interacting with, and not vice-versa.

Finally, I'm actually pretty far away from the impact of the rules these days. What might have made sense when I was in a team writing code may not be ideal for the current team, the current environment. The only true best practice out there is to let your team take a concept or process and iterate on it until it takes on a form that works effectively for them. It simply doesn't make sense for me to make and enforce the rules myself; it is my job to provide the high-level goals (such as "creating a Cinderella experience for our customers and ourselves") and to push the team to find the right ways to implement those goals.

So, what is the takeaway here? I'm getting out of the business of being the rule maker and rule enforcer. Instead, I'm setting goals and very high-level guidelines, and giving the power to create policies and practices to the members of my team. I want to work myself out of a job, after all, and the best ideas don't come from me, so why should the policing?

Revisiting ideas: Promotion from Within

Ages ago a friend of mine wrote a rather seminal post on "promotion from within". It is interesting to go back and read this post through the lens of hiring and leading a team for a few years. It is a great post, but I think that one major point often gets overlooked, and this point causes many companies who follow its advice to fail. That point is this:
2) Using either the few experienced managers you've been able to internally promote or failing that, outside executive coaches, intensely mentor your more inexperienced managers to develop their skills. Typically, because many of your management candidates were less than fully-qualified, they will demonstrate potential but still be unsure in their new roles. Until they are comfortable and practiced in their roles, both they, their peers, and their teams will exist in a state of some distress. 

This is a problem I have seen both at my own company and also observed at others': We promote from within, but provide no mentoring or guidance to those so promoted. Great managers are truly not born. They are made, and usually made through both being sat down and patiently taught the ways to effectively lead projects and people, but also through observing both successes and failures. They are also made by being called to task on their own personal failures, something that many startups are unwilling or unable to do. The cult of personality around founders and early employees can work against the one thing necessary to make "promote from within" successful: some form of external help.

Most of us in the startup world are working amongst people that have very little experience managing. And we've taken these ideas that Yishan so eloquently voiced, that culture is paramount, and elevated them to high status, while forgetting that there is a lot to learn to be a successful manager. I know that I came into this job two years ago thinking that given my natural willingness to be in charge and my strong technical skills I would be a great manager. Haha! Truthfully I'm only now getting to the point where I have an inkling of all the things I don't know, and a large part of that is thanks to a ton of coaching. I would not be able to lead my team successfully without coaching, and even with my own coach, I need a coach to help the managers that report to me.

So, promote from within. But don't cheap out on the process by forgetting that these new managers and leaders need help, need training, need to be held responsible for both the good and the bad that they will inevitably produce in their first months and years as managers. Otherwise you might as well hire experienced external managers, because my hunch is that the payoff is actually equivalent, risking culture vs risking unguided learning.

Please stop threatening me with Moore's Law

For as long as I can remember, Moore's Law has been one of the great tech bogeymen. It's going to end, we fret, and the way we write code is going to have to change dramatically! The sky will fall, and we need to prepare ourselves for these end times!

When you hear the same message for well over ten years, its efficacy starts to fade a bit. You start to wonder, when exactly is Moore's Law going to end any more than it already has? When is this going to happen in a way that actually affects me more than it did five years ago? What even IS right around the corner, if the foretold end of performance hasn't much affected me in ten plus years?

The truth of the matter is this: if you care about Moore's Law, you're probably already writing code that combats it, because you care about performance. The trick of Moore's Law, as we all have been beaten over the head about, is parallelism. But why would I wait for the end of Moore's Law to bite me? The times I have cared about performance, I haven't waited. I've parallelized the crap out of code to move performance sensitive code bases from few fast cores to many slow cores. And as soon as I could, I ripped most of it out in favor of a distributed system that was faster and more scalable. And then that was ripped out in favor of smart streaming from SSD. The circle of tech life takes advantage of the latest hotness as needed to get the job done, and I have no reason to believe that Moore's Law is anything more than a factor in that equation.

These days most of us are concerned about a much more complex interaction of performance issues than simple processor speed. We're network sensitive, IO bound, dealing with crazy amounts of data, or simply trying to deal with a ton of simple things at once. We already have systems built to make IO and network calls asynchronous. We're already processing completely separate work independently. Because we can, because even without the terrible end of Moore's Law we care about performance. There's no need to call up the bogeyman to make your case. He's sitting in the next cube, making sure we parallelized all our outgoing requests, and he'd rather you stopped getting so hysterical on his behalf.
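For anyone who hasn't seen what "parallelizing outgoing requests" looks like in practice, here is a minimal sketch in Python using the standard asyncio library. It is an illustration only, not code from any system I've worked on: the service names, the latencies, and the fake_request helper are all made up, and asyncio.sleep stands in for a real network call.

```python
import asyncio
import time

# Hypothetical stand-in for an outgoing network call; a real service
# would use an HTTP or RPC client here instead of asyncio.sleep.
async def fake_request(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # simulates waiting on the network
    return f"{name}: ok"

async def fetch_sequentially(calls):
    # Each call waits for the previous one to finish, so latencies add up.
    return [await fake_request(name, latency) for name, latency in calls]

async def fetch_in_parallel(calls):
    # All calls are in flight at once, so the total time is roughly
    # that of the slowest single call.
    return await asyncio.gather(
        *(fake_request(name, latency) for name, latency in calls)
    )

def timed(coro_fn, calls):
    # Run one of the coroutines above and report (results, elapsed seconds).
    start = time.perf_counter()
    results = asyncio.run(coro_fn(calls))
    return list(results), time.perf_counter() - start

# Made-up downstream services and latencies, for illustration only.
CALLS = [("inventory", 0.05), ("pricing", 0.05), ("reviews", 0.05)]

if __name__ == "__main__":
    _, seq_time = timed(fetch_sequentially, CALLS)
    _, par_time = timed(fetch_in_parallel, CALLS)
    print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both versions return the same results in the same order; the only difference is how long you wait for them. None of this depends on how fast a single core is, which is rather the point.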

Getting From Here to There

As part of my work to make myself a better leader, I'm reading the book "What Got You Here Won't Get You There." The phrase itself resonates with me strongly now as I'm in reviews season and writing reviews for a number of senior engineers and engineering managers. One of the standard refrains I see in self-review "areas for improvement" is the desire to improve something very technology specific. A manager might say "I want to jump in more to the code so I can help take problems off my team." A senior engineer looking to grow to the staff engineer level might say something like "I want to get better at understanding distributed systems." Heck, even I am tempted to put "I want to spend more time in the early phases of architecture design so that I can help improve our overall technology stack." We're all falling victim to the problem of "what got us here won't get us there."

Engineers should spend their first several years on the job getting better at technology. I consider that a given and don't love to see that goal in reviews even for junior engineers, unless the technology named is very specific. Of course you're getting better at technology, programming, engineering, that's your job. You are a junior individual contributor, and your contribution is additive. You take tasks off of a team's list, get them done, create value by doing, and work on getting better at doing more and more. You got here because you put this coding time in.

After a certain point, it is more important to focus on what will make you a multiplier on the team. Very very few engineers write code of such volume and complexity that simply by writing code they enhance the entire organization. For most of us, even those in individual contributor roles, the value comes through our work across the team, teaching junior engineers, improving processes, working on the architecture and strategy so that we simply don't write as much code to begin with. There is a certain level of technical expertise that is necessary to get to this multiplier stage. As an individual contributor, a lot of that expertise is in knowing what you don't know. What do you need to research to make this project successful? You don't have to be a distributed systems expert but you should know when you're wading into CAP theorem territory.

It's harder for managers. Every time you switch jobs, you're interviewed with an eye towards the question, "can this person write code?" We don't trust managers that can't code, we worry that they're paper-pushers, out of touch. But when you hire an engineering manager, you often don't want them writing code. Managers that stubbornly hold onto the idea that they must write a lot of code are often either overworked, bottlenecks, or both. The further up you go in the management chain, the less time you will have to write code in your day job. The lesson here is not "managers should carve out lots of time to write code". It is, instead, don't get pushed into management until you've spent enough time coding that it is second nature. If you think that writing more code is the unlock for you to manage your team better, to grow as a better leader, you're going backwards. The way forward into true team leadership is not through writing more code and doing the team's scutwork programming. It is through taking a step back, observing what is working, what is not working, and helping your team fix it from a macro level. You're not going to code your team out of crunch mode by yourself, so spend your time preventing them from getting into crunch mode in the first place.

Coding is how you got here, engineering is what we do. But growing to levels of leadership requires more than just engineering. You've gotta go beyond what got you where you are, and get out of your comfort zone. Work on what makes you a multiplier, and you'll get there.

SOA and team structure

This week I sat on a panel to discuss my experiences moving to a service-oriented architecture (SOA) at Rent the Runway. Many of the topics were pretty standard: when to do such a thing, best practices, gotchas. Towards the end there was an interesting question. It was phrased roughly like this:
What do you do when you need to create a new feature, and it crosses all sorts of different services? How do you wrangle all the different teams so that you can easily create new features?
I think this is a great question, and illustrates one of the common misperceptions, and common failure modes, of going to a service-oriented architecture.

Let me be clear: SOA is not designed to separate your developers from each other. In my team, developers may work across many different services in order to accomplish their tasks, and we try to make it clear that the systems are "internal open source". You may not be the expert in that system, but when a feature is needed, it's expected that the team creating the feature will roll up their sleeves and get coding.

At big companies SOA is sometimes done in order to create areas of ownership and development. At my previous company, SOA was (among other things) a better way to expose data in an ad-hoc, but monitored, manner, without having to send messages or allow access to databases. The teams owning the services may be in a totally different area of the company from their clients. To access new data, you needed to coordinate with the owning team to expose new endpoints. It created overhead, but data ownership and the quality of the services themselves were important standards, and losing velocity to maintain these standards was an acceptable tradeoff.

That is not the way a small startup should approach SOA, but if you don't anticipate this it may become an unintended outcome. When your SOA is architected as mine is, with a different language powering the services than that which powers the client experience (Java on the backend, Ruby on the frontend), you start to segregate your team into backend and frontend developers. We address this separation by working in business-focused teams that have both backend and frontend developers. Other SOA-based companies approach the challenge of separation by creating microservices that only do enough to support a single feature, so that a new feature doesn't always require touching existing functionality. Some companies do SOA with the same framework (say, Ruby on Rails) that powers their user-facing code, so there's no language or framework barrier to overcome when crossing service boundaries.

SOA is a powerful model for creating scalable software, but many developers are reluctant to adopt it because of this separation myth. There are many ways to approach SOA that work around this downside. Acknowledging the risk here and architecting your teams with an eye to avoiding it is important in successfully adopting this model in an agile environment.

2013: The Constant Introspection of Management

I've been a serious manager for a little over a year now, and it has been my biggest challenge of 2013. Previously I had managed small teams but looking back, while I thought that would prepare me to lead multiple teams and manage managers myself, in reality nothing could be further from the truth.

The hardest part for me of going from individual contributor/architect/tech lead to managing in anger has been the lack of certainty. I believe that, for experienced managers, management can have a level of rigor and certainty, but I'm not there yet. I am striving to be a compassionate manager and that requires developing a level of emotional intelligence that I have never before needed. And so as a result I would have to say that the past year has probably been one of the most emotionally draining of my life (and that is not even counting the baby I had in the middle of it).

I do not want to be a dispassionate leader who views people like pawns on a chessboard. But the emotional resilience that is required for management makes me understand how folks with that tendency may find it easier. A successful manager needs to care about her people without taking the things they do personally. Every person who quits feels like an indictment of all the ways you failed them. If only I had given better projects, fought harder for their salary, coached them better, done more to make them successful! If you wonder why it seems sometimes that management roles get taken up by sociopaths, just think that for every interaction you have with a difficult coworker, your manager has probably had to deal with ten of them. It's not a surprise that a certain bloody-mindedness develops, or, more likely, survives.

In addition to the general emotional angst of all those people, there's the general feeling of utter incompetence. As an engineer, I know how to design successful systems. I can look back on a career of successful projects, and I know many of the best practices for building systems and writing code. Right now, if I were to design a system that ultimately failed for a technical reason, I would be able to pinpoint where the mistakes were made. I am a beginner all over again when it comes to the big-league management game, and it's discouraging. I miss doing what I'm good at, building systems, and I'm afraid that I've given it up for something I may never do particularly well.

One of my friends who has faced the same struggle put it best. 
I like the autonomy/mastery/purpose model of drive. This feels like an issue with mastery. Not building means moving away from something where you have mastery to something new. There’s fear of losing mastery.[1]
As a new manager I believe you lose both autonomy and mastery for a time, and arguably autonomy is lost forever. You are always only as good as your team, and while some decisions may ultimately rest on your shoulders, when you choose to take the "servant leadership" path you do sacrifice a great deal of autonomy. But I think for many engineers the loss of mastery hits hardest. When you've spent ten plus years getting really, really good at designing and developing systems, and you leave that to think about people all day? It's hard, and no, there isn't always time for side projects to fill the gap. In an industry that doesn't always respect the skills of management, this is a tough pill to swallow. After all, I can become the greatest manager in the world, but if I wanted to work in that role at Google they would still give me highly technical interviews.

So why do it? In the end, it has to be about a sense of purpose. I want to have a bigger impact than I will ever be able to have as an architect or developer. I know that leading teams and setting business direction is the way to ultimately scratch the itch I have for big impact, for really making a lasting difference. And I know that a great manager can have a positive impact on many, many people. So here's to growing some management mastery, and making 2014 a year of purpose and impact.

Replatforming? The Proof is in the Hackday

It's pretty common for teams, especially startups, to get to a point with their tech stack where the platform they've been working on just isn't cutting it anymore. Maybe it won't scale, maybe it won't scale to the increased development staff, maybe it was all written so poorly you want to burn it to the ground and start fresh. And so the team takes on the heroic effort we know and love: replatforming.

When you're replatforming because the current system can't handle the necessary load, it's pretty easy to see if your effort was successful. Load test, or simply let your current traffic hit it and watch it hum along where you once were serving 10% failures. But if the replatforming is done to help development scaling, how do you know the effort was a success?

I accidentally discovered one answer to this question today. You see, Rent the Runway has been replatforming our systems for almost the last two years. We've moved from a massive, and massively complex Drupal system to a set of Java services fronted by a Ruby thin client. Part of the reason for this was load, although we arguably could've made Drupal handle the load better by modernizing certain aspects of our usage. But a major reason for the replatforming was that we simply weren't able to develop code quickly and cleanly, and the more developers we added the worse this got. We wanted to burn the whole thing to the ground and start fresh.

We didn't do this, of course. We were running a successful business on that old hideous platform. So we started, piece by piece, to hollow out the old Drupal and make it call Java services. Then, with the launch of Our Runway, we began to create our new client layer in Ruby using Sinatra. Soon we moved major pages of our site to be served by Ruby with Java on the backend. Finally, in early July, we moved our entire checkout logic into Java, at which point Drupal is serving only a handful of pages and very little business logic. 

So yesterday we had a hackday, our first since the replatforming. We do periodic hackdays although we rarely push the entire team to participate, and yesterday's hackday was one of the rarer full-team hackdays. Twenty-some tech team members and four analytics engineers participated, with demos this morning. I was blown away by what was accomplished. One project by a team of four enabled people to create virtual events based on hashtags and rent items to those events, pulling in data from other social media sources and providing incentives to the attendees to rent by giving them credits or other goodies as more people rented with that hashtag. This touched everything from reservations to checkout. We had several projects done by solo developers that fixed major nasty outstanding issues with our customer service apps, and some very nice data visualizations for both our funnel and our warehouse. All told we had over ten projects that I would like to see in production, whether added to existing features, as new features, or simply in the data dashboard displayed in our office.

Compare this to the last full-team hack day and the differences are striking. That hack day had very few fully functional projects. Many were simulations of what we could do if we could access certain data or actions. Most didn't work, and only a handful were truly something we could productionize. The team didn't work any less hard, nor was it any less smart than it is now. The major difference between that hack day and now is that we've replatformed our system. Now adding new pages to the site is simple and doesn't require full knowledge of all the legacy Drupal. Getting at data and actions is a matter of understanding our service APIs, or possibly writing a new endpoint. Even our analytics data quality is better. 

So if you're wondering whether your replatforming has really made a difference in your development velocity, try running a hackday and seeing what comes out. You may be surprised what you learn, and you'll almost certainly be impressed with what your team can accomplish when given creative freedom and a platform that doesn't resist every attempt at creativity.

ZooKeeper and the Distributed Operating System

From the draft folder, this sat moldering for a few months. Since I think the topic of "distributed coordination as a library/OS fundamental" has flared up a bit in recent conversations, I present this without further editing.

I'm preparing to give a talk at Ricon East about ZooKeeper, and have been thinking a lot about what to cover in this talk. The conference is focused on distributed systems at a somewhat advanced level, so I'm thinking about topics that expand beyond the basics of "what is ZooKeeper" and "how do you use it." After polling twitter and getting some great feedback I've decided to focus on the question that many architects face: When should I use ZooKeeper, and when is it overkill?

This topic is interesting to me in many ways. In my current job as VP of Architecture at Rent the Runway, we do not yet use ZooKeeper. There are things that we could use it for, but in our world most of the distributed computing we do is pure horizontally scalable web services. We're not yet building out complex networks of servers with different roles that need to be centrally configured, managed, or monitored beyond what you can easily do with simple load balancers and nagios. And many of the questions I answer on the ZooKeeper mailing list are those that start with "can ZK do this?" The answer that I prefer to give is almost always "yes, but keep these things in mind before you roll it out for this purpose." So that is what I want to dig into more in my talk.

I've been digging into a lot of the details of ZAB, Paxos, and distributed coordination in general as part of the talk prep, and hit on an interesting thought: What is the role of ZooKeeper in the world of distributed computing? You can see a very clear breakdown right now in major distributed systems out there. There are those that are full platforms for certain types of distributed computing: the Hadoop ecosystem, Storm, Solr, Kafka, that all use ZooKeeper as a service to provide key points of correctness and coordination that must have higher transactional guarantees than these systems want to build intrinsically into their own key logic. Then there are the systems, mostly distributed databases, that implement their own coordination logic: MongoDB, Riak, Cassandra, to name a few. This coordination logic often makes different compromises than a true independent Paxos/ZAB implementation would make; for an interesting conversation check out a Cassandra ticket on the topic.

In thinking about why you would want to use a standard service-type system vs implementing your own internal logic, it reminds me very much of the difference between modern SQL databases and the rest of the application world. The best RDBMSs are highly tuned beasts. They cut out the middleman as much as possible, taking over functionality from the OS and filesystem as it suits them to get absolutely the best performance for their workload. This makes sense. The competitive edge to the product they are selling is its performance under a very well-defined standard of operation (SQL with ACID guarantees), as well as ease of operation. And in the new world of distributed databases, owning exactly the logic for distributed coordination (and understanding where that logic falls apart in the specific use cases for that system) will very likely be a competitive edge for a distributed database looking to gain a larger customer base. After all, installing and administering one type of thing (the database itself) is by definition simpler than installing and administering 2 things (the database plus something like ZooKeeper). It makes sense to prefer to burn your own developer dollars to engineer around the edge cases, so as to make a simpler product for your customers.

But ignoring the highly tuned commercial case of distributed databases, I think that ZooKeeper, or a service like it, is a necessary core component of the "operating system" for distributed computing. It does not make sense for most systems to implement their own distributed coordination, any more than it makes sense to implement your own file system to run your RESTful web app. Remember, to do distributed coordination successfully requires more than just, say, a client library that perfectly implements Paxos. Even with such a library, you would need to design your application up-front to think about high availability. You need to deploy it from the beginning with enough servers to make a sane quorum. You need to think about how the rest of the functioning of your application (say, garbage collection, startup/shutdown conditions, misbehavior) will affect the functioning of your coordination layer. And for most of us, it doesn't make sense to do that up-front. Even the developers at Google didn't always think in such terms; the original Chubby paper from 2006 mentions most of these reasons as driving the decision to create a service rather than a client library.

Love it or hate it, ZooKeeper or a service like it is probably going to be a core component of most complex distributed system deployments for the foreseeable future. Which is all the more reason to get involved and help us make it better.

Branching Is Easy. So? Git-flow Is Not Agile.

I've had roughly the same conversation four times now. It starts with the question of our deployment/development strategy, and some way in which it could be tweaked. Inevitably, someone will bring up the well-known git branching model blog post. They ask, why not use this git-flow workflow? It's very well laid out, and relatively easy to understand. Git makes branching easy, after all. The original blog post in fact contends that because branching and merging are extremely cheap and simple, they should be embraced.
As a consequence of its simplicity and repetitive nature, branching and merging are no longer something to be afraid of. Version control tools are supposed to assist in branching/merging more than anything else.
But here's the thing: There are reasons beyond tool support that would lead one to want to encourage or discourage branching and merging, and mere tool support is not reason enough to embrace a branch-driven workflow.

Let's take a moment to remember the history of git. It was developed by Linus Torvalds for use on the Linux project. He wanted something that was very fast to apply patches, and supported the kind of distributed workflow that you really need if you are supporting a huge distributed team. And he made something very, very, very fast, great for branching and distributed work, and difficult to corrupt.

As a result git has many virtues that align perfectly with the needs of a large distributed team. Such a team has potentially long cycles between an idea being discussed, being developed, being reviewed, and being adopted. Easy and fast branching means that I can go off and work on my feature for a few weeks, pulling from master all the while, without having a huge headache when it comes to finally merge that branch back into the core code base. In my work in ZooKeeper, I often wish I had bothered to keep a git-svn sync going because reviewing patches is tedious and slow in svn. Git was made to solve my version control problems as an open source software provider.

But at my day job, things are different. I use git because a) git is FAST and b) Github. Fast makes so much of a difference that I'm willing to use a tool with a tortured command line syntax and some inherent complexity. Github just makes my life easier, I like the interface, and even through production outages I still enjoy using it. But branching is another story. My team is not a distributed team. We all sit in the same office, working on shared repositories. If you need a code review you can tap the shoulder of the person next to you and get one in 5 minutes. We release frequently; I'm trying to move us into a continuous delivery model that may eventually become continuous deployment if we can get the automation in place. And it is for all of these reasons that I do not want to encourage branching or have it as a major part of my workflow.

Feature branching can cause a lot of problems. A developer working on a branch is working alone. They might be frequently pulling in from master, but if everyone is working on their own feature branch, merge conflicts can still hit hard. Maybe they have set things up so that an automated build will still run through every push they make to that branch, but it's just as likely that tests are only being run locally and the minute this goes into master you'll see random failures due to the various gremlins of all software development. Worst of all, it's easy for them to work in the dark, shielded from the eyes of other developers. The burden of doing the right thing is entirely on the developer and good developers are lazy (or busy, or both). It's too easy to let things go for too long without code review, without integration, and without detecting small problems. From a workflow perspective, I want something that makes small problems come to light very early and obviously to the whole team, enabling inherent communication. Branching doesn't fit this bill.

Feature branching also encourages thinking about code and features as all or none. That makes sense when you are delivering a packaged, versioned product that others will have to download and install (say, Linux, or ZooKeeper, or maybe your iOS app). But if you are deploying code to a website, there is no need to think of the code in this binary way. It's reasonable to release code behind feature flags that is not complete but flagged off, for purposes of keeping the integration of that new code in for testing in other environments. Learning how to write code in such a way as to be chunkable, flaggable, and almost always safe to go into production is a necessary skill set for frequent releases of any sort, and it's essential if you ever want to reach continuous deployment.
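As a minimal sketch of what "flagged off" can look like in practice, here is one way to gate an unfinished code path. The flag store, the flag name new_checkout, and the two checkout functions are all hypothetical names invented for illustration; a real system would likely read flags from configuration rather than a module-level dict.

```python
# Hypothetical flag store; in practice this might come from config or a database.
FLAGS = {"new_checkout": False}


def is_enabled(flag, flags=FLAGS):
    """Return True only if the flag is known and switched on."""
    return flags.get(flag, False)


def legacy_checkout(cart):
    return {"flow": "legacy", "total": sum(cart)}


def new_checkout(cart):
    # Unfinished code path: merged and deployed, but dark until the flag flips.
    return {"flow": "new", "total": sum(cart)}


def checkout(cart, flags=FLAGS):
    # The flag check is the only place that knows both paths exist.
    if is_enabled("new_checkout", flags):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

With the flag off, the new code ships to every environment but production traffic still takes the legacy path; passing a flag set with new_checkout turned on (for example in a test environment) exercises the new path without a separate branch.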

Release branching may still be a necessary part of your workflow, as it is in some of our systems, but even the release branching parts of the git-flow process seem a bit overly complex. I don't see the point in having a develop branch, nor do I see why you would care about keeping master pristine, since you can tag the points in the master timeline where you cut the release branch. (As an aside, the fact that the original post refers to "nightly builds" as the purpose of the develop branch should raise the eyebrows of anyone doing continuous integration.) If you're not doing full continuous deployment you need to have some sort of branch that indicates where you cut the code for testing and release, and hotfixes may need to go into up to two places, that release branch and master, but git-flow doesn't solve the problem of pushing fixes to multiple places. So why not just have master and release branches? You can keep your release branches around for as long as you need them to get live fixes out, and even longer for historical records if you so desire.

Git is great for branching. So what? Just because a tool offers a feature, and does it well, does not mean that feature is actually important for your team. Building a whole workflow around a feature just because you can is rarely a good idea. Use the workflow that your team needs, don't cargo cult an important element of your development process.

Make it Easy

One of my overriding principles is: make it easy for people to do the right thing.

This seems like it should be a no-brainer, but it was not always obvious to me. Early in my career I was a bit of a self-appointed build cop. The team I worked on was an adopter of some of the agile/extreme programming principles, and the result of that was a 40+ person team all working against the same code base, which was deployed weekly for 3 distinct business purposes. All development was done against trunk, using feature flags. We managed to do this through the heavy use of automated unit/integration testing; to check code in, you were expected to write tests of course, and to run the entire test suite to successful completion before checking in.

Unsurprisingly, people did this only to a certain level of compliance. It drove me crazy when people broke the build, especially in a way that indicated they had not bothered to run tests before they checked in. So I became the person that would nag them about it, call them out for breaking things, and generally intimidate my way into good behavior. Needless to say, that only worked so well. People were not malicious, but the tests took a LONG time to run (upwards of 4 hours at the worst), and on the older desktops you couldn't even get much work done while the test suite ran. In the 4 hours that someone was running tests another person might have checked in a conflicting change that caused errors; was the first person really supposed to re-merge and run tests for another 4 hours to make sure things were clean? It was an unsustainable situation. All my intimidation and bullying wasn't going to cause perfect compliance.

Even ignoring people breaking the build, this was an issue we needed to tackle. And so we did, taking several months to improve the overall runtime and make things easier. We teased out test suites into specific ones for the distinct business purposes combined with a core test suite. We made it so that developers could run the build on distributed hardware from their local machine. We figured out how to run certain tests in parallel, and moved database-dependent tests into in-memory databases. The test run time went way down, and even better, folks could kick off the tests remotely and continue to work on their machine, so there was much less reason to try and sneak in an untested change. And lo and behold, compliance went way up. All of a sudden my build cop duties were rarely required, and the whole team was more likely to take on that job rather than leaving it to me.

Make it easy goes up and down the stack, far beyond process improvements. I occasionally find myself at odds with folks that see the purity of implementing certain standards and ignore the fact that those standards, taken to extreme, make it harder for people to do the right thing. One example is REST standards. You can use the http verbs to modify the meanings of your endpoints and make them do different things, and from a computer-brain perspective, this is totally reasonable. But this can be very bad when you must add the human brain perspective to the mix. Recently an engineer proposed that we change some endpoints from being called /sysinfo (which would return OK or DEAD depending on whether a service was accepting requests), and /drain (which would switch the /sysinfo endpoint to always return DEAD), into one endpoint. That endpoint would be /sys/drain. When called with GET, it would return OK or DEAD. When called with PUT, it would act as the old drain.

To me, this is a great example of making something hard. I don't see the http verb, I see the name of the endpoint, and I see the potential for human error. If I'm looking for the status-giving endpoint, I would never guess that it would be the one called "drain", and I would certainly not risk trying to call it to find out. Even knowing what it does, I can see myself accidentally calling the endpoint with GET, and now I haven't drained my service before restarting it. Or I accidentally call it with PUT, and now it's been taken out of the load balancer. To a computer brain, GET and PUT are very different, and hard to screw up, but when I'm typing a curl or using postman to call an endpoint, it's very easy for me as a human to make a mistake. In this case, we're not making it easy for people using the endpoints to do the right thing, we're making it easy for them to be confused, or worse, to act in error. And to what benefit? REST purity? Any quest for purity that ignores human readability does so at its peril.
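To make the contrast concrete, here is a rough sketch of the two-endpoint design, written as plain Python with no web framework so that it runs as-is. The ServiceHealth class and build_routes helper are hypothetical names; only the /sysinfo and /drain paths and the OK/DEAD responses come from the description above.

```python
class ServiceHealth:
    """Tracks whether this service instance should receive traffic."""

    def __init__(self):
        self.draining = False

    def sysinfo(self):
        # Read-only status check: safe to call at any time.
        return "DEAD" if self.draining else "OK"

    def drain(self):
        # State-changing action: after this, sysinfo always reports DEAD.
        self.draining = True
        return self.sysinfo()


def build_routes(health):
    # One name per intent, so the path alone tells a human what will happen.
    return {
        "/sysinfo": health.sysinfo,
        "/drain": health.drain,
    }
```

Because the status check and the drain action live under different names, a mistyped verb cannot turn a harmless status check into a drain, or the reverse.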

All this doesn't mean I want to give everyone safety scissors. I generally prefer to use frameworks that force me and my team to do more implementation work rather than making it trivially easy. I want to make the "easy" path the one that forces folks to understand the implementation to a certain level of depth, and encourages using only the tools necessary for the job. This makes better developers of my whole team, and makes debugging production problems more science than magic, not to mention the advantage it gives you when designing for scale and general future-proofing.

Many great engineers are tripped up by human nature, when there's really no need to be. Look at your critical human-involving processes and think: am I making it easy for people to do the right thing here? Can I make it even easier? It might take more work up front on your part, or even more verbosity in your code, but it's worth it in the long run.

Building a Global, Highly Available Service Discovery Infrastructure with ZooKeeper

This is the written version of a presentation I made at the ZooKeeper Users Meetup at Strata/Hadoop World in October, 2012 (slides available here). This writeup expects some knowledge of ZooKeeper.

The Problem:
Create a "dynamic discovery" service for a global company. This allows servers to be found by clients until they are shut down, remove their advertisement, or lose their network connectivity, at which point they are automatically de-registered and can no longer be discovered by clients. ZooKeeper ephemeral nodes are used to hold these service advertisements, because they will automatically be removed when the ZooKeeper client that made the node is closed or stops responding.

This service should be available globally, with expected "service advertisers" (servers advertising their availability, aka, writers) able to scale to the thousands, and "service clients" (servers looking for available services, aka, readers) able to scale to the tens of thousands. Both readers and writers may exist in any of three global regions: New York, London, or Asia. Each region has two datacenters with a fat pipe between them, and each region is connected to each other region, but these connections are much slower and less tolerant for piping large quantities of data.

This service should be able to withstand the loss of any one entire data center.

As creators of the infrastructure, we control the client that connects to this service. While this client wraps the ZooKeeper client, it does not have to support all of the ZooKeeper functionality.

Implications and Discoveries:
ZooKeeper requires a majority (n/2 + 1) of servers to be available and able to communicate with each other in order to form a quorum, and thus you cannot split a quorum across two data centers and guarantee that the quorum will be available with the loss of any one data center (because at least one data center will fail to have a pure majority of servers). Therefore, to sustain the loss of a data center, you must split your cluster across 3 data centers.

Write speed dramatically decreases when the quorum must wait for votes to travel over the WAN. We also want to limit the number of heartbeats that must travel across the WAN. This means that a ZooKeeper cluster with nodes spread across the globe is undesirable (due to write speed), and a ZooKeeper cluster with members only in one region is also undesirable (because writing clients outside of that region would have to continue to heartbeat over the WAN). Even if we decided to have a cluster in only one region, we would have to solve the problem that no region has more than 2 data centers, and we need 3 data centers to handle the loss/network partition of an entire data center.

Create 3 regional clusters to support discovery for each region. Each cluster has N-1 nodes split across the 2 local data centers, with the final node in the nearest remote data center.

By splitting the nodes this way, we guarantee that there is always availability if any one data center is lost or partitioned from the rest of the data centers. We also minimize the effects of the WAN on write speed by ensuring that the remote quorum member is never made into the leader node, and the general effect of the majority of nodes being local means that voting can complete (thus allowing writes to finish) without waiting for the vote from the WAN node in normal operating conditions.
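The quorum arithmetic behind this layout can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the cluster size of 5 is an assumed example value for N, not a figure from the deployment described here.

```python
def quorum_size(total_nodes):
    """Smallest majority of a cluster: n // 2 + 1."""
    return total_nodes // 2 + 1


def survives_any_single_loss(nodes_per_datacenter):
    """True if a majority remains after losing any one data center."""
    total = sum(nodes_per_datacenter)
    needed = quorum_size(total)
    return all(total - lost >= needed for lost in nodes_per_datacenter)


# Every way of splitting 5 nodes across two data centers fails the check,
# because losing the larger side always leaves fewer than 3 nodes.
two_dc_splits = [(a, 5 - a) for a in range(1, 5)]
two_dc_results = [survives_any_single_loss(split) for split in two_dc_splits]

# N-1 nodes across the two local data centers, one node in a remote one (N = 5).
regional_layout = (2, 2, 1)
regional_ok = survives_any_single_loss(regional_layout)
```

With 5 nodes the quorum is 3, so the 2 + 2 + 1 layout keeps at least 3 nodes reachable whichever single data center disappears, and in normal operation the 4 local nodes can reach a majority without the WAN vote.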

3 Separate Global Clusters, One Global Service:
Having 3 separate global clusters works well for the infrastructural reasons mentioned above, but it has the potential to be a headache for the users of the service. They want to be able to easily advertise their availability and to discover available servers, preferring servers in their local region and falling back to other remote regions only if no local servers are available.

To do this, we wrapped our ZooKeeper client in such a way as to support the following paradigm:
Advertise Locally
Lookup Globally

Operations requiring a continuous connection to ZooKeeper, such as advertise (which writes an ephemeral node) or watch, are only allowed on the local discovery cluster. Using a virtual IP address, we automatically route connections to the discovery service address of the local ZooKeeper cluster and write our ephemeral node advertisement there.

Lookups do not require a continuous connection to the ZooKeeper, and so we can support global lookups. Using the same virtual IP address we can connect to the local cluster to find local servers, and failing that use a deterministic fallback to remote ZooKeeper clusters to discover remote servers. The wrapped ZooKeeper client will automatically close its connection to the remote clusters after a period of client inactivity, so as to limit WAN heartbeat activity.
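The "advertise locally, lookup globally" paradigm can be sketched without a real ZooKeeper connection. Everything below is a hypothetical stand-in: FakeCluster is an in-memory dictionary, not a ZooKeeper client, and the sketch leaves out ephemeral nodes, the virtual IP routing, and the idle-connection timeout described above. It only illustrates the ordering rule: writes go to the local cluster, reads try local first and then fall back through the remotes in a fixed order.

```python
class FakeCluster:
    """In-memory stand-in for one regional ZooKeeper cluster (hypothetical)."""

    def __init__(self, region):
        self.region = region
        self.advertisements = {}  # service name -> list of host strings

    def advertise(self, service, host):
        self.advertisements.setdefault(service, []).append(host)

    def lookup(self, service):
        return list(self.advertisements.get(service, []))


class DiscoveryClient:
    def __init__(self, local, remotes):
        self.local = local
        self.remotes = list(remotes)  # fixed order gives a deterministic fallback

    def advertise(self, service, host):
        # Advertise locally: writes only ever go to the local cluster.
        self.local.advertise(service, host)

    def lookup(self, service):
        # Lookup globally: local first, then each remote in order.
        for cluster in [self.local] + self.remotes:
            hosts = cluster.lookup(service)
            if hosts:
                return cluster.region, hosts
        return None, []
```

A client built with a New York local cluster and London and Asia as remotes would return London hosts only when New York has none for that service, and would never write an advertisement outside New York.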

Lessons learned:
ZooKeeper as a Service (a shared ZooKeeper cluster maintained by a centralized infrastructure team to support many different clients) is a risky proposition. It is easy for a misbehaving client to take down an entire cluster by flooding it with requests or making too many connections. Without a working hard quota enforcement system, clients can also easily push too much data into ZooKeeper. Since ZooKeeper keeps all of its nodes in memory, a client writing huge numbers of nodes with a lot of data in each can cause ZooKeeper to garbage collect or run out of memory, bringing down the entire cluster.

ZooKeeper has a few hard limits. Memory is a well-known limit, but another limit is the number of sockets for a server process (configured via the ulimit in *nix). If a node runs out of sockets due to too many client connections, it will basically cease to function without necessarily crashing. This is not surprising for anyone that has experienced this problem in other Java servers, but it is worth noting when scaling your cluster.
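On *nix systems sockets count against the per-process open file descriptor limit, so one rough way to reason about this is to compare the expected number of client connections with that limit. The sketch below uses Python's standard resource module and inspects the limit of the process it runs in, not a running ZooKeeper server; the function name and the reserved allowance of 256 descriptors are assumptions for illustration.

```python
import resource


def socket_headroom(expected_connections, reserved=256):
    """Compare expected client connections with this process's descriptor limit.

    `reserved` is a guessed allowance for files, logs, and peer connections.
    Returns the number of descriptors left over (negative means too few),
    or None when no limit is configured.
    """
    soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None  # no limit configured, so nothing to compare against
    return soft_limit - reserved - expected_connections
```

A negative result suggests the limit should be raised, or connections spread across more servers, before the node reaches the point where it stops functioning.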

Folks using ZooKeeper to build this sort of dynamic discovery platform should note that if the services you are advertising are Java services, a long full GC pause can cause their session to the ZooKeeper cluster to time out, and thus their advertisement will be deleted. This is generally a good thing, because a server that is doing a long-running full GC won't respond to client requests to connect, but it can be surprising if you are not expecting it.

Finally, I often get the question of how to set the heartbeats, timeouts, etc, to optimize a ZooKeeper cluster, and the answer is really that it depends on your network. I really recommend playing with Patrick Hunt's zk-smoketest in your data centers to figure out sensible limits for your cluster.

On Fit and Emotional Problem Solving

One of the biggest challenges Rent the Runway has is the challenge of getting women comfortable with the idea of renting. That means a lot of things. There are questions of timing, questions of quality. But the biggest question by far is the question of fit. Our business model, if you are unfamiliar, is that you order a dress typically for a 4 day rental period, which means that the dress comes very close to the date of your event, possibly even the day of that event. If it does not fit, or you don't like the way it looks on you, you may not have time to get something else for the occasion. As a woman, this uncertainty can be terrifying. Getting an unfamiliar item of clothing, even in 2 sizes, right before an event important enough to merit wearing something fancy and new is enough to rattle the nerves of even the least fussy women out there. This keeps many women from trying us at all, and presents a major business obstacle.

Given this obstacle, how would you proceed? When I describe my job to fellow (usually male) engineers, and give them this problem in particular, their first instinct is always to jump to a "fit algorithm". I've heard many different takes on how to do 3D modeling, take measurements, use computer vision techniques on photographs in order to perfect an algorithm that will tell you what fits and what doesn't.

Sites have been trying to create "fit algorithms" and virtual fit models for years now, and none has really gained much traction. Check this blog post from 2011, about that year being the year of the "Virtual Fit Assistant". Have you heard of these companies? Maybe, but have you or anyone you know actually USED them?

I would guess that the answer is no. I know that for myself, I find the virtual fit model incredibly off-putting. I trust the fit even less seeing it stretched over that smooth polygon sim that is supposed to be like me. Where are the lumps going to be? Is it really going to fit across my broad shoulders? The current state of 3D technology looks ugly and fake and I'm more likely to gamble on ordering something from a site with nothing but a few measurements or a model picture than one where I can make this fake demo. The demo doesn't sell me, and worse, it undermines my fit confidence, because it doesn't look enough like me or any real person and it makes me wonder how those failures in capturing detail will translate into failures in recommending fit.

I've come to realize in my time at this job that what engineers often forget when faced with a problem is the emotional element of that problem. Fit seems like an algorithmic problem, but for many women, there is a huge emotional component to trying things on. The feel of the fabric. The thrill of something that fits perfectly. The considerations and adjustments for things that don't. Turning fit into a cheesy 3D model strips all emotion from the experience, and puts it into the uncanny valley of not-quite-realness. I do think that someday technology will be able to get through the valley and provide beautiful, aspirational 3D models with which to try on clothes, but we aren't there yet. So what can we do?

At Rent the Runway, we've discovered through data that when you can't try something on, photos of real women in a dress are the next best thing. Don't forget that the human brain is still much more powerful than computers at visual tasks, and it is much easier for us to imagine ourselves in an item of clothing when we see it on many other women. This also triggers the emotional response much more than a computer-generated image. Real women rent our dresses for major, fun events. They are usually smiling, posing with friends or significant others, looking happy and radiant, and that emotion rubs off on the viewer. It's not the same as trying something on in a dressing room, but it is like seeing a dress on your girlfriend and predicting that the same thing would look fabulous on you.

This insight led us to launch a major new subsite for Rent the Runway called Our Runway. This is a view of our inventory that allows women to shop by photos of other women wearing our dresses. It is driven by data but the selling point is emotional interaction. Learning to use emotional reasoning was a revelation to me, and it might be the most valuable engineering insight I've picked up in the last year.

Get Better Faster

I heard a very interesting piece of advice this week from my CEO, addressing a group of college students that were visiting our office. Her words went something like this:
"Most days you have 100 things on your to-do list. Most people, when faced with such a list, will find the 97 that are obvious and easy and knock them off before worrying about the 3 big hairy things on the list. Don't do that. The 97 aren't worth your time, it's the 3 big hairy things that matter."
I've been thinking about that bit of wisdom ever since I heard it. It seems counter-intuitive in a way. Anyone that has ever suffered from procrastination knows that sometimes you feel better, more able to tackle problems when you break them down into a todo list. You get little things done and make yourself feel accomplished. But the more I think about the advice from my CEO, the more I agree with it. Especially in an entrepreneurial setting, or in a setting where you are suddenly given far more responsibilities than you are used to having. Why? It all boils down to three little words: Get Better Faster.
Get better faster. That's what I've spent my last year trying to do. I went to a startup to grow, to stretch in ways that I couldn't stretch in the confines of a big company. And when I suddenly found myself running the whole engineering team, this learning doubled its speed overnight. Being an ok manager and a great engineer is no longer enough for me to do my job. I need to be an excellent manager, an inspirational leader, a great strategist and a savvy planner. And the engineer can't totally slack off, but she needs to be saved for the really nasty bugs, not implementing fun new features.
This has all taught me a difficult lesson: you get better faster by tackling the hardest problems you have and ignoring the rest. Delegate the things that are easy for you (read: the little things you do to feel good about your own productivity) to someone who still needs to learn those skills. Immerse yourself in your stretch areas. For me, this mostly means that I have to delegate coding and design details to the engineers working for me, and I have to delegate the ownership of my beloved projects and systems to someone with the time to care for them. This is PAINFUL. I would call the last 3 months, spent mostly out of coding and in planning/management/recruiting land, some of the hardest of my career.
And yet, I'm doing it. I'm getting better. And it's not just me who is getting better. It's every member of my team that has had to step up, to fill in the empty positions of leadership, to take over the work I can't do, or the work that the person who took work from me can't do.
You'll never get better doing the easy stuff, checking off the small tasks. A savvy entrepreneur knows that the easy stuff can always be done by someone else, so let someone else do it. The hard problems are the problems that matter. 

Becoming the Boss

One of the reasons people go to work for startups is that sense that anything could change at any time. You could go big, you could go bust, you could pivot into a completely new area. About a month ago, I got my first taste of this when, following my boss's departure from the company, I found myself in the role of head of engineering. And what a change this has been.

Call it the Dunning-Kruger effect, or simply call it arrogance, but I think if you had asked me before I was put into this position whether I could do the job well, I would have told you certainly yes. Did I want the job? Not really. But I could totally do it if I had to. Sure, I've never had full responsibility for such a large organization before, but I'm a decent manager, I have leadership skills, and I know my technical shit. That should be enough.

Here is what I have learned in the last month. Leading 6 people through successful completion of their tasks, technical guidance, and the occasional interpersonal issue is nothing like being responsible for 20 people delivering quality releases, keeping their morale up, knowing when things are going wrong in the technical, interpersonal or career sense, and having to additionally report everything to your CEO and heads of business. When there is no buffer in your department above you that people can go to when your guidance is lacking, the weight of that responsibility is 10 times what you ever expected. A sudden transition of leadership even in a solid organization such as ours stirs up long-simmering conflicts. I'm down one pair of ears to listen and mediate.

And then there's recruiting. Helping with recruiting, giving good interviews, and saying good things about the company is nothing like owning the sell process from the moment of first technical contact with the candidate over the phone or at coffee, through onsite interviews, and into a selling stage. Good candidates need to be coaxed, guided, and encouraged often several times before they even get in the door. And one bad interview with you, the head of the department, can sully the name of the whole organization to a person even if you didn't want to hire them. I know this, but that didn't stop me from conducting a terrible phone screen a few days ago where my stress and impatience showed through as rudeness to the candidate. I thought I knew how to recruit but one bad interview and I'm in my CEO's office for some clearly-needed coaching.

I have known for a long time that even in the lesser leadership roles I've held in the past, the things I say and do echo much larger than I expect them to. But that was nothing compared to the echoes from being the person in charge. My stress causes ripples of stress throughout the staff. When I speak harshly to people over technical matters, it is yelling even if I don't intend it to be. One snide comment about a decision or a design invites others to sneer at that decision along with me.

The best advice I've gotten in the past month has been from my mother, who told me simply to smile more. My echo can be turned into echoes of ease and pride and even silliness and fun if I remember to look at the positives as much as the negatives. When I remember to smile, even if I'm unhappy about a decision, I find myself able to discuss that decision without inviting judgement upon the person that made it. When I smile through a phone call with a potential recruit, I sell the company better. When I smile through my 1-1s people feel that they can raise concerns without worrying that I will yell at them. When I smile, I see people step up and they take on bigger responsibilities than they've ever had, and knock them out of the park over and over again, which makes me smile even more. A smile is the thing that keeps me tackling this steep learning curve of leadership. So I try to smile, and every week I learn more than I've learned in a month at this job or a year at my previous company. Because change is scary and hard, but in the long run, it's good.

The Science of Development

Despite the study of computing being termed "Computer Science", most working software developers will agree that development is more art than science. In most application code, the hard decisions are not which algorithm to use or which data structure makes the most sense. Rather, they tend to focus on how to build out a big system in a way that will let lots of people contribute to it, how to build a system such that when you go back to it you can understand what's going on, how to make things that will last longer than 6 months. While occasionally those decisions have a "right" answer, more often they are the type of judgement calls that fall into the art of computer programming.

There are, however, two very important areas in which science, and by science I mean careful observation and logical deduction, plays a very important role. These two areas are debugging and performance analysis. And where developers are often good at the art of programming, many of us fall short on the science.

I've been meaning to write a blog post about what makes a person a truly great debugger. This will not be that post, but in thinking about the topic I polled some of the people I believe are great debuggers including my longtime mentor. His response was interesting and boils down the concepts very succinctly. Good debuggers have a scientific bent. Some people, when faced with a bug, will attempt to fix it via trial and error. If changing the order of these statements makes the threading problem go away, the bug must be solved by that change. Good debuggers, on the other hand, know that there is a root cause for errors that can be tracked down, and they are able to break down the problem in a methodical way, checking assumptions and observing behavior through repeated experiments and analysis.

It's a wise point, and one that I must agree with. If I observe myself debugging hard problems, a few things come out. One, I always believe that with enough patience and help I can in fact find the root of the problem, and I pretty much always do. Two, when I dig into a tough problem, I start keeping notes that resemble a slightly disorganized lab notebook. In particular when debugging concurrency problems between distributed systems, to get anywhere sensible you have to observe the behavior of the systems in correct state, and in problem state, and use the side-effects of those correct and incorrect states (usually log files) to build hypotheses about the root cause of the problem.

I've also been thinking lately about this scientific requirement in performance analysis and tuning. We're going through a major performance improvement exercise right now, and one of the things that has caused us challenges is a lack of reproducible results. For example, one of the first changes we made was to improve our CSS by converting it to SCSS. We were relying on New Relic, our site metrics monitor, to give us an idea of how well we improved our performance. This caused us a few problems. One, our staging environment is so random due to network and load that performance changes rarely show up there. Two, New Relic is not a great tool for evaluating certain kinds of client-side changes. Lacking any sort of reliable evaluation of performance changes in dev/staging caused us to make a guess (SCSS would be better) that took a week plus to pan out, and resulted in inconclusive measurements. We were looking for luck, but we need science, and our lack of scientific approach hampered our progress. Now we have taken a step back and put reproducible measures in place, such as timing a set of well-known smoke tests that run very regularly, and using GTmetrix to get a true sense of client-side timings. Our second performance change, minified JavaScript, has solid numbers behind it and we finally feel certain we're moving in the right direction.
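
To make "reproducible measures" a little more concrete, here is a minimal sketch of the idea of timing the same task repeatedly and comparing medians against the run-to-run noise. It is not our actual tooling: the function names are hypothetical, the task is whatever smoke test you choose to pass in, and the `looks_faster` comparison is a crude heuristic rather than a proper statistical test.

```python
import statistics
import time


def time_runs(task, runs=10):
    """Run `task` several times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Median and spread; a wide spread means a single run tells you little."""
    return {
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


def looks_faster(before, after):
    """Crude check: the new median beats the old one by more than the noise."""
    noise = max(summarize(before)["stdev"], summarize(after)["stdev"])
    return summarize(before)["median"] - summarize(after)["median"] > noise
```

The point of the design is that a change only counts as an improvement when it clears the noise of the environment you measured it in, which is exactly what a random staging environment makes hard to see from one-off observations.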

Is there art in debugging and performance tuning? Well, there is the art of instinct that lets you cut quickly through the noise to the heart of a problem. Well-directed trial and error (and well-placed print statements) can do a lot for both debugging and performance analysis. But getting these instincts takes time, and good instincts start from science and cold, hard, repeatable, observable truths.

Being Right

One of the stereotypes of the computer industry is the person that knows they're right, and doesn't understand when it doesn't matter. I have been this person. I once got into a knock-down-drag-out fight with another engineer over the decision to make an API I had written accept case-sensitive data. He absolutely insisted that it must, even though pretty much any case that would require case-sensitivity would almost certainly mean abusing the system design for nefarious purposes. We argued over email. We argued over the phone. I would be damned if I would give in.

At some point my boss stepped in and told me I had to change it, because this person was one of the most important clients of the API and he didn't want to have strife between our teams. I was furious. I yelled at him. I told him we shouldn't negotiate with terrorists. It didn't make a difference, his decision was final. I made the change, and the other developer proceeded to abuse the system in exactly the ways I had predicted. But he also used it, which was all that mattered to my boss, and in retrospect, was more important for the success of the project than my sense of rightness.

As a manager, I find myself on the other side of the fray, as I negotiate with one of my own developers over doing something "right". The time it takes to do something "right" simply isn't worth the fight I would have to have with product, or analytics, or other members of the tech team. It's not worth the time we would spend debating correctness. (At this point, it's usually ALREADY not worth the time we've spent arguing with each other). It's not worth the testing overhead. It's not going to move the needle on the business.

No matter how much I say "you're right, but it doesn't matter," it doesn't sink in. They don't believe me that I know that they're right, that I agree with their technical analysis but that it's not enough to change my mind. They tell me that they understand that technical decisions are sometimes a series of non-technical tradeoffs, but in this case, why can't I see that this is just the right way to do it?

Here's what I wish my boss had spelled out to me, and what I hope I can explain to my own developers when we run into such conflicts in the future: I know where you are coming from. I have nothing but immense sympathy for the frustration you feel. I know that it seems like a trivial thing, why does that other team feel the need to insist that it would take them too much time to integrate, that they don't want to test it, why does that idiot say that it MUST be case-sensitive? Someday, you will be in my shoes. You will be worn down from fighting for right, and you will be driven to produce the best results with the most consensus and not distract everyone with the overhead of debating every decision. And you'll probably smile and think, well, that dude abused my system but he also drove adoption across the company, and I still resent my boss for not fighting for me, but I understand where she was coming from.