The Monster that Eats Software Companies

There's a tired genre of books called the "Business Fable".

Best known by way of the classic "Who Moved My Cheese?", it's the story of a bizarre fictional society that serves as a transparent analogy for a corporation.

If I ever publish a business fable, it will be a crossover business fable / horror story.

See, one of the most fun developments I've had with adulthood is that I now love scary movies. It's a genre that's rich in symbolism. In a really great monster story, the monster can't just be a monster. It has to tap into some deep-seated fear we share, and use the monster as a metaphor to examine and interpret that fear, and maybe even bring it to a point of understanding, of catharsis.

This would be the story of a strange island, where the inhabitants live in fear of a monster that stalks at night. The islanders first hear the rumor of this monster from a castaway, who tells a story of how it devoured his entire village. At first they laugh it off. But then, just to be safe, they stop leaving the village at night. Then, a couple of eerie incidents later, they decide it's not safe to go out alone during the day either. They leave their village less and less often, and then only in big war parties. Eventually even the war parties become less frequent, for as the fear grows they need to spend more and more time sharpening their spears before they dare to venture out. Eventually all the islanders starve to death locked inside their village, and we're left wondering: Was there ever a monster outside? Or was the fear itself the real monster all along?????

Does this sound silly? It's a common trap for software companies to fall into.

Back when the whole company was fewer than a dozen people, with only one or two developers, things moved fast. They cranked out features in days which would take larger companies months. Sure, they had problems: the UI was ugly, and it would freeze up with no explanation if you tried submitting the form without entering your date of birth. But it didn't matter; things were moving so fast. The developers had an intimate knowledge of every aspect of the software, and if there was ever a bug that really bothered customers, they could have it patched within hours.

It was a useful piece of software, and that usefulness brought growth: Now, there's a whole team of developers, many more features, and more customers who have come to rely on it.

And one day, they release a bug into production. And all hell breaks loose.

Customer support's phones are blowing up, the rollback procedure fails to fix it, and the next 24 hours are a frantic scramble to patch Humpty-Dumpty back together. When the dust settles, an angry executive summons a meeting with the engineers. "Look," he says, with a visible effort to contain his frustration. "I'm not interested in pointing fingers. I just want to know what we can do to ensure this never happens again."

As embarrassing as the bug is for the engineers, it's worse for the executive

He's the one who has to personally apologize to the customers. He's the one who gets a sick feeling in the pit of his stomach as he adds up how much those 24 hours with the whole site out of commission cost the company. And he feels profoundly frustrated with how little control he has over the whole thing. When the engineers try to even explain what went wrong, all that comes out is a bunch of excuses and technical gobbledygook. Best he can tell, it adds up to the programmer's equivalent of locking your keys in your car. A silly mistake; we're sorry; we'll try not to do it again.

There are solid strategies available to reduce the risk of deploying bugs like this. But the executive's first instinct is to seek safety by grasping for more of a sense of control. "We can't just release features willy-nilly like we have in the past," he says. "We need some documented procedures. We need a system of manager signoffs on any feature that goes into production, and an auditable paper trail attached to each. We need a formalized code review strategy and a written test plan, not just the developers kicking the tires a little and saying 'Looks good to me.'"

All of these things ease the manager's anxiety

They give him a greater sense of control, and they sound like responsible things to do. There's a problem, though: most of these strategies are poor at catching bugs. They act like a security blanket, increasing the feeling of safety without increasing actual safety. They can even make the bug situation worse. They function mainly by making it harder to deploy any code at all, buggy or not. There's no longer an avenue to fix non-critical bugs, or to refactor code which is unstable but hasn't broken yet. Even where the fix is easy, the process to deploy the fix is so onerous that it's not worth anybody's time.

The company can survive this. They already have a saleable piece of software, and it doesn't need a ton of new features to remain saleable. It's even possible these new rules will decrease the overall bug count, even if they only do so by decreasing the amount of code that gets deployed at all.

But this approach to releases has the potential to spiral out of control

True disaster starts when, every time the team deploys a bug or suffers another outage, new rules are imposed to prevent it from happening again. After a few cycles of this, the release process can grow to include things like:

  • All code must be reviewed by at least 2 people, and if any bugs slip through their review, those reviewers will be publicly reprimanded
  • Code must be escalated through multiple testing environments, and tested by a different team of testers in each one
  • Every line of code must be covered by unit tests
  • An elaborate set of version control rules (including long-running branches), accompanied by an equally elaborate set of rules about who can merge into what
  • Frequent code freezes, where no new code is allowed to be merged except fixes for identified bugs
  • At each stage, documentation is required: Checklists of exactly which test cases were run, who reviewed what, which items are in which branch of source control
  • and, worst of all: If any bugs are discovered along the way, or if any steps are performed incorrectly, the process must be restarted

The core problem here is that the complexity of the process becomes more than any individual can keep track of. The rate of mistakes goes up, not down, as more rules are added, and more mistakes lead to more rules. Inevitably:

Release schedules slip

People will begin to say things like, "There's no way we can get this through testing in time for the June 1 release. We need to postpone until July." This makes the problem worse, because:

Stiennon's 8th rule[1]: The amount of effort needed to perform a release increases exponentially with the time since the last release.

There is now a logjam of features which have been built to completion, but have to wait at the back of a long queue before they are allowed to be released. Code like this, which is complete but not released, has a strange tendency to rot on the vine. Even the engineers who built it start to forget the details: how it works, which pieces needed more testing, how it integrates with the rest of the code.

The team experiences a brain drain

The best, most marketable engineers are under no obligation to put up with all this nonsense, and they leave for greener pastures. In addition to the normal pain of losing a strong contributor, each departure sets back the release, as that engineer takes with him knowledge which was necessary to get the release out the door.

Blame

All that paper trail generated by the executive's process does little to prevent bugs, but it's very useful for assigning blame when bugs are inevitably discovered. Most line-level employees aren't really incentivized to care whether or not the company is profitable or experiencing tremendous growth. But they will go to extremes to avoid being humiliated in front of their peers or fired. It becomes safer to look busy and do nothing than to do something and risk a reprimand.

The collective actions of the company come to resemble a man with obsessive-compulsive disorder, whose compulsive rituals grow to occupy 100% of his available time. Under the weight of all this process, it's possible for the company to lose the ability to deploy any new software whatsoever.

That sounds awful! But is the only other option to have an unstable product?

There are a bunch of techniques which both minimize bugs AND keep the company nimble. But in reality, they are all different aspects of one idea:

You must have short release cycles

Notice this is the opposite of what the executive in our horror story tried to do. In essence, releasing code made him anxious, and being anxious made him timid. Short release cycles take a surprising amount of courage. There's less paper trail, and less checking and double-checking. It's charging into the danger, rather than following your instinct and running away from it. But nonetheless, they are the answer. They both decrease the number of bugs (a deployment disaster is far less likely with a few small changes made in the last week than with a gigantic release containing months of work), and, more importantly, they make bugs easier to fix.

Stiennon's 9th rule: It's more important to have an easy avenue to fix bugs than it is to release without bugs[2].

But there's another subtle benefit at play. Frequent releases give the team more practice doing releases. It's harder to make a critical mistake in the process of deploying code if it's a simple procedure you did last week, rather than an elaborate procedure which you haven't done in months. As the team gets familiar with doing many small releases, they're able to improve the process a little each time. Each release brings a new opportunity to streamline the process, or to try out a new automation technique.

Automated releases

Way back in the early days, when there was only one developer, the actual steps to release code might have gone something like this: The developer compiled the code on his laptop. Then, he connected to the production server by remote desktop or SSH, and copy-pasted the new build files over the old ones. If he was being extra cautious, he would rename the old build directory to production_old_version rather than deleting it outright. Then, he connected to the database, copy-pasted the schema update scripts from his laptop, and hit the enter key to run them. If any dependencies changed since the last deploy, he made sure to install those on the production server. Finally, he found any config files he needed to change, opened them up, and typed in the new values.

This sounds barbaric, but remember: There are Fortune 500 companies that deploy code this way.

It should come as no surprise that this is error prone. It gets progressively more error prone as the team grows: the engineer performing the release is now deploying code that other people wrote, and he needs to perform it on multiple servers. No matter how thoroughly the code was tested prior to the release, it's still easy for our engineer to introduce catastrophic bugs by making a small mistake while he's doing the equivalent of open-heart surgery on the production servers.

The solution to this is automated deployments. If the deployment is performed by a script, it will run exactly the same every time. There's no opening for clumsy mistakes. Just as important, the very fact that your environments have been reworked to accommodate automated deployments tends to reduce the subtle discrepancies between the production, staging, and development environments. It allows you to have confidence that if a feature worked on the staging environment, it will work when deployed to production. It helps the process of deploying become fearless.
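In its simplest form, "performed by a script" just means the same manual steps written down in code, so they run in the same order every time and stop at the first error. Here's a minimal sketch as a Node/TypeScript script; the host name, paths, and service name are placeholders, and a real pipeline would pull them from configuration:

// deploy.ts -- a minimal, illustrative deploy script. Every name here is a placeholder.
import { execSync } from "node:child_process";

const HOST = "deploy@prod.example.com"; // hypothetical production host
const APP_DIR = "/srv/myapp";           // hypothetical install path

// Run a shell command, echoing it first. Any failure throws and aborts the deploy.
function run(cmd: string) {
    console.log(`$ ${cmd}`);
    execSync(cmd, { stdio: "inherit" });
}

run("npm run build");                                               // 1. build locally, the same way every time
run(`ssh ${HOST} "cp -r ${APP_DIR}/current ${APP_DIR}/previous"`);  // 2. keep the old build around for rollback
run(`scp -r ./dist/. ${HOST}:${APP_DIR}/current/`);                 // 3. copy the new build over
run(`ssh ${HOST} "cd ${APP_DIR}/current && npm run migrate"`);      // 4. apply schema updates
run(`ssh ${HOST} "sudo systemctl restart myapp"`);                  // 5. restart the service

The point isn't this particular script; it's that the steps live in one place, run identically every time, and fail loudly instead of limping onward.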

In the last decade, an ecosystem of devops technologies has emerged orbiting around Docker, which makes consistent, automated deployments worlds easier. Docker deserves a whole post of its own, which I intend to publish in the near future.

Professional testers, not test plans

In a growing company, the leadership can be surprisingly reluctant to hire testers. Good testers are quite technical, they command high salaries, and they produce no quantifiable product. Couldn't we just, you know, write up a really detailed test plan, and have everybody pitch in to run the test plan in the week leading up to the release? Or maybe just outsource testing to India?

The problem is that testing is much like development. It's skilled, intuitive work, and can't be reduced to a set of procedures. Good testers will have written plans, but they look more like

  • "Fill in the form with typical user data"
  • "Fill in the form with a user with a very long name"

than

  1. Insert the name 'John' into the first name field
  2. Insert the name 'Smith' into the last name field
  3. Type '987-654-3210' into the phone number field
     ...
  4. Click the submit button.

They don't perform the test exactly the same every time, and that's precisely why they are effective. Most bug discovery doesn't look like "Test case #12-93-D failed". It involves the tester catching something out of the corner of his eye and thinking, "Huh, that looks a little funny. I'm going to see if I can make that happen again."

So how do you hire those great testers? That's a whole article in itself, and besides, Joel Spolsky has written it up better than I could.

Automated regression tests

I almost hesitate to recommend this one, because it's so easy to abuse. It's easy for software to become inflexible, to have bad early decisions become unchangeably baked into the design, by having too many automated tests or clumsily thought-out automated tests. Rules like "All code must have at least 80% unit test coverage" are more likely to contribute to the death spiral than to stop it. But nonetheless, a small, carefully chosen suite of automated tests is an important part of being able to release fearlessly. They are no replacement for professional testers, but they can give a quick bit of feedback to assure you that the core features of the software aren't totally broken. They're not so good for proving that the car's AC keeps passengers comfortable, but they can demonstrate that the car still starts when you turn the key, goes when you push the gas, and stops when you hit the brakes.
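As a sketch of what "small and carefully chosen" can mean in practice, here's what a couple of smoke tests might look like using Node's built-in test runner. The createOrder and getOrder functions are hypothetical stand-ins for whatever your product's "car starts when you turn the key" features are:

// smoke.test.ts -- a handful of "does the car still start?" checks, not exhaustive coverage
import { test } from "node:test";
import assert from "node:assert/strict";
import { createOrder, getOrder } from "./orders"; // hypothetical core module

test("an order can be created and read back", () => {
    const order = createOrder({ customerId: "c-123", items: [{ sku: "A1", qty: 2 }] });
    assert.ok(order.id);                                      // the happy path still works
    assert.deepEqual(getOrder(order.id).items, order.items);  // and what we wrote is what we read
});

test("an empty order is rejected", () => {
    assert.throws(() => createOrder({ customerId: "c-123", items: [] }));
});

A dozen tests like these, run on every deploy, catch the embarrassing total breakages without calcifying the design the way blanket coverage mandates do.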

Feature flagging

If your company has started to become mired in excessive process, you might at some point ask, "What's stopping us from doing it like we did back when the company was small? This would have taken a couple of days then; why is it taking a month now?" The answer is that, back in those days, there weren't so many customers who would be affected by a problem. Most of the risk and anxiety around large deploys comes from how they affect every user at once. There's a simple way around this: quietly release the feature, but have all the code for it wrapped inside an if statement.

There are tools to manage all this, but it can be something as simple as:

if ('this_feature' in customer.features) {
	// New feature code
} else {
	// as things were before
}

Then, without doing a big release, just by altering your settings, you can flip on the feature for just a few customers. Big multinational corporations take this further: they can deploy a feature to just a random 0.1% of customers, or only to New Zealand, and double-check that their usage metrics don't fall off abruptly.
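A sketch of how that kind of targeting can work without heavyweight tooling: hash the feature name plus the customer ID so each customer lands deterministically in or out of the rollout, and optionally gate on a region attribute. The field names here are made up for illustration:

// featureFlags.ts -- deterministic percentage + region rollout (field names are illustrative)
import { createHash } from "node:crypto";

interface Customer { id: string; country: string; }
interface Rollout { percentage: number; countries?: string[]; }

function isEnabled(feature: string, customer: Customer, rollout: Rollout): boolean {
    if (rollout.countries && !rollout.countries.includes(customer.country)) {
        return false;
    }
    // Hash feature + customer id, so the same customer always gets the same answer
    const hash = createHash("sha256").update(`${feature}:${customer.id}`).digest();
    const bucket = (hash.readUInt32BE(0) / 0xffffffff) * 100; // map to a number in [0, 100]
    return bucket < rollout.percentage;
}

// e.g. turn the new checkout flow on for 0.1% of customers in New Zealand:
// isEnabled("new_checkout", customer, { percentage: 0.1, countries: ["NZ"] })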

Demo early and often.

Ultimately, the executive from our story who tried to clamp down on bugs was barking up the wrong tree. Bugs and outages are certainly alarming. But in the end, most software products that fail don't fail because of bugs. They fail because the team built the wrong thing.

Even where the team is building the right thing, the majority of software problems aren't strictly "bugs": that is, cases where the software is obviously broken, freezing up or showing the infamous "Blue Screen of Death".

They are cases where the engineers built something which fulfills the requirements they were given, but it's still not quite right. It's unintuitive, or inconsistent, or it's just not a great approach to solving the problem at hand. The best remedy to all these issues is to have a constant feedback loop, which includes anybody with a stake in the software: managers, executives, sales, and marketing. This is really another aspect of releasing often, and can serve as a substitute when there are legitimate reasons to wait until a feature is more complete before literally putting it in front of customers.

This cannot be done effectively in a culture of anxiety over bugs, where everybody is trying to dodge the hot potato of blame. If that's the case, engineers have an incentive to spend weeks or months tinkering and polishing before they demo, so they have a bulletproof case to prove they weren't negligent. This is the opposite of what you want. By demoing often, you get repeated opportunities to make the design of the software a little bit better. Features which sounded good on paper but don't work so well in reality can slough off, while new ideas can come from anywhere: engineers can propose changes to the design, and the folks in Customer Support and Sales who interact with customers day in and day out can share their insights into what those customers are most likely to value.

Ultimately, though, these demos are second-best to the true test of software: feedback from actual customers. The truth is, as much as we plan and strategize, neither we nor the customers can know exactly which products are going to be hits until those customers actually get their hands on them. This is the most valuable benefit of short release cycles: they let you put more potential hits in front of customers, fast.

"What can we do to ensure this never happens again?" was always the wrong question.

Ask instead, "What are some ways we can limit the risk of critical bugs, while still remaining nimble?" Ultimately, those companies that lose the ability to deploy new software have discovered the only true way to ensure that bugs are never deployed in front of customers again.


  1. There are no rules 1 through 7. This is done in honor of Greenspun's 10th rule. ↩︎

  2. Essential caveat: the risk-reward tradeoff is a little different if you're working on the type of software where planes literally fall out of the sky if you get it wrong. But that's not the kind of thing that most of us are doing. ↩︎