Why are you still deploying overnight?

10/17/2011 – Welcome Hacker News folks!

There’s some discussion happening in the comments; but, as always, the better conversation is on the article page on Hacker News itself.

“3:00 am Deployment?  Why Not?”

That was the Facebook status of a former co-worker about a week ago. I happened to be awake and online at the same time (he’s on the East Coast; it was only midnight here) and immediately responded, “The better question is, ‘Why?’”

Deployment.  Production Push.  Go Live.  Rollout.  Whatever you call the process of turning your development codebase into a live, production application, I sincerely hope you’re not living in the Stone Age and doing it in the middle of the night under the guise of avoiding customer impact. Unfortunately, if my past experiences, and the experiences of many I’ve spoken to, are the norm, you very likely are.  If your strategy to avoid customer interruption is based solely on trying to avoid your customers, you’re setting yourself up for even more headaches and long-term failure.

The motivations for these overnight deployments are suspect at best. The claim is that by avoiding the daylight hours, fewer customers will be impacted by the rollout. Problem 1: You presume there will be problems that impact availability. You have no confidence in your code quality, or no confidence in your infrastructure and deployment process, or both. If you lack confidence that your new system is ready for production, you probably shouldn’t be pushing it to production! If you think your servers aren’t ready or that your deployment process stinks, take the extra time now to improve them. I’ve seen great companies with absolutely terrible build and deployment processes that turn getting code into production into a nightmare. At the same time, these same companies refuse to devote more than a single person (or maybe only part of his or her time) to improving that process.

Perhaps even more suspect in the reasoning is the notion that because the process is complicated and volatile, it should be done in off-hours. Problem 2: You’ve got a complicated process and you’re sending over-tired, over-worked people to deal with it.   Imagine, for a moment, that your team is rolling out an update to a service that monitors life-support systems in hospitals.  Do you want tired, stressed, and unmotivated people working on the process?  If deployment is one of your most complicated procedures, why are you sending your people at their worst to handle it?

Earlier, I mentioned that teams are simply attempting to avoid customers by deploying overnight. Aside from this being a futile goal for any global business (and this is 2011), it likely suggests you’re missing two things. Problem 3: You have no means of doing a phased rollout or a quick rollback. Deployments in this world are likely one-way affairs with a lot of time devoted to pushing the new code and no clean way to revert those changes quickly if something goes south. Make no mistake, I’m not suggesting that deployments are easy (or even that they should be). Nor am I suggesting that everything should always be perfect when you deploy code. However, attempting to sneak code out in the middle of the night is hardly meeting the challenge head on.

Compounding all of these issues is the fact that there are some problems you can only see once a certain scale is achieved. By hiding from your customers during deployment, you may also be burying your head in the sand with regard to these potential bugs.

Plan For Success; React Quickly to Obstacles

There are several techniques I’ve seen employed that have had a great impact on improving the deployment process to the point teams have felt comfortable deploying while the sun is up.

Involve your QA team early so they have a full understanding of the feature and how to test it. Foster a partnership between QA engineers and developers so they work together to understand the full impact of the feature and ensure that your testing, especially regression, is thorough enough to develop high confidence in your quality. Always remember that the later in your process bugs are discovered and fixed, the more expensive they become (especially if these bugs make it all the way to production). Incentivize your people around delivering quality early–not finding bugs late.

Devote time and energy to your deployment processes; don’t shunt them off onto one guy working in isolation. Establish an owner; but, make sure this person is integrated with the rest of the development team and understands their pain points and needs.  Automate complicated manual processes to prevent mistakes (you know, the type of mistakes that happen when a tired engineer is sitting at his or her console at 3:00 am).

Decouple various parts of your system so they can be deployed and rolled back independently. There’s no sense in having to take your checkout process offline simply because of a regression in your unrelated public API.  This concept is often easier said than done; but it’s incredibly important and worth your team’s investment in time.

Use feature kill-switches aggressively; allow certain parts of your application to be turned on and off via runtime configuration.  Deprecate old functionality rather than destroying it in your codebase. Allow the feature switch to revert to the old code paths without forcing a code rollback or additional push.  Once you’re confident in your new functionality, the deprecated paths can be removed in your next deployment. In cases where this concept is prohibitively difficult or even impossible, modularize the code containing that feature and run both so you can quickly switch back to the old code if necessary.
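To make the kill-switch idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `RUNTIME_CONFIG` dictionary stands in for whatever runtime configuration store you actually use, and `new_checkout` and `legacy_checkout` are made-up names for the new and deprecated code paths.

```python
import json

# Hypothetical runtime configuration. In a real system this would likely
# live in a config service or database that can change without a deployment.
RUNTIME_CONFIG = json.loads('{"new_checkout": true}')


def is_enabled(flag, config=RUNTIME_CONFIG):
    """Return True when the named feature switch is on (missing means off)."""
    return bool(config.get(flag, False))


def legacy_checkout(cart):
    # Deprecated path, kept in the codebase until the new one is trusted.
    return {"total": sum(cart), "path": "legacy"}


def new_checkout(cart):
    # New functionality guarded by the switch.
    return {"total": sum(cart), "path": "new"}


def checkout(cart, config=RUNTIME_CONFIG):
    # The switch is read on every call, so flipping it takes effect
    # right away with no code rollback or additional push.
    if is_enabled("new_checkout", config):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Setting `RUNTIME_CONFIG["new_checkout"] = False` sends the very next call back down the legacy path. Defaulting a missing flag to "off" is a deliberate choice in this sketch: a broken or incomplete config falls back to the old, known-good behavior.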

Avoid unnecessary deployments. When I talked to my friend mentioned above about his 3:00 am deployment, he told me they had to do it to end a contest that their website had been running.  I’m sorry; that’s just not a reason to have to push new code to production. Feature switches targeting alternate code paths could have solved that problem. Moreover, they could have been set on a timer such that the moment the contest ended, the entry form was disabled.  Even following every suggestion here and others you can find, deployments are never going to become easy, only “easier.”  Don’t saddle your team with more of them than you need.
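A timed switch for something like that contest can be very small. The sketch below is an assumption about how it might look, not what my friend's team had: `CONTEST_ENDS_AT` is a made-up end time, and `render_entry_form` is a placeholder for whatever actually draws the form.

```python
from datetime import datetime, timezone

# Hypothetical contest end time, stored as configuration rather than code.
CONTEST_ENDS_AT = datetime(2011, 10, 17, 7, 0, tzinfo=timezone.utc)


def contest_is_open(now=None, ends_at=CONTEST_ENDS_AT):
    """Return True while the contest is still accepting entries."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < ends_at


def render_entry_form(now=None):
    # Placeholder for the real view: the form disables itself the moment
    # the end time passes, with nobody awake to push code.
    if contest_is_open(now):
        return "entry form"
    return "contest closed"
```

Passing `now` explicitly makes the cutoff easy to test during the day, long before the real deadline arrives.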

Release early, release often. I realize this mantra has been repeated over-and-over for years; but, that’s only because it’s such important advice. By releasing new code to production often, you’re shrinking the size of each deployment. The less stuff that changes, the less that can go wrong.

Create a system for phasing your rollouts. It’s a much better way to reduce customer impact of issues than simply hiding from your customers. Take your time between each phase to really let issues surface. An example of such a plan would start with a small share of your traffic, like 5-10%. That exposure is likely still fewer customers than 100% of your traffic at 3:00 am, but it’s also likely large enough to alert you to any glaring issues quickly. Once you’ve cleared that hurdle, ramp up to a number that gives you some level of scale, say 50%. This level keeps your customer impact somewhat diminished if anything goes wrong, and it will expose issues that may not appear until your app is running “at scale,” such as a new API call taking far more cycles than intended because the caching isn’t working correctly. You may not notice these extra cycles at low volume; or worse, you may simply write them off on the assumption that the service isn’t seeing enough traffic to really warm the cache. Once you’ve crossed that hurdle, you’re ready to ramp up to 100%. Each phase should be designed to contribute confidence along the way that once all customers have this new code, they’ll be getting the quality experience you intended to deliver.
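One common way to implement a phased rollout like this is to hash each user into a stable bucket and compare it to the current rollout percentage. The sketch below assumes that approach; the `bucket` and `in_rollout` helpers and the `PHASES` list are names I made up for illustration, and the percentages simply mirror the example plan above.

```python
import hashlib

# Hypothetical phases, matching the plan described above.
PHASES = [5, 50, 100]


def bucket(user_id):
    """Map a user id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id, percent):
    """Return True if this user should get the new code at this phase."""
    # Buckets never change, so a user who is included at 5% stays
    # included at 50% and 100%.
    return bucket(user_id) < percent
```

Because the bucket depends only on the user id, nobody flips back and forth between old and new code as you ramp up, and dropping the percentage back to 0 acts as a quick rollback.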

Get Some Sleep (Or Maybe…)

Ultimately, deploying overnight is likely indicative of something (probably several somethings) being broken in your process. Luckily, by considering your deployment process an important part of your product and devoting time and energy to it, you can turn overnight deployment into a thing of the past and reclaim those late nights for important things like sleep.

Or Adult Swim.  Your choice.

Kickoff to Delivery: The Most Difficult Phases

Something I’ve noticed over the years of being an engineer is that project staffing tends to look something like, “Let’s get our best and brightest involved early to get the project moving–once it’s moving, the rest of the team can finish the job.” Part of this statement shows great sense for project delivery–and the other completely opposes it. The two most difficult phases of any project are the initial kickoff and the final delivery–and you need a different set of people for each.

Looking at the philosophy above, it is generally understood that getting the ball rolling on a project is a difficult task. Especially in the IT world, where a project may call for some new technology that the majority of your staffers aren’t yet up to speed on, it’s great to bring in people who tend to be somewhat ahead of the curve. They lay the initial groundwork in actual code, but more importantly they support the development of the staffers who will continue the work after they leave for the next project.

The best people for this phase are your leaders, your entrepreneurs.   They’re some of your top technical people who also bring strong technical leadership to your team.  They inspire others to follow them into the project and motivate the people around them.  Most of all, they are able to charge into complex technical problems and tackle some big challenges early on and clear those roadblocks for the team while providing solid examples of how the team should proceed.

As these people move off the project, it is assumed that the team “left behind” will bring the project to its completion. However, completing a project presents the same level of challenge as starting it.  The challenges are different, but they are equally complex: fixing defects, tying up loose ends, and the inevitable realization that something that worked every time you tested it up until now has a couple of edge cases that, while not common, are possible enough that they must be accounted for in a new design. As a project approaches that end date, another personality type is required to join the team.  I call them “closers.”

Closers are methodical, detail-oriented executors.  Winding a project to completion involves fixing often very delicate defects without destabilizing the entire system (or even large portions of it).  Rather than being solid team leaders, they are usually the absolute top technical talent you have.  They’re people who have seen nearly everything that can happen as launch approaches and can maintain their calm while dealing with the challenges.  Most importantly, these closers act as a buffer between management (which is usually in full-panic mode as the launch approaches) and the rest of the team working to knock out those last few defects before launch.

We already put a lot of effort into people and process to get a project started; it’s time to pay more attention to Closing the Deal.  

Emergence Drowns in a Waterfall

In July of this year I’ll be taking a class to become a certified ScrumMaster. It will be the first industry certification that I’ve earned in my career–I’ve frankly never found them to be valuable so I’ve avoided them. But as much as I’m passionate about building quality products, I’m even more passionate about the processes that enable building quality products. And besides, what’s cooler than a title that identifies me as Master?

The classical model of software development follows the Waterfall Model–that is, each process completes a set of deliverables and they “fall down” to the next tier.  At least, that’s the theory.  In practice it usually works more like a set of walls that each group throws their deliverables over as they’re completed–and the team who receives that deliverable is asked to follow that deliverable explicitly.  The further “downstream” you are in the waterfall, the more frustrating this process becomes as you get further and further away from the true objective of the project.

Often we find that when we fulfill the requirements, we don’t really meet the objectives, and along the way we happen upon a better way to meet them: this is called Emergence. While attempting to reduce uncertainty downstream (often quite unsuccessfully), the Waterfall also restricts our ability to act on the Emergent Properties we discover in a system as it is built. In my opinion, this restriction is the single largest cause of poor software design, as it causes us to accept our requirements-time leaps of faith over development-time observations.

Emergent properties are very powerful tools to guide the way to great products and they should be embraced.  Scrum and other Agile methodologies support these properties while the Waterfall, with all pun intended, drowns them.