We were ready
After a few weeks of hard work, we were finally ready. The new feature was developed and tested, and the database had already been changed to accommodate the new data. Now it was just a matter of typing a few commands, and our changes would be live in production.
It was also Friday.
Which of course made us hesitate 🙂
“Let’s wait until Monday. What if something goes wrong today or during the weekend?”
We also had users on our service right now. When would they be gone? I know!
Let’s deploy overnight!
The service is used only in our country, so this was an option, and we had done it before. The upside: if there are issues, the impact is minimal. But we did not want to do it again.
When deploying at night, or within any tight time window:
- You are often tired and under stress;
- Few people are involved and available.
As a consequence of the above:
- More mistakes happen;
- When an issue emerges, there is often not enough knowledge at hand to resolve it quickly;
- Problems are swept under the carpet, and root cause analysis is forgotten.
After our first-hand experience with all of the above, we decided the next deployment would happen during regular working hours.
So why not just stop everything?
The easiest solution: we just change everything at once. The service will stop or be partially broken, but only for seconds, so who cares?
Just seconds, if everything goes as planned.
And that is an assumption you should never build on. Mistakes and issues happen; it never hurts to prepare.
We are not even talking about the user experience here, which was going to suffer. We also had integrations with other systems running… So stopping the service was not an option.
Now how do we do it?
Let’s first see what we want. We want the system to remain usable, without a pause. We want to deploy and test our changes safely and then just “flip a switch” to go live.
The solution was not that hard to see, as our architecture was on our side. Every server we have is doubled. Load is not the issue; one set of servers is enough to handle it. The reason we had it like this was to ensure availability if part of the infrastructure goes down.
So we routed all traffic through one half of our servers and modified our build scripts to deploy to the other half. It was easy and took just a few minutes.
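The post does not show the actual load balancer change, but in nginx terms it could look something like this minimal sketch. The server names and ports are made up for illustration:

```nginx
# Hypothetical nginx upstream: all traffic goes through the "Blue" half;
# the other half is commented out while we deploy the new version to it.
upstream app_backend {
    server blue-1.internal:8080;
    server blue-2.internal:8080;
    # server green-1.internal:8080;  # no traffic during the deploy
    # server green-2.internal:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
    }
}
```

The nice property is that the change is tiny and reversible: uncommenting two lines and reloading nginx puts the other half back into rotation.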
Yeah, we “invented” Blue-Green deployment 😉
Let’s call the set of servers that is still serving traffic Blue. The rest, which is running but receiving no traffic, we call Green.
With that in place, we deployed the new version of our code to the Green servers only. We had two applications that needed changes: one written in Java and one running on top of Node.js. The Java app deployed fine, but of course no traffic was going through it yet.
And here is the first moment all that preparation paid off: the Node.js app deployment failed. It was a small error, quickly fixed, with no impact on production, which was still using the Blue servers only.
The best part: no stress. You run a script and go talk to a colleague. Gone were the days of constant watching and crossed fingers; it was safe to fail now. Nothing would use the server you were upgrading.
After we had set up the Green servers with the new version, we were ready to test our changes for real.
The big change in this release was related to our mobile app, so we routed some of the mobile traffic through the Green servers. A few changed lines in our nginx configuration were all it took. We tested and monitored; everything was fine. Satisfied, we switched over to the Green servers completely. Now the Blue part of our infrastructure was receiving no traffic, so we upgraded it as well and finally returned the load balancers to their initial state. All servers upgraded and receiving traffic.
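One way to do that partial switch in nginx is to route requests by User-Agent. This is only a sketch under assumptions: the upstream names and the mobile app’s User-Agent string are invented, not our actual configuration:

```nginx
# Hypothetical: send requests from the mobile app to Green, the rest to Blue.
upstream blue  { server blue-1.internal:8080;  server blue-2.internal:8080; }
upstream green { server green-1.internal:8080; server green-2.internal:8080; }

map $http_user_agent $backend {
    default           blue;
    "~*OurMobileApp"  green;   # requests from the mobile app only
}

server {
    listen 80;
    location / {
        proxy_pass http://$backend;
    }
}
```

Flipping everything to Green is then a one-line change: set `default green;` in the map and reload.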
So, deployment successful?
“Guys, I’m starting to see some errors here and there. What happened?”
“Why is everything slow?”
Here was the next moment we were glad we had deployed during office hours: everybody was around and ready to help. We quickly found the issue was with our database. You know what we found there?
Many connections were being kept alive hours and even days after their work was supposed to be done. If that ain’t undead, I don’t know what is. When we deployed the new version, we opened a few more, enough to reach a threshold.
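The post does not say which database this was, but assuming something like PostgreSQL, hunting these zombies could start with a query along these lines (the one-hour cutoff is arbitrary):

```sql
-- Assumed: PostgreSQL. List connections idle for over an hour --
-- the "zombies" that pile up until a connection limit is reached.
SELECT pid, usename, application_name,
       now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle'
  AND now() - state_change > interval '1 hour'
ORDER BY idle_for DESC;
```

Other databases have equivalents (e.g. `SHOW PROCESSLIST` in MySQL); the point is to measure idle time per connection instead of guessing.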
So we slew the zombies; quest finally completed.
Writing this, I realise we did not do a proper root cause analysis of the database issue at the time. It was not the first time we had had issues with it while deploying. Still, we did something this time; last time it was just a restart and everything was “fine”.
Speaking of the database: with breaking db changes, this way of deploying is not possible. We did in fact change the database, but those two additional columns would not hurt anything, and we had applied them the day before. If you do have breaking changes, think about how to split the work into smaller, non-breaking increments, and apply and deploy them one by one. The trick is that you need to plan this in advance. Once you start working with the assumption that the database will be stopped, it is harder to go back, and you will be tempted to introduce even more breaking changes. That makes you deploy less frequently, in big increments, leading to more potential for error. Errors are also hard to investigate when you are not sure what exactly caused them.
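As an illustration of such non-breaking increments (table and column names are hypothetical, not from our system), a column rename can be split like this instead of done in one breaking step:

```sql
-- Step 1 (before the deploy): add the new column. Old code ignores it,
-- so this is safe to apply while the old version is live.
ALTER TABLE users ADD COLUMN phone_number varchar(32);

-- Step 2: deploy code that writes both columns but reads the new one.

-- Step 3: backfill the new column from the old one.
UPDATE users SET phone_number = phone WHERE phone_number IS NULL;

-- Step 4 (a later deploy, once no running code reads it):
ALTER TABLE users DROP COLUMN phone;
```

Each step leaves the schema compatible with both the running version and the next one, which is exactly what a Blue-Green switch needs.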
Some takeaways from this experience
If you want an easy, predictable, productive and stress-free software delivery process, strategies like Blue-Green deployment can help. But that is just one way to do it, not the only way.
In fact, anything will do, as long as you follow a few general principles in the way you work.
The few major takeaways from all this for me:
- Make it safe to fail
- Deliver frequently, in small increments
That doesn’t apply just to deployment, does it?
Thank you for reading.