“This is not supposed to happen”
“My VPN is not working”
Not something you want to hear when you work late. It was 30 minutes past midnight. Our night deployment was not starting well.
I check my VPN connection – all is OK. I log on to our VPN server and check my colleague's configuration. Nothing suspicious. We restart her VPN – no change. Create a new configuration, restart the service? Nothing.
Looks like I am performing our “10-15 minute deployment” on my own tonight. And those “10-15 minutes” are already in the past.
Once upon a time
Once upon a time, in the big cloud on the Internet, lived an application. People could log in to the application with their usernames and passwords and access it via the web. Mobile apps were added and people dutifully logged in with their username/pass as well.
Until one day username/pass was not enough.
But wait, there is more!
People that use the application can own a small plastic card that can be used to identify them as users. They can use the cards at different places, receiving various stuff as a reward.
Until one day those small cards were not enough. Users needed other ways to identify themselves as well.
So we changed the application. We made it more extensible. It had to support everything we threw at it. Other types of authentication? No problem – what would you like to have? Replace cards with something else? As many different things as you want.
Code was changed, the underlying database was changed as well. Many tests were written, rewritten and performed. Data migration scripts were created and executed against real data extracted from production. Everything was put on our beta server and tested some more. And everything went smoothly and fast.
Until we decided to deploy it for real.
We made sure to migrate step by step and during the night, when no new data is coming into the system – it is in “read-only” mode. We had a number of steps to perform and a plan to execute between me and my colleague.
The plan was simple:
1. Extract all production data and migrate it in a separate place, so we are sure everything is OK with the migration
2. Back up everything
3. Migrate real production data
4. Redeploy applications depending on new data
5. Check everything is OK
We had prepared well in advance. We did not expect issues.
Famous last words, I know 🙂
When something goes wrong
With my colleague out, I made sure everything was OK on my side, so I could do everything on my own. I collected from her whatever commands and data I did not know at the time.
Perform a database backup? OK
Run the new migration against the backup? OK
Perform the migration against the real DB? It was taking a few minutes, but I saw the first few scripts execute successfully.
Wait. Why is that script taking so long? Is our DB server so much slower than my dev machine?
“Guys, something is wrong”
A quick investigation showed the script was stuck while creating a simple table. What?
It was the first command in the script and nothing else was executed after that.
I checked and rechecked everything. Ran scripts again on my machine, tried again in production. Fine on my machine, still stuck on production. Why can’t it just work?
Perhaps something with the data backup? Is the DB even working?
Yes, everything is working. I could select data and it ran fine and fast, as always.
Perhaps I can execute this script manually instead of using our db-migration tools? Still stuck.
Can I even create this table? I tried a few commands, each one simpler than the last. Even the simplest one failed. The database would just refuse to create a table with that name for some reason.
Can I create other tables? Yes.
It was late in the night by then. Everybody was tired. In the end we just restarted the database and everything was suddenly fine.
Migration finished fine.
Deployment of applications ran fine.
All checks showed everything is fine.
What was not fine was that it took a few hours in the middle of the night…
Deployments are scary
Every time we developers make a big change, we are afraid to deploy it. In our case we needed two big changes, so we decided to reduce the pain and deploy only once. This, however, does not fix the issue and makes the job even harder – you just go through the hard part once instead of twice. Still, it's a net gain, right?
But what could be the solution? When something is painful, do it more often. Improve it, until it’s easy to do. If you do it often enough you will naturally want to make it easier after all. That also means you need to develop your code with deployment in mind. This is not how we usually approach deployment. Putting stuff on production is something we usually think about only after everything is written. This makes our job much harder.
So the new rule is – you wrote it, you deploy it 😉
It's all about feedback
There are many ways to do this, but in one sentence – do not do everything at once. What you need is continuous deployment.
If you need to change the database – do it well in advance. Make sure you do not delete anything in the DB. You need to move data? Duplicate it. Leave the delete part for after everything else is finished.
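To make the "duplicate first, delete last" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (cards, identifiers, card_number) are made up for illustration and are not our real schema; the point is only the order of the steps.

```python
import sqlite3


def expand(conn):
    """Step 1: add the new structure next to the old one; delete nothing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS identifiers ("
        " user_id INTEGER NOT NULL,"
        " kind TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " UNIQUE (user_id, kind, value))"
    )


def copy_cards(conn):
    """Step 2: duplicate old data into the new table. Safe to run repeatedly."""
    conn.execute(
        "INSERT OR IGNORE INTO identifiers (user_id, kind, value) "
        "SELECT user_id, 'card', card_number FROM cards"
    )


def contract(conn):
    """Last step: run only after all deployed code reads the new table."""
    conn.execute("DROP TABLE cards")


def demo():
    # Hypothetical old schema with two rows, held in memory.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cards (user_id INTEGER, card_number TEXT)")
    conn.executemany(
        "INSERT INTO cards VALUES (?, ?)", [(1, "A-100"), (2, "B-200")]
    )
    expand(conn)
    copy_cards(conn)
    copy_cards(conn)  # running it again must not create duplicates
    old = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0]
    new = conn.execute("SELECT COUNT(*) FROM identifiers").fetchone()[0]
    return old, new


if __name__ == "__main__":
    print(demo())
```

The old table stays untouched until contract is called, so each step can go out as its own small deployment and the old code keeps working in between.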
Write and deploy code that uses your new DB next. But do not switch your code on yet – make sure it's a code path that is not executed by real users. Also make sure it's a code path that is definitely executed by other means – for example, write unit tests against it. You need to make sure your code is working, even if it is not “used” in production.
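One common way to ship code that is "off" for real users is a feature flag. The sketch below is only an illustration: the flag name USE_NEW_IDENTIFIERS, the environment-variable mechanism and the dictionary-based lookups are all assumptions, standing in for whatever configuration and storage you actually have.

```python
import os


def flag_enabled(name, env=os.environ):
    """A flag is on only when explicitly set to 'on' (hypothetical convention)."""
    return env.get(name, "off") == "on"


def find_user_old(cards, card_number):
    """Old code path: look the user up by card number only."""
    return cards.get(card_number)


def find_user_new(identifiers, kind, value):
    """New code path: look the user up by any kind of identifier."""
    return identifiers.get((kind, value))


def find_user(card_number, cards, identifiers, env=os.environ):
    """Both paths are deployed; the flag decides which one real users hit."""
    if flag_enabled("USE_NEW_IDENTIFIERS", env):
        return find_user_new(identifiers, "card", card_number)
    return find_user_old(cards, card_number)
```

Unit tests can call find_user_new directly, or pass env={"USE_NEW_IDENTIFIERS": "on"}, so the new path is exercised long before any user sees it.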
Apply one small change and turn your new code on. There are many strategies to make this happen. You do not have to do it on all servers at once either. Invest in monitoring, so you know immediately when stuff breaks.
Do not forget to clean up old code and deploy that as well. Clean the DB of any extra tables/columns/etc. you are not using.
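One of those strategies is a percentage rollout: turn the new code on for a small, stable slice of users and grow the slice while watching your monitoring. Here is a minimal sketch of the bucketing part. The function name and the choice of SHA-256 are my own assumptions, and the buckets are only roughly equal in size.

```python
import hashlib


def in_rollout(user_id, percent):
    """Return True if this user falls inside the first `percent` buckets.

    The bucket is derived from a hash of the user id, so the same user
    always gets the same answer, and raising `percent` only adds users.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # 0..99
    return bucket < percent
```

Start with something like in_rollout(user_id, 5), check that nothing broke, then raise the number step by step until it reaches 100.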
It may look complicated and messy at first. It requires discipline and planning. But it is way less scary. Your code will be integrated and deployed quickly and painlessly. Your changes are small and give you immediate feedback. No more waiting for the next release window! If something breaks, you know exactly what and can quickly fix it. How could you forget it? You wrote it yesterday, not a few weeks ago. And the best part – it is done in your working time, not in the middle of the night, when you are tired and just want this to be over.
After all, if all hell breaks loose, you better be relaxed and calm.
Like we were at our next deployment, but that’s another story.
Thank you for reading.
Thanks for sharing, Kostadin.
What I’ve learned from experience (like clearly you have) is that deployments never work out like you hope they will. Especially in environments that are not like-for-like. With all the will in the world, even when you try your best to create environments that are identical, they never will be.
There will always be differences between environments. Databases, networks, sometimes even code that you didn’t bank on.
All you can do is put the best person on the job and that’s the guy or girl that built it in the first place.
We use Vagrant to fight the environment issues and are starting with Ansible as well. The target is immutable infrastructure. Not there yet 🙂