Time for a little reflection. What better time than the end of the year?
More than a year ago I joined a small company with a strong engineering team. I have since been impressed by how much so few people can handle and achieve. So this story is as much about them as it is about me.
I wanted to write about something that had a big impact on my work and the way I approach software. If I had to point to one technical thing that did that, it would be what I often jokingly refer to as “server gardening”.
Server gardening – the beginning
Before I joined, back when our product was launched, the team had almost no server administration experience. The logical thing to do was to outsource this work. The guy doing the job did great! Not only did he set everything up, he also documented the initial setup and scripts well.
So the servers were up, but the knowledge of how to work with and maintain this infrastructure was still missing and would have to come later.
This matters because software is always changing by nature, which means our infrastructure has to change as well.
Yet what we had at that moment was not optimized for change. It was optimized for documentation and the initial setup of servers.
So what was our change process after we went live? Simple:
– Apply change in production
– Update documentation
– Update script repo
As you have probably guessed, the direct consequence was that only the first step was sure to be executed – “apply change in production”.
It was only a matter of time before things went wrong.
It happened when a colleague deleted everything on one of our servers by mistake. It was the ONLY server that was not documented and whose scripts were not stored in the repo. There was no impact on production, but we lost days recovering.
We definitely needed some changes in the way we handle this.
Initially we dealt with this by responding to issues as they came up, doing our best and often consulting the guy who had set it all up. After several months of that, we finally took a more proactive approach.
We got together and discussed a list of improvements. Here is the short version:
– Version control everything
– Minimize manual work
– Scripts in repo to reflect actual state
– Safe place to try things out
Let’s go into more detail about what our initial state was and why a change was actually needed.
Version control everything
Perhaps you are wondering what we had that was not version controlled? Here are a few examples:
– deployment scripts
– some web server configurations
– database migrations
– developer related tools
It was not that those things were not stored anywhere – of course they were! Usually that was a shared folder and we had a system to version them. Still, it was again optimized for storage, but not for changing.
Let’s look at one of these in more detail: database migrations.
When I needed to set up a new development environment, I took the scripts from the shared folder and applied them to my local database copy. I noticed the database could use an index here or there. It surprised me that those indexes were missing, as my colleagues are amazing with databases. So I asked about them. Everybody was surprised. Why? Because the indexes were there! In production. In our beta environments. On other developers’ machines. But not in our shared folder.
Why were the scripts not there as well? Well, because…
Minimize manual work
We executed a lot of stuff by hand.
Let’s look at database migrations again. There are plenty of tools that could be used for this. We were not using any of them, so we had to rely on a manual process. Naturally, at some point somebody forgot to update the shared scripts and documentation.
This was a pattern in the way we worked, and it definitely did not apply only to database migrations.
We needed the correct tools to solve this problem.
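To make the idea concrete, here is a minimal sketch of what a migration tool does at its core: it keeps a record of which migrations were already applied and runs only the missing ones, in order. This is not the tool we used; the table name `schema_migrations`, the migration names and the use of SQLite are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical migrations, listed in the order they must run.
# In a real project these would be files stored in the repository.
MIGRATIONS = [
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"),
    ("002_index_users_email", "CREATE INDEX idx_users_email ON users (email)"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply every migration not yet recorded; return the names applied."""
    # Bookkeeping table that remembers what has already run on this database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}

    newly_applied = []
    for name, sql in migrations:
        if name in applied:
            continue  # already present on this database, skip it
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        newly_applied.append(name)
    conn.commit()
    return newly_applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(migrate(conn))  # first run applies both migrations
    print(migrate(conn))  # second run finds nothing left to do
```

With something like this in place, an index added in production exists as a migration in the repo first, so a fresh development database ends up with it too.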
Scripts in repo to reflect actual state
We had all those bash scripts that could set up a server from scratch. What they could not do was apply just a small change on top of an existing server. This is why we applied such changes manually.
We needed idempotency – executing the same script once or multiple times should always produce the same state.
While this is not impossible to do with bash, it requires a lot of work.
This is exactly why we chose Ansible. When we needed something changed, we would write an Ansible script and execute it. Ansible scripts are not always idempotent out of the box, but idempotency is a concept that is part of the tool’s philosophy. As a rule, Ansible tries to change only the things that really need to change and does nothing otherwise.
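Idempotency is easier to see in code than in words. Below is a small sketch of an idempotent operation, similar in spirit to what an Ansible module does, but not taken from Ansible or from our scripts. The function name `ensure_line` and the example setting are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file; return True if the file changed."""
    p = Path(path)
    existing = p.read_text().splitlines() if p.exists() else []
    if line in existing:
        return False  # desired state already reached, do nothing
    existing.append(line)
    p.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "app.conf"  # hypothetical config file
        print(ensure_line(config, "max_connections = 100"))  # file is changed
        print(ensure_line(config, "max_connections = 100"))  # nothing to do
        print(config.read_text())
```

A naive bash equivalent such as `echo "..." >> app.conf` would append the line again on every run. Describing the desired end state, instead of the action, is what makes the script safe to run repeatedly.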
But to be able to actually develop anything quickly, we needed one more thing:
Safe place to try things out
We needed to learn, not just “deliver servers”, as this is what got us into the situation we were in. It was not bad actually – we had servers up and running, which was great! But managing them was another story and a source of frustration.
To learn, we needed a safe place where we could “play”. Production is not a place to learn, play and experiment.
It was obvious we needed virtual machines, and to manage those we chose Vagrant.
If you are interested, I have written about my first experiences with Vagrant.
First – this was not implemented in one go. It was a long and slow process, as we had many other things to do in the last year – like developing the software to put on those servers.
We saw an almost immediate impact from storing everything in the repository. The same happened with database migrations and Vagrant. There wasn’t much to learn there; it was just a few practices and simple tools that were easy to document and integrate into our development process.
Where things were slow was managing the actual infrastructure. It was not because of the tools we used. We just had plenty of servers and a lot to learn. Slowly we began to understand what we were doing, wrote more and more scripts and eventually started reusing them successfully.
Many of our existing bash scripts were converted to Ansible.
A new server garden
Then we had to set up a whole new “server garden”.
We managed to set up an entirely new infrastructure in another cloud. There we automated almost everything – from infrastructure definitions (server types, DNS records, etc.) to deployments.
We could easily test almost everything on our local development Vagrant machines before applying it to the cloud.
We actually reused a lot of the code we had developed, so we could build smaller Vagrant machines and run “production copies” on our own machines. This gave us a lot of fast feedback while developing. If you are interested, I have written something about that in this post.
Instead of installing and configuring a database on my developer machine, I can now launch a virtual machine configured similarly to, or even exactly like, our production. Even better – I can do this for any server we have deployed, because we reuse all of our deployment scripts: the same Ansible playbooks that build and deploy our production code are used to create our local development virtual machines. This is not just easy, but trivial now!
Mistakes were made
All this sounds great, but it took a lot of time and effort. Mistakes were made, some more than once. Here are some of them:
Trying to learn too much at once
You cannot figure out Ansible, cloud services and other tools like Packer, and make real progress, all at the same time. Of course, this is exactly what we tried to do. We tried to find help again, but definitely not hard enough.
Much head banging against virtual walls was the direct result of this.
Use the right tool for the job
If all you have is a hammer, everything looks like a nail
Our hammer in this case was Ansible, and the problem nail – cloud infrastructure.
Ansible is a great tool and one you can use to manage your infrastructure just fine.
The problem is that we also tried to define our infrastructure with it. We eventually succeeded, but Ansible is not great at this and is just not the right tool for that job. Our reasoning at the time was that we really did not want to learn yet another thing. If we had to do it again and choose a tool for the job, we would probably go with Terraform.
Why not Docker?
Everybody is talking about Docker these days and here I am talking about other boring stuff.
Docker is probably the direction we should move in the future. I am experimenting with it on my local dev machine and like what I see. But this was not the solution for us in the last year.
We are managing two separate infrastructures and want to keep those as close as possible. There is only so much new stuff our small team can handle 🙂
My career as a server gardener is doing fine, considering it started just a year ago. I need to juggle it alongside other responsibilities, but I am lucky to have awesome colleagues that I can always count on for help when needed. I know they are reading, so thank you, you are the best!