Yesterday, in the lead-up to the AWS Summit, I got the chance to spend a day of fun and learning at the Auckland AWS Shine DevOps day.
We were in the business of renting Unicorns (imagine my glee). Unsurprisingly, as demand for Unicorns as a Service grew, so did the system load, the unexpected deployments, the security shenanigans and everything in between. There was carnage, and lots of it.
Unicorn Rentals was launched abruptly with an *ahem* "guys, it's live, here's a run book (of sorts), good luck!" And before we had time to blink we were in game day mode, with production traffic and customer transactions hitting us.
We had to scramble. Being paired with a few engineers who were total strangers, with no shared ideas or values about how to do a very difficult job, meant our team, Magic Sparkles, had its production environment unreachable or on fire for hours on end throughout the day.
The game was never meant to be easy. To be fair to us, we were down a few key things from the get-go. Let's talk about that ...
No established way of sharing information
My first comment to my team mates, just seconds before we found out the site was live, was "I think we need something like Slack or HipChat, or even just Google Docs, to share information with each other?". There was general agreement this was a good idea, but it was cast aside the second volumes of paying production customers started ordering their Unicorns.
We moved to verbal communication because, after all, we were sitting next to each other. I'm sure I don't need to explain that this didn't work in our favour. The most common things we said to each other as everything got rapidly hotter were "What are you doing now?" and "Have you done x yet?". Under pressure it got harder and harder to remember what had been done and what had not.
The greatest thing about Slack, and ChatOps in general, is that when you or other engineers are in trouble everyone can see what's happening, and in particular teams 'on the outside' can see a call for help.
When our load balancer was dropping traffic and none of us knew what to do, I can't really begin to describe the despair of watching transaction after transaction fail with no way to track our troubleshooting progress or call for help.
We had access to metrics. AWS is rich with metrics, monitoring and options to trigger alerts: everything a DevOps team could ever want. But none of it was set up; that was what the DevOps crew for Magic Sparkles had to do, and quickly.
It was not long before issues started to arise and we were reactively trying to figure out what to do without the right information, actioning changes in production based on diagnostic theories. Once we found a live dashboard of customer transaction metrics (pictured above) I was at least able to speak to our relative health by saying whether we were up or down. Pretty dire.
Even then, it was only once something had failed us and we were dropping transactions on the floor that we put in our health checks. I'm mindful that this was all part of the game, but it did make me reconsider how often we think about the metrics we actually need just to stay standing, let alone to proactively keep our environment healthy and scaling appropriately.
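Getting even one alarm in place early would have changed our day. As a rough sketch (not the actual game day setup): the classic-ELB backend 5xx metric can be alarmed on with a few lines of boto3. The load balancer name, SNS topic ARN and thresholds below are all invented for illustration.

```python
def elb_5xx_alarm_params(lb_name: str, sns_topic_arn: str) -> dict:
    """Build CloudWatch alarm parameters: backend 5xx errors on a classic ELB."""
    return {
        "AlarmName": f"{lb_name}-backend-5xx",
        "Namespace": "AWS/ELB",
        "MetricName": "HTTPCode_Backend_5XX",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "Statistic": "Sum",
        "Period": 60,               # one-minute buckets
        "EvaluationPeriods": 3,     # three bad minutes in a row before alarming
        "Threshold": 10,            # more than 10 errors per minute
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify a topic the whole team watches
    }

# To actually create the alarm (requires AWS credentials and boto3):
#   import boto3
#   boto3.client("cloudwatch", region_name="ap-southeast-2").put_metric_alarm(
#       **elb_5xx_alarm_params(
#           "unicorn-elb",  # hypothetical load balancer name
#           "arn:aws:sns:ap-southeast-2:123456789012:ops-alerts"))
```

Having a handful of these defined before go-live means the first sign of trouble is an alert, not a customer complaint.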
Transparency was again an issue for our small team. I'm used to having all these critical metrics pushed into Slack for everyone to see. All engineers, not just the DevOps (or SRE) team, can see right away if a change has negatively impacted production in some way, and we can start to fix it.
Without this, diagnosing the issues troubling us and taking appropriate action took ten times longer, as no one was playing with a full deck of cards. What a nightmare. This needed to have been done well before we went live with anything. We were doing our best given the circumstances, in a 'this is fine' kind of way.
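For what it's worth, "pushing it into Slack" doesn't need much plumbing. A minimal sketch, assuming a CloudWatch alarm publishing to SNS: a small Lambda handler that forwards each alarm state change to a Slack incoming webhook. The webhook URL is a placeholder, not a real endpoint.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def format_alarm_message(alarm: dict) -> dict:
    """Turn a CloudWatch alarm notification into a Slack message payload."""
    state = alarm["NewStateValue"]
    emoji = ":fire:" if state == "ALARM" else ":white_check_mark:"
    return {
        "text": (
            f"{emoji} *{alarm['AlarmName']}* is now {state}\n"
            f"> {alarm['NewStateReason']}"
        )
    }

def handler(event, context):
    """Lambda entry point: each SNS record carries one alarm state change."""
    for record in event["Records"]:
        payload = format_alarm_message(json.loads(record["Sns"]["Message"]))
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

With something like this wired up, the whole room sees the same failure at the same time, instead of relying on "have you done x yet?".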
One thing that really stuck out for me was the absence of direction or leadership. I'm not framing this in a blameful way, or as a reflection on myself or my team mates: three complete strangers who had never worked together, hurled into a game of keeping a cloud production system from burning down, was never going to be easy. But no one was driving.
I don't think anyone in my team felt comfortable, on any level, standing up and taking charge. I'm a QA manager who has done some work for our SRE team on occasion, and I had my very first experience with AWS on a tech essentials training course just days before. I can be a great contributor, but I am not the person you want leading a DevOps team on game day.
We did have experienced operations engineers in Magic Sparkles, and we still struggled to know whether what we were about to do was the right thing.
A team leader would have given us steering and helped us ask the right questions, empowering us to make sound decisions and solve the problems ourselves. Instead, in a leadership vacuum, we had uncertainty and no game plan or coordination: all the things we really needed to function as a performing team. We really suffered ... and let's not forget our sad customers who couldn't rent Unicorns! If anything, people retreated to the comfort of trying to be a good individual contributor.
AWS staffers saw us on fire and, while cautious not to give us an unfair advantage, helped us by prompting us to ask the right questions and supporting the moves we needed to make, the way a great team lead or tech lead would, to get us back up and processing customer transactions.
Team leadership in its own right is something I will talk about at length in other posts. What I can say is that having to watch production burn, feeling helpless as to the next steps, made me really think about the value of the amazing tech leads I work with.
Poor run book and documentation
Part of the game day challenge was inheriting a woefully inadequate set of technical documentation for Unicorn Rentals operations.
Having to figure out how our application worked, and how it was built in AWS, while it was on fire was not much fun for us or our frustrated customers. We needed documentation on application design, our deployment process and our metrics, and we had no run books for when things went wrong.
Many of the reactive infrastructure changes we made to try to put out the fires had no decision register, implementation or comms plan, back-out plan or test support.
This was a frightening world to operate in. Thinking of myself as an engineer new to this environment, we were not set up for success, and it highlighted how important it is to create this documentation for yourself and your fellow engineers.
We have a practice of linking specific run books from past operational changes to the alerts being pushed into Slack. When engineers do this they are setting up everyone around them to fix an issue that has happened before, even if they themselves are not there. Magic Sparkles could really have used some of this sort of documentation and support on the day.
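The mechanics of that practice can be as simple as a lookup table. A hedged sketch, with all alarm names and URLs invented for illustration: map each alarm to its run book, so every alert that reaches the channel carries its own "here's how we fixed this last time" link.

```python
# Hypothetical alarm-name -> run book mapping; everything here is invented.
RUNBOOKS = {
    "unicorn-elb-backend-5xx": "https://wiki.example.com/runbooks/elb-5xx",
    "unicorn-asg-scaling-failed": "https://wiki.example.com/runbooks/asg-scaling",
}

def alert_text(alarm_name: str, reason: str) -> str:
    """Format an alert line, appending the run book link when we have one."""
    text = f":rotating_light: {alarm_name}: {reason}"
    runbook = RUNBOOKS.get(alarm_name)
    if runbook:
        text += f"\nRun book: {runbook}"
    return text
```

The payoff is that whoever happens to be looking at the channel, expert or not, starts from the last known-good fix rather than from scratch.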
What an excellent experience to be part of. A big thanks to the AWS AU/NZ team for putting it on for all of us. Besides the very valuable hands-on time with the AWS console and all the various products, it's given me a lot to think about in terms of what really matters in contributing to and running a great DevOps / SRE team.
Did you go too? How did your team get on? Keen to chat; the best place to find me is usually on Twitter @SparkleOps