The trouble with Unicorn Rentals - AWS Shine DevOps Day 2016

Yesterday, in the lead-up to the AWS Summit, I got the chance to attend a day of fun and learning at the Auckland AWS Shine DevOps Day.

We were in the business of renting Unicorns (imagine my glee). Unsurprisingly, as demand for Unicorns as a service grew, so did the system load, unexpected deployments, security shenanigans and everything in between. There was carnage, and lots of it.

Unicorn Rentals was launched abruptly with an *ahem* "guys, it's live, here's a run book (of sorts), good luck!" Before we had time to blink we were in game day mode, with production traffic and customer transactions hitting us.

We had to scramble. Being paired with a few engineers who were total strangers, with no shared ideas or values about how to do a very, very difficult job, resulted in our team, Magic Sparkles, having our production environment unreachable or on fire for hours on end throughout the day.

The game was never meant to be easy. To be fair to us, we were missing a few key things from the get-go. Let's talk about that ...

No established way of sharing information

My first comment to my team mates, just seconds before we found out the site was live, was "I think we need something like Slack or HipChat, or even just Google Docs, to share information with each other?" There was general agreement that this was a good idea, but it was cast aside the second volumes of paying production customers started ordering their Unicorns.

We moved to verbal communication because, after all, we were sitting next to each other. I'm sure I don't need to explain that this didn't end up working in our favour. I think the most common thing we said to each other as things got rapidly hotter was "What are you doing now?" or "Have you done x yet?". Under pressure it got harder and harder to remember what had been done and what had not.

The greatest thing about Slack and ChatOps in general is that when you or other engineers are in trouble, everyone can see what's happening, and teams 'on the outside' especially can see a call for help.

When our load balancer was dropping traffic and none of us knew what to do, I can't really begin to describe the despair of watching transaction after transaction fail and having no way to track our troubleshooting progress or call for help.

Metrics?

We had access to metrics. AWS is rich with all the metrics, monitoring and options to trigger on alerts. Everything a DevOps team could ever want. But they were not set up. That was what the DevOps crew for Magic Sparkles had to do, and quickly.

It was not long before issues started to arise. We were reactively trying to figure out what to do about them without the right information, and we started actioning changes in production based on diagnostic theories. Once we found a live dashboard of customer transaction metrics (pictured above) I was at least able to speak to our relative health by saying whether we were up or down. Pretty dire.

Even then, it was only once something had failed us and we were dropping transactions on the floor that we put in our health checks. I'm mindful that this was all part of the game, but it did make me reconsider how often we think about the metrics we actually need to stay standing, let alone to proactively keep our environment in good health and scaling appropriately.

Transparency for our small team was again an issue. I'm used to having all these critical metrics pushed into Slack for everyone to see. All engineers, not just the DevOps (or SRE) team, are able to see right away if a change has negatively impacted production in some way, and we can start to fix it.

Without this, diagnosing the issues troubling us and taking appropriate action took 10x longer, as no one was playing with a full deck of cards. What a nightmare. This needed to have been done well before we went live with anything. We were doing our best given the circumstances, in a 'this is fine' kind of way.
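To make the "critical metrics pushed into Slack" idea a bit more concrete, here is a minimal Python sketch of the sort of thing I mean. It is not what we ran on the day. It assumes you have a Slack incoming webhook URL, and the 5% threshold and the function names (build_alert, post_to_slack) are made up for illustration.

```python
import json
import urllib.request

# Hypothetical threshold: alert when more than 5% of transactions fail.
ERROR_RATE_THRESHOLD = 0.05


def build_alert(failed, total, threshold=ERROR_RATE_THRESHOLD):
    """Return a Slack message payload if the error rate breaches the threshold, else None."""
    if total == 0:
        return None  # no traffic, nothing to report
    error_rate = failed / total
    if error_rate <= threshold:
        return None
    return {
        "text": (
            f":fire: Transaction error rate is {error_rate:.1%} "
            f"({failed}/{total} failed), above the {threshold:.0%} threshold."
        )
    }


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook URL (assumed to be yours)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example: 12 failures out of 100 transactions trips the alert.
alert = build_alert(failed=12, total=100)
print(alert["text"])
```

In practice you would feed build_alert from whatever your monitoring already collects and only call post_to_slack when it returns something, so the channel stays quiet while things are healthy.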

No leadership

One thing that really stuck out for me was the absence of direction or leadership. This is not something I'm framing in a blameful way, or as a reflection on myself or my team mates. Hurling three complete strangers who had never worked together into a game of keeping a cloud production system from burning down was never going to be easy. But no one was driving.

I don't think anyone in my team felt comfortable on any level standing up and taking charge. I'm a QA manager who has done some work for our SRE team on occasion, and I had my very first experiences with AWS on a tech essentials training course just days before. I can be a great contributor, but I am not the guy you want leading a DevOps team on game day.

We did have experienced operations engineers in Magic Sparkles, and we still struggled to know whether what we were about to do was the right thing.

A team leader would have given us steering and helped us ask the right questions, empowering us to make sound decisions and solve the problem ourselves. Instead, in a leadership vacuum, we had uncertainty and no game plan or coordination, when those were the things we really needed to function as a performing team. We really suffered ... and let's not forget our sad customers who couldn't rent Unicorns! If anything, people retreated to the comfort of trying to be a good individual contributor.

AWS staffers saw us on fire and, while cautious not to give us an unfair advantage, assisted us by getting us to ask the right questions. They supported us in making the moves we needed to, in the way a great team lead or tech lead would have, to get us back up and processing customer transactions.

Team leadership in its own right is something I will talk about at length in other postings. What I can say is having to watch production burn and feel helpless as to the next steps made me really think about the value of the amazing tech leads I work with. 

Poor run book and documentation

Part of the game day challenge was inheriting a woefully inadequate set of technical documentation for Unicorn Rentals operations. 

Having to figure out how our application worked and was built in AWS while it was on fire was not much fun for us or our frustrated customers. We needed documentation on application design, our deployment process and our metrics, and we had no run books for when things went wrong.

Many of the reactive infrastructure changes we made to try and put out the fires had no decision register, implementation / comms plan, back out plan or test support. 

This was a frightening world to operate in. Thinking of myself as a new engineer in this environment, we were not set up for success, and it highlighted how important it is to create this documentation for you and your fellow engineers.

We have a practice of linking specific run books from past operational changes to alerts being pushed into Slack. When engineers do this they are setting up everyone around them to help fix an issue that's happened before, even if they are not there. Magic Sparkles could have really used some of this sort of documentation and support on the day.
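The run-book-linked-to-alert practice described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the alert names, the wiki.example.com URLs and the format_alert helper are all hypothetical, and your alerting tool may well have its own way of attaching links.

```python
# Hypothetical mapping from alert name to the run book written the last time it fired.
RUNBOOKS = {
    "load-balancer-dropping-traffic": "https://wiki.example.com/runbooks/load-balancer",
    "transaction-failures": "https://wiki.example.com/runbooks/transactions",
}


def format_alert(alert_name, detail, runbooks=RUNBOOKS):
    """Build the Slack message text for an alert, appending its run book link if one exists."""
    lines = [f"ALERT {alert_name}: {detail}"]
    runbook = runbooks.get(alert_name)
    if runbook:
        lines.append(f"Run book: {runbook}")
    else:
        # Nudge whoever fixes it to leave a trail for the next engineer.
        lines.append("No run book yet. Please write one once this is fixed.")
    return "\n".join(lines)


print(format_alert("load-balancer-dropping-traffic", "healthy hosts = 0"))
```

The useful part is the fallback branch: an alert with no run book says so out loud, which makes the documentation gap visible at exactly the moment someone is motivated to close it.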

Wrap up

What an excellent experience to be part of. A big thanks to the AWS AU/NZ team for putting it on for all of us. Besides the very key part of getting hands-on time with the AWS console and all the various products, it's given me a lot to think about in terms of what really matters in contributing to and running a great DevOps / SRE team.

Did you go too? How did your team get on? Keen to chat. The best place to find me is usually on Twitter @SparkleOps

Sharing with Slack <3

Recently I've been looking at some of the Slack teams I participate in and observing the various channels which promote positive information sharing in a culturally significant way.

Slack has changed the way we work significantly for the better, with more transparency, faster communication and fewer meetings.

Project teams have a specific place to organise thoughts and talk through project matters (especially great if you have remote workers or 3rd party suppliers).

From a DevOps culture perspective a vast amount of metrics and alerting can also be piped into specific channels. This is great for unlocking that information for all to see and allowing anyone in the business to take note and begin collaborating.

These are the more obvious uses of Slack. What I wanted to share are just a few of the nice channel additions I've seen in teams which promote a culture of information sharing and positive collaboration / learning in maybe less obvious ways.

1. #TIL (Today I learned).

Having a channel where everyone can share anything they learned today is great for engineering and product teams. Odds are that if you learned something today about a nuisance or an improvement that can be made, other people in your team will benefit from a heads-up.

2. #Thanks 

This came to mind specifically after talking with @lady_nerd from safestack.io, whose team had created a #thanks channel to call out great things done by, or help given to, their peers.

Think about this for a moment. When someone gives you a high five publicly it's a massive lift, right? Are we going so fast now that we don't have time to write the odd little love note to the awesome people around us? I hope not.

It also means that if you get some love from customers via customer support, Twitter or other channels, you can drop it in your #thanks channel. It's awesome when you're in the thick of it to know your team's work is hitting its mark with the users and customers out there in production.

Letting people know you value their input is important!

3. #Guilds

Testing guild, UX guild, front end guild? Do you have channels like these? I really enjoy having a space that's not dedicated to the day-to-day or project-specific discussions but instead is a general channel where you can talk about improving your craft as a team or community. It's up to you what goes in here, but simply making such a space available will hopefully encourage contribution and collaboration.

4. #Readings #blogs #RSSfeeds

Now I get that there are many ways to skin the cat when it comes to keeping up to date on your favourite tech blogs and news sites. There are many things like Feedly just for this. Why is this even a culture thing?

Well, if you value sharing sources of learning with everyone, it's better to do this as a community in a topic-specific Slack channel. You can invite other members with a common interest and share your sources of good reads, and in turn have them do the same for you.

I started channels for topics I'm interested in: DevOps and security. For security in particular, I post NIST NVD / US-CERT digest alerts to surface specific vulnerabilities I need to know about.

When I started surfacing these, people began to ask where I found this information. I invited them to the channel and now they have eyes on the same material. Since then I've been given many great additions to the list of things to pipe into my readings channels by other members.

I use IFTTT to watch the RSS feeds I care about and have a recipe post to Slack when a new article lands.
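If you would rather not hand this job to an external service, the same idea can be roughed out by hand. The Python sketch below is not how IFTTT works internally, just a hand-rolled equivalent under my own assumptions: the sample feed is made up, new_items and to_slack_payloads are hypothetical helper names, and actually sending each payload to a Slack incoming webhook is left out.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, seen_links):
    """Return (title, link) pairs for RSS items whose link is not in seen_links."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link")
        if link and link not in seen_links:
            items.append((title, link))
    return items


def to_slack_payloads(items):
    """Turn (title, link) pairs into Slack incoming-webhook style payloads."""
    return [{"text": f"New read: {title} {link}"} for title, link in items]


# Made-up feed for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Old post</title><link>https://example.com/old</link></item>
  <item><title>New post</title><link>https://example.com/new</link></item>
</channel></rss>"""

# Links we have already posted; a real script would persist these between runs.
seen = {"https://example.com/old"}
for payload in to_slack_payloads(new_items(SAMPLE_FEED, seen)):
    print(payload["text"])
```

Keeping the "what is new" logic separate from the posting step means you can see exactly what would land in the channel before anything gets access to your Slack.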

Now, one word of caution here: sharing information is great, but be careful how you do it. Your company Slack contains sensitive information, so you need to be very careful about the bots and external services you elect to integrate with it and what access they have. Before connecting anything to Slack, make sure you know what you're giving it the keys to!

I'm keen to hear about some of the channels you have in your Slack teams which open up information sharing and collaboration.

Let's chat? I'm usually found on Twitter @SparkleOps

Hello there! Starting to talk about engineering culture.

When I decided I wanted to start writing I had a great many things I thought I wanted to talk about, share and discuss. Once I had left all these topics and ideas to tumble dry in my head for a few days, one of the most important things to me turned out to be engineering and company culture in tech. But why?

Culture is (in part) the beliefs and values which shape how we do our craft with our peers and what drives us to do it. What values and beliefs are going to enable us to operate a happy, motivated and successful team? How do we uphold them? All worth talking about!

Here's something that really stood out as an example of why culture really, really matters.

I was recently talking specifically about DevOps practices and culture with @petegoo, who introduced me to the O'Reilly Velocity conference talk '10+ Deploys Per Day: Dev and Ops Cooperation at Flickr' by John Allspaw and Paul Hammond.

It's very easy to get lost in the weeds of the specific development techniques or operations processes that fused to give rise to the DevOps culture at Flickr. But if you watch through this talk there is a strong emphasis on a culture of trust and mutual respect, and on lowering risk and fear around change. Without these it's hard to imagine the developers and operations people creating such a great set of tools, systems and collaborative processes to build Flickr.

You can have the most elegant continuous integration implementation, funky Slack bots chirping away all the key metrics and alerts, and your own version of Etsy's Morgue for your post-mortems ... whatever. If the right culture is not behind it, driving all your people, it's going to be hit and miss.

I'm on a quest to learn from and chat with other tech people to get a better understanding of what makes a great engineering culture. Anywhere in tech, be it QA, Security or Product, I want to see what is working, what is not, and why.

This is my place to share those learnings and talk with you about your thoughts and findings too.