Monday, April 3, 2017

DevOps: Operations Can't Fail

Agile and DevOps are all about "Fail Fast", which is fine for developers, but absolutely unacceptable for Operations.

For a recent example, just look at the recent AWS outage:

http://www.recode.net/2017/3/2/14792636/amazon-aws-internet-outage-cause-human-error-incorrect-command

That was caused by someone debugging an application. None of us want our Operations department to be in that position, but it can obviously happen. I think there are one or more reasons behind why it happened, and I've got some opinions on how we need to work to ensure it doesn't happen to us:

Problem: Developers think Operations is easy

Absolutely everything labeled "DevOps" is aimed at allowing Development to do just enough "operations" to get by. But we in Operations know that it takes a lot more: Change and Configuration Management, Event Management, Business Service Modeling, and the list goes on and on. Individual Development teams don't necessarily understand these practices outside of their own application.

One Solution: We need to learn about "the new stuff"

The only way we'll be invited to the table to talk to development teams is to learn about the tools they're using (Jira, Puppet/Chef, Kubernetes, Docker, etc.). This will allow us to use a similar vocabulary when meeting with them. Without this basic knowledge, they simply won't invite us to any of their discussions.

Problem: Developers think Operations is unnecessary

Individual Development teams often don't see why the Operations department even exists. They have their tools that allow them to consistently deploy their application, so why does Operations need to be involved. They don't understand that any one of their 20-or-so "incidental" microservices may actually be absolutely critical to some other application in the environment.

One Solution: After learning the new stuff, ask to be involved

The Operations Manager needs to get involved with the Development teams. She needs to give Development teams some type of framework or process or SOMETHING that makes their application's metrics and availability visible to the Enterprise. This will allow ALL involved parties to understand the situation when there is an outage.

A great graphic from Ingo Averdunk at IBM


The parts in light blue (Logging, Monitoring, Event Mgmt, Notification, Runbook Automation, ChatOps and Root Cause Analysis) are those components that need to be standardized across all applications. If your Operations team isn't meeting with Development, you won't get to explain the need for the standard suite of tools.

There are other problems and other solutions

This post is meant to help Operations in a sea of DevOps information that is aimed only at Development, in the hope that we can reign things in and continue to ensure that the entire enterprise is healthy and available.

No comments: