The DevOps Dictionary (Part Three)

Marlo Vernon June 07, 2018

Finally, the last part of The DevOps Dictionary is here! We covered letters A-O in parts one and two. Today, we’ll finish the series of essential terms and DevOps definitions with letters P-Z.

Pain Point

“A pain point is a problem, real or perceived. Entrepreneurs create opportunities for themselves by creating solutions to those pain points. Solutions create value for everyone.” (Points and Figures)

Phases of Incident Lifecycle

From Jason Hand’s O’Reilly Media Book about Post-Incident Reviews:

Detection

  • “Knowing about a problem is the initial step of the incident lifecycle. When it comes to maintaining service availability, being aware of problems quickly is essential. Monitoring and anomaly detection tools are the means by which service owners or IT professionals keep track of a system’s health and availability.”

Response

  • “Once a problem is known, the next critical step of an incident’s lifecycle is the response phase. This phase typically accounts for nearly three-quarters of the entire lifecycle. Establishing the severity and priority of the problem, followed by investigation and identification, helps teams formulate the appropriate response and remediation steps. Consistent, well-defined response tactics go a long way toward reducing the impact of a service disruption.”
  • “The response phase can be unpacked even further into the following stages (Figure 8-2): Triage, Investigation, Identification.”

Remediation

  • “Through post-incident reviews, teams look to understand more regarding contributing factors of incidents during remediation.”

Analysis

  • “Clarification that learning from an incident is the top priority helps to establish expectations for any analysis. Gaining a diverse and multifaceted understanding of the incident as it cycles through the phases helps to uncover new and interesting ways to answer the questions ‘How do we know about the problem faster?’ and ‘How do we recover from the problem faster?’”

Readiness

  • “We use the idea of ‘readiness’ rather than ‘prevention’ for this final phase because our focus is on learning and improvement rather than prediction and prevention. This is a distinction from the Dreyfus model mentioned earlier.”

Post-Incident Review

“By following a new approach to post-incident reviews, we can make our systems much more stable and highly available to the growing number of people that have come to rely on the service 24 hours a day, every day of every year.”

“A company that validates and embraces the human elements and considerations when incidents and accidents occur learns more from a post-incident review than those who are punished for actions, omissions, or decisions taken.”

“Post-incident reviews help to eliminate assumptions and increase our confidence, all while lessening the time and effort involved in obtaining feedback and accelerating the improvement efforts.”

“The primary purpose of a post-incident review is to learn.”

“A well-crafted post-incident review will uncover the improvements that can be made in team formation and response.” (Jason Hand, O’Reilly Media Post-Incident Reviews)

Proactive Incident Management

“With a proactive approach, organizations try to detect potential threats before an incident occurs. Files from unknown or suspicious sources can be quarantined automatically for investigation. A proactive approach also includes regular, planned hunting exercises to detect threats that may be lurking in the system but not yet detected or perhaps not yet detonated.” (DevOps.com)

Pull

“Configuring a health check endpoint, that a centralised tool pulls data from. When dealing with the ‘pull’ model you’ll hear people suggest that rather than a simple ‘200 OK’ response you should add extra information that gives humans more understanding of the overall state of the service.” (Observability and Monitoring Best Practices)
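
As a rough illustration of the richer health check the quote describes, here is a minimal sketch using only the Python standard library. The `/health` path, the JSON field names, the `build_health_payload` helper, and the hard-coded dependency checks are all assumptions made for this example, not something the quoted source prescribes.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def build_health_payload(checks):
    """Summarize dependency checks (hypothetical names) into one health document."""
    healthy = all(checks.values())
    return {
        "status": "ok" if healthy else "degraded",
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "checks": {name: "ok" if passed else "failing" for name, passed in checks.items()},
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # A real service would run live probes here; these are hard-coded.
        payload = build_health_payload({"database": True, "queue": True})
        body = json.dumps(payload).encode("utf-8")
        # 200 when healthy, 503 otherwise, so status-only pollers still work.
        self.send_response(200 if payload["status"] == "ok" else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Block and answer health polls until interrupted."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. A centralised tool would then poll `/health` on its own schedule and read both the status code and the JSON body.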

Push

“Sending metrics to an analysis tool.” (Observability and Monitoring Best Practices)
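
For contrast with the pull model, here is a small sketch of a service pushing a metric itself. It assumes a collector that accepts the StatsD-style plain-text format (`name:value|type`) over UDP. The metric name, host, and port are placeholders chosen for this example.

```python
import socket


def format_metric(name, value, metric_type="g"):
    """Format one metric in the StatsD plain-text style: name:value|type."""
    return f"{name}:{value}|{metric_type}"


def push_metric(name, value, metric_type="g", host="127.0.0.1", port=8125):
    """Send one metric datagram to a collector (fire and forget over UDP)."""
    payload = format_metric(name, value, metric_type).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

A call such as `push_metric("api.latency_ms", 42, "ms")` sends one timing sample. Because UDP gives no delivery guarantee, the service never blocks waiting on the analysis tool.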

Reactive Incident Management

“Many organizations are successful at the integrated phase—and those currently in reactive see large improvements by focusing on being tactical.”

Consider these questions to establish which stage you’re currently in:

  1. Are communication and collaboration expectations outlined?
  2. Are the roles of every team member defined?
  3. Do you have processes in place for incidents—from occurrence to review?
  4. Is your team consistently learning and improving?

“If you can’t answer all of these questions, or if you answered ‘no’ to more than a few, your organization is likely in the reactive phase.” (Amanda Boughey, VictorOps)

Recovery / MTTR

“With MTTR, you measure the time it takes to recover from a production failure. It’s often a good way to measure the skill and flexibility of your organisation and it’s a measurement that should decrease with time.” (Solidify)
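
To make the measurement concrete, here is a short sketch that averages recovery time over a list of incidents. The incident timestamps are invented for illustration, and treating "detected" as the start of the clock is an assumption; teams define the start and end points differently.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (recovered - detected) across incidents; returns a timedelta."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((recovered - detected for detected, recovered in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents as (detected, recovered) pairs.
incidents = [
    (datetime(2018, 6, 1, 9, 0), datetime(2018, 6, 1, 9, 30)),    # 30 minutes
    (datetime(2018, 6, 3, 14, 0), datetime(2018, 6, 3, 15, 30)),  # 90 minutes
]
print(mean_time_to_recovery(incidents))  # 1:00:00
```

Tracking this number per month or per quarter shows whether it is actually decreasing over time.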

Regex

“A regular expression, regex (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for ‘find’ or ‘find and replace’ operations on strings.” (Wikipedia)
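
Here is a brief sketch of both operations using Python's built-in `re` module. The log line and the patterns are made up for this example.

```python
import re

# Hypothetical log line.
log_line = "2018-06-07 12:01:33 ERROR payment-api timeout after 3000ms"

# "Find": pull the severity and the service name out of the line.
pattern = re.compile(r"(?P<level>ERROR|WARN|INFO) (?P<service>[\w-]+)")
match = pattern.search(log_line)
level, service = match.group("level"), match.group("service")

# "Find and replace": mask every run of digits followed by "ms".
masked = re.sub(r"\d+ms", "<duration>", log_line)

print(level, service)  # ERROR payment-api
print(masked)          # 2018-06-07 12:01:33 ERROR payment-api timeout after <duration>
```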

Root Cause Analysis

An outdated term for the search for a single source of a system error or outage. The practice is now approached more holistically and referred to as a post-incident review.

“RCA is a helpful framework that codifies an approach to understanding the underlying cause of a failure in a system. While inclusive of contributing factors, RCA encourages practitioners to ‘get to the bottom’ of a problem by a relentless, dogged hunt for a single causal entity. However, in complex (distributed) systems such as the ones our customers build and maintain… There is NEVER a single causal entity. Thus we suggest a different, more modern approach to incident retrospectives.” (Matthew Boeckman, VictorOps)

SLA (Service-Level Agreement)

“A service-level agreement (SLA) is a commitment between a service provider and a client. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. The most common component of SLA is that the services should be provided to the customer as agreed upon in the contract.”

“As an example, internet service providers and telcos will commonly include service level agreements within the terms of their contracts with customers to define the level(s) of service being sold in plain language terms. In this case the SLA will typically have a technical definition in mean time between failures (MTBF), mean time to repair or mean time to recover (MTTR); identifying which party is responsible for reporting faults or paying fees; responsibility for various data rates; throughput, jitter; or similar measurable details.” (Wikipedia)
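
To show how MTBF and MTTR relate to an availability figure, here is a small sketch of the arithmetic. It uses the common steady-state approximation availability = MTBF / (MTBF + MTTR). The function names, the 30-day period, and the sample numbers are assumptions for illustration, not terms from any particular contract.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(target_percent, period_days=30):
    """Minutes of downtime a target such as 99.9 permits over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - target_percent / 100)


# A service failing about every 999 hours and taking 1 hour to recover:
print(round(availability(999, 1) * 100, 3))    # 99.9
# Downtime budget for a 99.9% target over 30 days:
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
```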

Site Reliability Engineer (SRE)

“Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to IT operations problems. The main goals are to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is ‘what happens when a software engineer is tasked with what used to be called operations.’” (Wikipedia)

SOC (System on a Chip)

“A system-on-a-chip (SOC) is a microchip with all the necessary electronic circuits and parts for a given system, such as a smartphone or wearable computer, on a single integrated circuit (IC).” (TechTarget)

Stages of Incident Management Maturity

Reactive

  • No stack awareness
  • Poor collaboration
  • Undefined roles
  • Lack of remediation

Tactical

  • Some defined process and roles
  • Segmented roles
  • Collaboration in crisis
  • Understood protocols

Integrated

  • Post Incident Review processes
  • Triage documentation
  • Collaboration across roles
  • Consistent communication

Holistic

  • Self remediation
  • Advanced metrics
  • Increased empathy
  • Consistent continuous learning

(Incident Management Buyers Guide, VictorOps)

SysAdmin (System Administrator)

“A system administrator, or sysadmin, is a person who is responsible for the upkeep, configuration, and reliable operation of computer systems; especially multi-user computers, such as servers. The system administrator seeks to ensure that the uptime, performance, resources, and security of the computers he or she manages meet the needs of the users, without exceeding a set budget when doing so.” (Wikipedia)

Triage

“The assigning of priority order to projects on the basis of where funds and other resources can be best used, are most needed, or are most likely to achieve success.” (Merriam-Webster)

Webhook

“Webhooks are such a powerful tool, allowing you to wire together products and create complex interactions – almost meta-programming. They don’t always fit together perfectly, but a simple relay like this gives DevOps engineers like you the ability to build the behavior that best suits your needs.” (Tara Calihman, VictorOps)
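
For readers who have not seen one, here is a minimal sketch of a webhook relay: it takes a payload from one product, reshapes it, and POSTs it to another. The payload fields (`severity`, `service`, `message`), the outbound `text` field, and the target URL are all hypothetical. This is not the relay the quoted post refers to.

```python
import json
import urllib.request


def transform(incoming):
    """Map a hypothetical alert payload onto a hypothetical chat-message shape."""
    severity = incoming.get("severity", "unknown").upper()
    service = incoming.get("service", "unknown service")
    message = incoming.get("message", "")
    return {"text": f"[{severity}] {service}: {message}"}


def build_request(target_url, incoming):
    """Build (but do not send) the outbound POST for the transformed payload."""
    body = json.dumps(transform(incoming)).encode("utf-8")
    return urllib.request.Request(
        target_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def relay(target_url, incoming, timeout=5):
    """Forward the transformed payload and return the HTTP status code."""
    request = build_request(target_url, incoming)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping `transform` separate from the network call means the reshaping logic can be tested without sending anything.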

Don’t forget to check out parts one and two of The DevOps Dictionary as well.