Jerry Pommer - February 04, 2015
Let’s define a few things here.
An annotation is simply a small bit of information that you can add to an alert that matches this rule. It can be a URL, an Image link, or a Note (just a blob of plain text). Useful examples are a runbook/triage document link or Graphite graph link. Sometimes just a note like “You can’t fix this - call Mike at 888-555-1212” is all you need.
A transform is a way to change alert data on the fly. By typing the name of the field in your alert that you want to affect in the Transform’s ‘alert field’ box, and the new value you want to set in its ‘new value’ box, this rule will change the incoming data field of that alert to whatever you have entered here (think changing the state of a less-important alert from ‘CRITICAL’ to ‘WARNING’ under certain conditions, or rerouting on the fly). Transforms can also add new fields to alerts that didn’t have them before.
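The mechanics of a transform are easy to picture if you think of an alert as a bag of field/value pairs. Here is a minimal sketch in Python, assuming an alert is just a dictionary (the dict representation and field names are my illustration, not VictorOps internals):

```python
# Sketch of a transform: overwrite an existing field, or add one that
# the alert didn't have before.
def apply_transform(alert, field, new_value):
    """Set `field` to `new_value`, adding the field if it doesn't exist."""
    changed = dict(alert)        # work on a copy
    changed[field] = new_value
    return changed

alert = {"state": "CRITICAL", "host_name": "db3.victorops.com"}
downgraded = apply_transform(alert, "state", "WARNING")   # CRITICAL -> WARNING
tagged = apply_transform(alert, "team", "ops")            # brand-new field
```

The same one operation covers both cases described above: changing a field that exists, and creating one that doesn't.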
You can add as many annotations and apply as many transforms to an alert as you like.
The stop flag is applied if you check the “Stop after this rule has been applied” option. Ordinarily your list of rules will be evaluated in order and applied one after the other if they match an alert. That means you can have several rules match different data within an alert and annotate/transform it with different information (more on this later). With the stop flag set, this rule will apply its annotations and transforms to a matching alert, and break the chain. The server will stop processing immediately and ignore any rules after this one.
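The in-order evaluation and the stop flag can be sketched as a toy rule engine (the rule structure here is my assumption for illustration):

```python
# Toy rule-chain evaluation: rules run in order, each matching rule is
# applied, and a rule with the stop flag set breaks the chain.
def run_rules(alert, rules):
    for rule in rules:
        if rule["match"](alert):
            rule["apply"](alert)     # annotate/transform in place
            if rule.get("stop"):
                break                # ignore every rule after this one
    return alert

rules = [
    {"match": lambda a: a["host"].startswith("db"),
     "apply": lambda a: a.setdefault("runbook", "http://wiki/db"),
     "stop": True},
    {"match": lambda a: True,        # would match, but never runs
     "apply": lambda a: a.update(team="ops")},
]
alert = run_rules({"host": "db7"}, rules)
```

With the stop flag set on the first rule, the second rule never fires even though its condition matches.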
Here is an example rule to give you some ideas. Let’s say we have an army of database servers being monitored, named db0, db1, db2, and so on, and we’d like to annotate any alerts having to do with those hosts with a load statistics graph image and database-server-specific triage document link:
This rule illustrates some of the more powerful features of the Transmogrifier. Start with the matching condition at the top:
When host_name matches db*.victorops.com
Note the use of the ‘*’ in the value field. You can use the ‘*’ and ‘?’ characters for simple wildcards, representing any string of characters or any single character, respectively. The wildcards can go anywhere in your matching string, as many times as you need them. The above example will match ‘db0.victorops.com’, ‘db-155zed.victorops.com’, or ‘db_zombiestein.victorops.com’. If we are only interested in ‘db0’ through ‘db9’, we could use the ‘?’ character instead of the ‘*’.
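If you want to test a pattern before saving a rule, Python's fnmatch module implements the same ‘*’ and ‘?’ semantics locally (a convenient stand-in; the Transmogrifier's matcher itself runs server-side):

```python
from fnmatch import fnmatchcase

# '*' matches any run of characters, including none:
print(fnmatchcase("db0.victorops.com", "db*.victorops.com"))    # True
print(fnmatchcase("web1.victorops.com", "db*.victorops.com"))   # False

# '?' matches exactly one character, so this only covers db0-db9:
print(fnmatchcase("db7.victorops.com", "db?.victorops.com"))    # True
print(fnmatchcase("db42.victorops.com", "db?.victorops.com"))   # False
```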
Next, we have filled in the annotations form fields with URLs and labels to create the ‘Runbook’ and ‘Load stats graph’ links shown:
The interesting thing to note about the Load stats URL is the use of ${{host_name}} as part of the URL target. host_name is a field from your alert, and by wrapping it in the dollar-double-curly syntax above, it will be expanded by the server at alert processing time into whatever value that alert field actually contains.
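The dollar-double-curly expansion amounts to a simple template substitution. Here is a sketch of the idea (the regex approach and the graph URL are my illustration, not the server's implementation):

```python
import re

# Replace each ${{field}} in a template with the alert's value for that
# field; unknown fields are left as-is.
def expand(template, alert):
    return re.sub(r"\$\{\{(\w+)\}\}",
                  lambda m: str(alert.get(m.group(1), m.group(0))),
                  template)

alert = {"host_name": "db7.victorops.com"}
url = expand("http://graphs.example.com/load?host=${{host_name}}", alert)
# -> "http://graphs.example.com/load?host=db7.victorops.com"
```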
In our case, this rule is going to match an alert from our host named ‘db7.victorops.com’ and produce a link to a host-specific load average graph, like this:
Cool, huh? The Runbook link, for now, is just to a wiki page documenting general things to check out on a database server. I’ll show you how you can get more specific in the next example, but first take a look at that transformation:
Set MATCH_KEY to new value ${{host_name}}::${{service}}::${{state}}
This is telling the server to create a new field on the alert called MATCH_KEY, since MATCH_KEY isn’t a field we would normally see in one of these alerts. When it does that, the server will also expand the variables used in the value field, inserting the hostname, service description, and state from the alert into our new MATCH_KEY field.
Why? Wait for it…
Now let’s create another rule below that first one:
This time, we’re looking for alerts with a field called MATCH_KEY, containing the string “db7.victorops.com::Database Load::CRITICAL”. That’s exactly what our first rule will add to the matching alert, so rule two will match if and only if the affected host is ‘db7.victorops.com’, the problem is ‘Database Load’, and the state is ‘CRITICAL’. Now this rule has a very clear picture of what the problem is and can be used to do specific things. Here it is in the edit view:
In this rule, since we now know exactly which host it is, we can override the Runbook link with one that is specific to the care and feeding of db7:
We could have done this with the ${{host_name}} variable in the first rule too - but this demonstrates some important concepts:
It is possible to modify the contents of an alert with one rule, such that one or more subsequent rules will also match it and provide more helpful information; and
By using tricks like the MATCH_KEY example, you can make more complicated and specific matching rules to really drill into the alert. A rule can only match on a single field, but the above trick consolidates the values of three fields into one, so a single rule can fire only when all three of those conditions are present.
Annotations and transforms having the same name in several rules can overwrite one another. The Runbook link in this rule will override the Runbook link applied by the first rule since it has the same name (“Runbook”). Also, the Load average graph image link applied by the first rule will pass through undisturbed, giving you the annotations applied by both rules.
The transform on this rule changes the value of an existing alert field instead of creating a new one as the first rule did. This one changes the alert’s routing key on the fly, paging the ‘ops3’ team, which is especially interested in the load problems we’re having on db7.
This rule also has the stop flag set. Since by this time we really know a lot about what is wrong, we don’t need the server to process any more rules after this one.
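Put together, the two-rule MATCH_KEY trick looks like this in miniature (the field names “service” and “routing_key” are illustrative assumptions, following the example above):

```python
# Rule 1 folds three fields into one composite key; rule 2's
# single-field match then fires only when all three agree.
alert = {"host_name": "db7.victorops.com",
         "service": "Database Load",
         "state": "CRITICAL"}

# Rule 1's transform: create the composite MATCH_KEY field.
alert["MATCH_KEY"] = "{}::{}::{}".format(
    alert["host_name"], alert["service"], alert["state"])

# Rule 2's match on the single MATCH_KEY field.
if alert.get("MATCH_KEY") == "db7.victorops.com::Database Load::CRITICAL":
    alert["routing_key"] = "ops3"   # reroute to the team watching db7
```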
Using this pattern, we could continue to have some very specific rules for certain hosts, or apply different graphs or documentation to alerts from a ‘db’ host about disk space instead of database load, and so on.
With the combination of wildcard matching on alert fields, variable expansion, and on-the-fly field transforms, you have a powerful set of tools to manage alerts in many interesting ways. Here are some suggestions:
Transform a warning or unknown alert to critical:
Or, transform a too-chatty critical alert to a warning and stop paging people in the middle of the night!
Transform an email alert from the global default to a desired state, based on the content of the email body:
If your team has embraced ChatOps, you might want to experiment with things like this:
…which could very well do the job for you and let you go back to sleep!
You may want to temporarily “turn off” a rule without deleting it and having to recreate it later. This can now be done through the options menu on each rule. Disabled rules remain in place, displayed in gray, and are simply skipped over by the server:
Also in the options menu of each rule is a ‘Preview’ option. This will show you a sample of the most recent alerts from your timeline that are matched by the rule:
In this example, we have two alerts in our timeline that are matched by this rule. The orange highlighted field is the one that it matched on. The preview also works when editing a rule, to help verify the accuracy of your matching criteria:
Here the field it matches on is again highlighted in orange. As you move the mouse pointer over the other fields, they are also highlighted in yellow. Clicking on a field will copy its values into the rule, allowing you to create a rule directly from alert data:
Now the matching fields of this rule have been replaced with the fieldname/value info that was clicked in the alert data, and the preview updated to show alerts affected by the new matching criteria. Typing in the fields with the preview on also updates the preview as you type to show you the results of your change.
The Transmogrifier is a powerful feature that opens up a new realm of possibilities for managing alerts and incidents. We do want to warn users to proceed with caution; it only takes a few added rules to change a critical alert into an informational one. What our beta users have overwhelmingly found is that the Transmogrifier allows you to further tailor and tame your alerts, customizing our tool to best suit the needs of your team.
There we go, giving you the documentation you need, right when you need it…making on-call suck less again.