More than just acking.

ITIL calls it the Incident Lifecycle. You might call it a night without sleep. Either way, chasing down an alert to final resolution can be grueling, and that's why VictorOps stays with you through the entire incident.

We provide teams with a virtual environment where they can prepare for, react to, and recover from each incident regardless of location or device.

Scroll to see what real-time incident management looks like with VictorOps.

VictorOps integrates with your existing enterprise monitoring systems.
Information about your infrastructure, service delivery problems, or any unique situation is ingested into VictorOps.
As this data enters VictorOps, your team gains situational awareness so that collaboration and problem solving can happen in real-time.
Once VictorOps consumes the alert, it begins the custom paging process you have predefined.
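
As a rough illustration of what that ingestion step can look like from the monitoring side, here is a minimal sketch of posting an event to a REST-style alert endpoint. The endpoint URL and API key are placeholders, and the field names simply mirror the ones used in the Transmogrifier examples later on this page, so treat this as an assumption rather than an API reference.

import json
import urllib.request

# Placeholder endpoint -- the real URL and API key come from your
# VictorOps integration settings.
ALERT_ENDPOINT = "https://alert.example.com/integrations/generic/alert/YOUR_API_KEY"

def send_alert(entity_id, state_msg, alert_type="CRITICAL", routing_key="everyone"):
    """Post a monitoring event so an incident can be opened or updated."""
    payload = {
        "entity_id": entity_id,      # identifies the failing host, check, or URL
        "state_msg": state_msg,      # human-readable description for the timeline
        "alert_type": alert_type,    # e.g. CRITICAL / WARNING / INFO
        "routing_key": routing_key,  # decides which team gets paged
    }
    request = urllib.request.Request(
        ALERT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Example: the event that opens incident #428 in the timeline below.
# send_alert("http://www.feedthemeter.com",
#            "ERROR 501 http://www.feedthemeter.com not responding")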

Alerting:

Custom and Intelligent Routing

Smart Alerts

Programmatically alert on-call team members to issues that need attention.
Unique to VictorOps, alerts are intelligently routed to specific team members based on the content of the alert itself.
Alerts become data rich as contextual information (runbooks, graphs, reports) is surfaced within the alert.
Actionable incidents flow to the right teams or individuals for real-time response (via SMS, email or phone call).
System data and team interactions now flow side by side in the VictorOps Incident Timeline.
Via the timeline, people are mobilized in a virtual environment where problems are solved in real-time as a single collaborative team, regardless of role or device.
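
To make that concrete, here is a small, purely illustrative sketch of content-based routing and paging. The team names, on-call users, and contact methods below are made up for the example; this is not how VictorOps itself is implemented.

# Illustrative on-call map: team -> (current on-call user, paging methods in order).
ON_CALL = {
    "frontend": ("frontalNed", ["email", "push", "sms"]),
    "backend":  ("backendBob", ["sms", "push"]),
    "ops":      ("sysMaggie",  ["email", "push"]),
}

def route_and_page(incident, alert):
    """Pick a team from the alert's content, then page whoever is on call for it."""
    msg = alert["state_msg"].lower()
    if alert["entity_id"].startswith("db"):
        team = "backend"
    elif "not responding" in msg:
        team = "frontend"
    else:
        team = "ops"
    user, methods = ON_CALL[team]
    for method in methods:
        print(f"PAGING: ({user}), for #{incident} via ({method})")

route_and_page(428, {
    "entity_id": "http://www.feedthemeter.com",
    "state_msg": "ERROR 501 http://www.feedthemeter.com not responding",
})
# -> PAGING: (frontalNed), for #428 via (email) ... much like the timeline entries below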

Triage:

Incident Timeline

Situational Awareness

Chat-like and Voice Collaboration

Investigation:

Identification:

Resolution:

Teams now have a single view of all activities surrounding the incident – alerts, paging, and messages. Alerts can also quickly be rerouted to a different person or team based on content or severity.
Easily assemble team members for a Control Call (voice conference) if more synchronous communication becomes necessary.
Share posts containing relevant remediation information or updates directly into the timeline to keep team members up to speed on steps being taken toward resolution.
Timeline
Control Call (888)555-1974
STATE RESOLVED
TIME 6:17 am
MSG STATUS 200 http://www.feedthemeter.com ok

INCIDENT: #428 was RESOLVED for (Alert - http://www.feedthemeter.com not responding) by (sysMaggie)

devMolly: woot! bob ftw!!!

CONTROL CALL: "Incident #428" ended

CONTROL CALL: @backendBob left "Incident #428"

CONTROL CALL: @sysMaggie left "Incident #428"

CONTROL CALL: @frontalNed left "Incident #428"

CONTROL CALL: @devMolly left "Incident #428"

devMolly: Website is back up. Server reboot was successful.

hubot: server3 reboot complete

hubot: rebooting server3

backendBob: @hubot reboot server3

devMolly: Server3 is indeed down. @backendBob is going to reboot.

devMolly: I’ll take notes from the conversation and post them to the timeline.

CONTROL CALL: @devMolly joined "Incident #428"

CONTROL CALL: @frontalNed joined "Incident #428"

CONTROL CALL: @backendBob joined "Incident #428"

CONTROL CALL: @sysMaggie joined "Incident #428"

CONTROL CALL: "Incident #428" started by @sysMaggie

sysMaggie: @bob I’m seeing a bunch of "Error 559ers." Is Server 3 offline?

SYSTEM: User sysMaggie routed incident #428 to Backend

STATE UPDATE
TIME 6:14 am
MSG this pug is hilarious (attached image: Pug_3D)

frontalNed: I wasn’t able to ssh. I think the server is down.

sysMaggie: @devMolly I’ll update your SSH permissions.

Twitter conventions like @messaging help pull other team members into the firefight.

devMolly: No response from the server.

INCIDENT: #428 was ACKED for (Alert - http://www.feedthemeter.com not responding) by (sysMaggie)

PAGING: (sysMaggie), for #428 via (email), via (push)

PAGING: (devMolly), for #428 via (sms), via (push)

PAGING: (frontalNed), for #428 via (email), via (push), via (sms)

STATE PROBLEM
TIME 6:03 am
MSG ERROR 501 http://www.feedthemeter.com not responding
Once an incident is resolved, all the activity, data, and posts from the timeline are stored for reuse and future learning. Reports are easily created so teams can adapt and learn from each incident.

Documentation & Continuous Improvement:

DevOps Reporting

Transmogrifier

Post-Mortem Reports

Using the Post-Mortem tool, users can pull a section of the timeline for use in retrospectives and reporting on SLAs for internal and external constituents.

Post-Mortem Reports

Facilitate discussion around whether all alerts are actionable and, if so, whether the runbooks and triage documentation were up to date.

Reporting

Other reports measure incident metrics, trends and MTTA/MTTR, all with the goal of improving your ongoing DevOps process.
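
For reference, MTTA (mean time to acknowledge) and MTTR (mean time to resolve) are simply averages over incident timestamps. The sketch below uses illustrative times loosely based on incident #428 from the timeline above (the acknowledgement time is assumed); it is not output from VictorOps reporting.

from datetime import datetime

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

def mtta_mttr(incidents):
    """Each incident is an (opened, acknowledged, resolved) tuple of datetimes."""
    mtta = mean_minutes([acked - opened for opened, acked, _ in incidents])
    mttr = mean_minutes([resolved - opened for opened, _, resolved in incidents])
    return mtta, mttr

# Incident #428: opened 6:03 am, acknowledged at an assumed 6:04 am, resolved 6:17 am.
incidents = [(datetime(2015, 5, 9, 6, 3),
              datetime(2015, 5, 9, 6, 4),
              datetime(2015, 5, 9, 6, 17))]
print(mtta_mttr(incidents))  # -> (1.0, 14.0) minutes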

The Transmogrifier

The future of DevOps alerting. This rules engine gives you more control over future incidents.
Maintain continuous documentation – Embed contextual information (triage docs, runbooks, Graphite graphs, notes, etc.) so you can solve the problem faster.
Quiet Alert Noise – Change the state of your alerts based on specific parameters.
Page the Right Person, Every Time – Assign a route key based on alert content for efficient delivery.

Example rules:

When state_msg matches *Staging*
Annotate the alert with: NOTE Staging Alert: no paging
Transform these alert fields: Set routing_key to new value database

When entity_id matches db* is down
Annotate the alert with: NOTE Staging: Changing alert type to warning so no one will be paged.
Transform these alert fields: Set alert_type to new value INFO

Other example conditions: When host_name matches *.acmeco.com; When state_msg matches *HUBOT*
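
To show the mechanics behind rules like the ones above, here is a minimal sketch of matching and transforming an incoming alert. The rule format and function names are illustrative assumptions, not the Transmogrifier's actual implementation.

import fnmatch

# Rules mirroring the examples above:
# (field to test, wildcard pattern, annotation to add, field updates to apply).
RULES = [
    ("state_msg", "*Staging*",
     "Staging Alert: no paging", {"routing_key": "database"}),
    ("entity_id", "db* is down",
     "Staging: Changing alert type to warning so no one will be paged.", {"alert_type": "INFO"}),
]

def transmogrify(alert):
    """Annotate and transform an alert according to every rule that matches it."""
    for field, pattern, note, updates in RULES:
        if fnmatch.fnmatch(str(alert.get(field, "")), pattern):
            alert.setdefault("notes", []).append(note)
            alert.update(updates)
    return alert

alert = {"entity_id": "db01 is down", "state_msg": "connection refused",
         "alert_type": "CRITICAL", "routing_key": "ops"}
print(transmogrify(alert))  # alert_type is downgraded to INFO and the note explains why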

Post-Mortem Report for Incident #428 (a timeline excerpt with annotations added on May 11, 2015):

16:04 May 09, 2015 INCIDENT: #428 was OPENED for SERVICE (Website not responding)
Annotation (09:37 May 11, 2015): @sysMaggie opened incident responding to alert.

16:05 May 09, 2015 STATE UPDATE, TIME 6:14 am, MSG this pug is hilarious (attached image: Pug_3D)

16:06 May 09, 2015 SYSTEM: User sysMaggie routed incident #428 to Backend
Annotation (09:37 May 11, 2015): @sysMaggie Reroute to backend restart server02.
More Transmogrifier example rules:

When entity_id matches route change
Annotate the alert with: NOTE HUBOT COMMAND: hubot restart apache on host 192.168.119.19
Transform these alert fields: Set alert_type to new value INFO

When routing_key matches Twilio
Annotate the alert with: NOTE Staging: Changing alert type to warning so no one will be paged.
Transform these alert fields: Set message_type to new value INFO

We're the Swiss Army Knife of DevOps.

We combine data, alerting and remediation functionality into one elegant, simple web and mobile platform.

Join companies big and small, and the thousands of users who trust VictorOps to help make uptime the new normal.


Stay mobile.

With a swipe of our native Android and iPhone mobile apps, you’re up to speed instantly with what your infrastructure and systems are doing. And when your infrastructure stops doing what it’s supposed to do, you can see exactly when and in what context the whole scenario went down.

Get it on Google Play or download it from the iTunes App Store.