VictorOps (now Splunk On-Call) is an incident management and IT alerting platform for DevOps and IT Ops teams. With Splunk On-Call, these teams can provide faster, more proactive support. As a result, they can collaborate better across functions, resolve incidents more efficiently, and reduce time to acknowledge. These teams also benefit from rich contextual notifications and greater visibility into critical incidents.
What is VictorOps?
VictorOps (now Splunk On-Call) is a real-time incident management and response platform designed for DevOps and IT Ops teams. The platform collects critical data from IT and DevOps monitoring systems to support efficient incident response through centralized information, automated alerts, and robust documentation.
VictorOps handles the entire incident lifecycle: identification, logging, categorization, prioritization, diagnosis, escalation, resolution, and closure. It also offers features such as context-rich alerts, intelligent routing, and on-call management to coordinate the people involved. Finally, its reporting and documentation provide actionable insights for solving the problem at hand and feed knowledge back into future incidents.
- Deployment: Cloud app, SaaS, and web-based interface
- Mobile: Android and iOS
- Customer Support: Email, Help Desk, Phone, Chat, and Knowledge Base
- Pricing model: Prices start at $5 per user/month for the Starter edition (more on pricing below)
- Free Trial: A free trial is available for 14 days
What is VictorOps known for?
VictorOps is known for its exceptional incident response automation. Users can automate vital incident response processes, including escalations (through escalation policies), war rooms, and post-incident reviews. VictorOps also provides on-call scheduling and automated rotations. This level of automation lets teams focus on resolving and remediating incidents.
VictorOps (Splunk On-Call) Highlights
Below are some of the critical features of the VictorOps platform.
- Native iOS and Android applications: Receive, act on, and resolve incident alerts right from an iOS or Android device. The mobile app also includes features like alert rerouting and snoozing, so you can attend to alerts on the move.
- Improved incident reporting: Splunk On-Call comes with valuable reports such as Incident Frequency, MTTA/MTTR, and Post-Incident Review. These reports allow cleaner incident analysis and drive simpler, faster problem resolution.
- Contextual incident information: Every alert comes with contextual incident information that speeds up problem-solving and improves accuracy.
- Seamless integration with existing toolkits: The VictorOps platform integrates with your current monitoring and collaboration systems. It comes with pre-configured APIs to integrate easily with Slack, ServiceNow, StatusPage.io, Datadog, AppDynamics, AWS, and many more.
- Transmogrifier: An advanced alert and incident management feature. With the Transmogrifier, users can customize and adapt their alerts to suit the team’s needs.
- Recommended Responders: A Machine Learning (ML) engine looks at past resolvers and similar historical incidents to recommend the right users and information. These recommendations help teams resolve incidents faster and with higher accuracy.
- Rules Engine: A feature that delivers “extra” helpful information and resources to users when they are notified of an incident. The rules engine adds more context to incidents, along with resources like remediation documentation, runbooks, and articles.
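As a rough illustration of what a rules engine of this kind does, the sketch below matches an incoming alert's fields against simple rules and, on a match, attaches extra context such as a runbook link. The `Rule` class, the field names, the example URL, and the wildcard matching are all hypothetical and only loosely modeled on the idea described above; this is not Splunk On-Call's actual rule syntax.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Rule:
    """Hypothetical rule: when `match_field` matches `pattern`, add `annotations`."""
    match_field: str
    pattern: str  # shell-style wildcard, e.g. "db-*"
    annotations: dict = field(default_factory=dict)


def apply_rules(alert: dict, rules: list[Rule]) -> dict:
    """Return a copy of the alert enriched by every rule that matches it."""
    enriched = dict(alert)
    for rule in rules:
        value = str(alert.get(rule.match_field, ""))
        if fnmatch(value, rule.pattern):
            enriched.update(rule.annotations)
    return enriched


# Illustrative rules and alert; the values are made up for this example.
rules = [
    Rule("host", "db-*", {"runbook_url": "https://wiki.example.com/runbooks/database"}),
    Rule("state_message", "*disk*", {"note": "Check disk usage before restarting"}),
]

alert = {"host": "db-01", "state_message": "disk usage at 95%"}
enriched_alert = apply_rules(alert, rules)
```

The original alert is left untouched, so the enriched copy can be shown to responders while the raw alert stays available for auditing.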
Pros
- Robust collaboration and communication features
- Rich and seamless integration
- The mobile apps (Android and iOS) are fully developed
- Centralized view for incidents, timeline, and people
- Amazing customer support
Cons
- API needs some improvement
- UI can be clunky at times
- Difficulty in managing the calendars
- It can be challenging to override shifts and call management
- It can be more expensive than most competition
Getting Started with VictorOps (Splunk On-Call)
VictorOps recommends creating teams as a first step.
- To create a team, go to Teams > Add Team. This menu item lets you configure member lists, on-call schedules, shifts, rotations, escalation policies, and scheduled overrides. Creating a team makes all communication, collaboration, automation, and scheduling tasks much easier. Once you have created your team, you can go ahead and invite users.
- To add users, go to Users > Invite Users and enter their email addresses.
- You can also add users by integrating an existing application through the API. To integrate an application, go to Integrations > API. Here you will need the API ID and key.
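To show where an API ID and key come into play, here is a minimal sketch that builds (but does not send) an authenticated request to create a user. The base URL, the `/user` path, the `X-VO-Api-Id` and `X-VO-Api-Key` header names, and the payload fields are assumptions that do not come from this article; check them against the official API reference before relying on them.

```python
import json
import urllib.request

# Assumptions (not from the article): base URL, endpoint path, header names,
# and payload fields. Verify against the official API reference.
API_BASE = "https://api.victorops.com/api-public/v1"


def build_add_user_request(api_id: str, api_key: str, user: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create a user."""
    return urllib.request.Request(
        url=f"{API_BASE}/user",
        data=json.dumps(user).encode("utf-8"),
        method="POST",
        headers={
            "X-VO-Api-Id": api_id,
            "X-VO-Api-Key": api_key,
            "Content-Type": "application/json",
        },
    )


request = build_add_user_request(
    api_id="YOUR_API_ID",    # placeholder, taken from Integrations > API
    api_key="YOUR_API_KEY",  # placeholder
    user={
        "firstName": "Ada",
        "lastName": "Lovelace",
        "username": "alovelace",
        "email": "ada@example.com",
    },
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the headers and body before anything touches a live account.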
Once you have the teams and team members, it is time to build scheduled rotations and escalation policies. The rotations or on-call shifts are shared across multiple users. These rotations need to be tied to an escalation policy, which specifies how to treat an incident, for instance, which incidents to route, to whom, and how to escalate.
- To do this, head to Teams > On-Call Schedule > Set up Escalation Policy.
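The relationship between rotations and escalation policies can be sketched in a few lines. This is a simplified, hypothetical model (the `EscalationStep` class, the rotation names, and the user names are invented for illustration): a rotation cycles through its members shift by shift, and an escalation policy lists who to page after an incident has gone unacknowledged for a given number of minutes.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    delay_minutes: int  # minutes after the incident triggers
    notify: str         # a rotation name or a single user name (hypothetical)


def current_on_call(rotation: list[str], shift_index: int) -> str:
    """Pick the on-call user for a shift by cycling through the rotation."""
    return rotation[shift_index % len(rotation)]


def who_to_notify(policy, rotations, minutes_unacknowledged, shift_index):
    """Return everyone who should have been paged so far for an unacknowledged incident."""
    notified = []
    for step in policy:
        if step.delay_minutes <= minutes_unacknowledged:
            target = step.notify
            if target in rotations:  # resolve a rotation to its current member
                target = current_on_call(rotations[target], shift_index)
            notified.append(target)
    return notified


rotations = {"primary": ["alice", "bob", "carol"], "secondary": ["dave", "erin"]}
policy = [
    EscalationStep(0, "primary"),     # page the primary rotation immediately
    EscalationStep(15, "secondary"),  # escalate after 15 minutes
    EscalationStep(30, "team-lead"),  # escalate to a named user after 30 minutes
]

paged = who_to_notify(policy, rotations, minutes_unacknowledged=20, shift_index=4)
```

With 20 minutes elapsed on shift 4, the first two steps have fired, so `paged` holds the current primary and secondary on-call users while the team lead has not yet been notified.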
As the next step, set up routing keys. Setting them up is an essential part of the initial configuration. A routing key links an alert from a third-party monitoring tool to a specific user, team, or escalation policy. All kinds of alerts have routing keys. In Splunk On-Call, routing keys help get the right person working on the problem and reduce alert noise for users and teams unrelated to the incident.
- To set a routing key, head to Settings > Routing Keys > Add Key.
- Enter a name. It is recommended to use the name of the team or policy in charge of the alerts, or the name of the monitoring tool that is the source of the alerts.
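To make the role of the routing key concrete, the sketch below builds the URL and JSON body for an alert sent through a generic REST endpoint, with the routing key as the last URL segment. The endpoint URL shape and the payload field names (`message_type`, `entity_id`, `state_message`) are assumptions that do not come from this article; confirm them in the REST endpoint integration documentation. The key and routing key values are placeholders.

```python
import json
from urllib.parse import quote

# Assumption: URL shape of the generic REST endpoint integration.
REST_ENDPOINT = "https://alert.victorops.com/integrations/generic/20131114/alert"


def build_alert(rest_api_key: str, routing_key: str, entity_id: str, message: str,
                message_type: str = "CRITICAL") -> tuple[str, bytes]:
    """Return the (url, JSON body) for an alert addressed to one routing key."""
    url = f"{REST_ENDPOINT}/{quote(rest_api_key, safe='')}/{quote(routing_key, safe='')}"
    body = json.dumps({
        "message_type": message_type,  # severity of the alert
        "entity_id": entity_id,        # identifies the affected thing
        "state_message": message,      # human-readable description
    }).encode("utf-8")
    return url, body


# "database-team" is a hypothetical routing key named after the owning team.
url, body = build_alert("YOUR_REST_KEY", "database-team", "db-01/disk", "Disk usage at 95%")
```

Because the routing key is part of the address, the monitoring tool decides which team is paged simply by choosing which URL it posts to.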
Navigating through the VictorOps console
The web console comes with seven sub-menus in the top navigation bar: Incidents, Timeline, Reports, Teams, Users, Integrations, and Settings.
The Incidents View
The incidents view shows the current status of incidents across all teams. The list summarizes the triggered, acknowledged, snoozed, and resolved incidents.
To get started with Splunk On-Call, it is essential to centralize the alerts from all your data sources. You do this by integrating it with your monitoring and alerting tools. With Splunk On-Call, all your integrations feed the alerts page and create incidents.
The software can integrate with a diverse and large number of tools, including monitoring, DevOps, Business, and Security software.
- To see all the integrations, go to Integrations in the top navigation bar.
- Splunk On-Call recommends configuring non-alerting integrations, such as chat, before setting up alert integrations.
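The way a stream of alerts turns into a list of incidents with states can be modeled very simply. The sketch below is a hypothetical simplification, not the platform's actual logic: alerts that share an `entity_id` update the same incident, and the alert's `message_type` decides the incident's state. The type names and the mapping are assumptions, and snoozing is left out.

```python
# Simplified, hypothetical mapping from alert type to incident state.
STATE_FOR_TYPE = {
    "CRITICAL": "triggered",
    "ACKNOWLEDGEMENT": "acknowledged",
    "RECOVERY": "resolved",
}


def fold_alerts(alerts: list[dict]) -> dict[str, str]:
    """Collapse a stream of alerts into one state per entity (incident)."""
    incidents: dict[str, str] = {}
    for alert in alerts:
        new_state = STATE_FOR_TYPE.get(alert["message_type"])
        if new_state is None:
            continue  # unknown or informational types leave the state unchanged
        incidents[alert["entity_id"]] = new_state
    return incidents


def summarize(incidents: dict[str, str]) -> dict[str, int]:
    """Count incidents per state, similar in spirit to an incidents list summary."""
    counts: dict[str, int] = {}
    for state in incidents.values():
        counts[state] = counts.get(state, 0) + 1
    return counts


# Made-up alert stream for illustration.
alerts = [
    {"entity_id": "db-01/disk", "message_type": "CRITICAL"},
    {"entity_id": "web-02/cpu", "message_type": "CRITICAL"},
    {"entity_id": "db-01/disk", "message_type": "ACKNOWLEDGEMENT"},
    {"entity_id": "web-02/cpu", "message_type": "INFO"},
    {"entity_id": "api/latency", "message_type": "CRITICAL"},
    {"entity_id": "api/latency", "message_type": "RECOVERY"},
]

incidents = fold_alerts(alerts)
summary = summarize(incidents)
```

Six alerts collapse into three incidents here, which is the kind of noise reduction that centralizing alerts is meant to provide.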
The Timeline View
The Splunk On-Call Timeline View is a single central dashboard. According to Splunk, the Timeline view “gives you a real-time firehose to give your team the full context during the firefight.”
The timeline view is divided into people, timeline, and incidents. It lets you stay informed in real time about what is currently happening, browse through the timeline, and collaborate with people via the native and integrated chat.
The Reports Page
This page provides continuously updated documentation of all activity associated with a specific time range, user, team, or alert, and lets you generate customized reports from it. Such reports are vital for keeping track of team activity and performance.
- To generate a report, go to the Reports page in the top navigation bar.
Splunk On-Call comes with four reporting features:
- Post-Incident Review: This report gives you historical insight into the events of a single incident or a time range. If a similar problem comes up again, the post-incident review provides a well-documented account to help you correct it.
- Performance (MTTA/MTTR) Report: This report shows how well the team is meeting its quantified objectives. It uses response metrics such as MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve) to show overall performance.
- On-Call Report: A report that provides an overview of the team’s workload and a detailed view of each individual’s workload.
- Incident Frequency Report: A report that lets you and your team analyze the flow of incidents after they have happened. The incident frequency report helps you trace incidents back to the source of the problem so that you can solve them with more precision.
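The MTTA and MTTR metrics mentioned above are simple averages, and a short sketch makes the definitions concrete. The incident timestamps below are invented for illustration, and measuring both metrics from the moment an incident is triggered is an assumption about how the durations are defined.

```python
from datetime import datetime, timedelta


def mean_delta(pairs):
    """Average the time elapsed between (start, end) timestamp pairs."""
    deltas = [end - start for start, end in pairs]
    return sum(deltas, timedelta()) / len(deltas)


# Made-up incident records for illustration.
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 9, 0),
        "acknowledged": datetime(2024, 1, 1, 9, 4),
        "resolved": datetime(2024, 1, 1, 9, 30),
    },
    {
        "triggered": datetime(2024, 1, 2, 14, 0),
        "acknowledged": datetime(2024, 1, 2, 14, 10),
        "resolved": datetime(2024, 1, 2, 15, 30),
    },
]

# Mean Time to Acknowledge: trigger -> acknowledgement, averaged over incidents.
mtta = mean_delta([(i["triggered"], i["acknowledged"]) for i in incidents])
# Mean Time to Resolve: trigger -> resolution, averaged over incidents.
mttr = mean_delta([(i["triggered"], i["resolved"]) for i in incidents])
```

For these two incidents the acknowledgement times are 4 and 10 minutes and the resolution times are 30 and 90 minutes, giving an MTTA of 7 minutes and an MTTR of 60 minutes.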
VictorOps (Splunk On-Call) Pricing
VictorOps, now Splunk On-Call, is part of Splunk’s observability solutions. The platform is priced on a per-user/month basis. If you pay annually, the plans cost $5 (Starter), $23 (Growth), and $25 (Enterprise). If you pay monthly, they cost $10 (Starter), $35 (Growth), and $45 (Enterprise).
Note that all the plans include the iOS and Android Mobile App. Below is a brief description of every plan:
- Starter: Up to 10 users for collaborative incident response. It does not support ITSM integrations (ServiceNow, JIRA).
- Growth: Unlimited users. It does not support ITSM integrations (ServiceNow, JIRA).
- Enterprise: Unlimited users. Supports ITSM integrations (ServiceNow, JIRA). Includes Machine Learning to identify similar incidents and suggest responders.
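To compare billing options for a given team size, the arithmetic can be sketched as follows. The prices are the per-user/month figures quoted above; treating the annual-billing price as a monthly rate paid for twelve months is an assumption, as are the function and constant names.

```python
# Per-user monthly prices quoted in this article, in US dollars.
PRICES = {
    "Starter": {"annual": 5, "monthly": 10},
    "Growth": {"annual": 23, "monthly": 35},
    "Enterprise": {"annual": 25, "monthly": 45},
}
STARTER_MAX_USERS = 10  # Starter is described as "up to 10 users"


def yearly_cost(plan: str, users: int, billing: str = "annual") -> int:
    """Total cost for one year, assuming prices are per user per month."""
    if plan == "Starter" and users > STARTER_MAX_USERS:
        raise ValueError("Starter is limited to 10 users")
    return PRICES[plan][billing] * users * 12


# Example: a 20-person team on the Growth plan.
growth_annual = yearly_cost("Growth", 20, "annual")    # 23 * 20 * 12
growth_monthly = yearly_cost("Growth", 20, "monthly")  # 35 * 20 * 12
savings = growth_monthly - growth_annual
```

Under these assumptions, a 20-user Growth team would pay $5,520 per year when billed annually versus $8,400 when billed monthly, a difference of $2,880.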
Free Trial? Register without a credit card to try VictorOps (Splunk On-Call) free for 14 days.
Top VictorOps Alternatives List
Below are the top alternatives to VictorOps. Some of them are similar incident management and response tools aimed at DevOps, SRE, and IT Ops teams, with robust integration capabilities. Others are monitoring, observability, or AIOps tools that come with incident management capabilities.
1. Datadog
Datadog is a SaaS-based infrastructure and application monitoring solution. It monitors the entire application stack (on-prem or cloud) at any scale, all through a SaaS-based data analytics platform. Datadog also provides robust incident management and error tracking. Register for a 14-day free trial!
2. PagerDuty
PagerDuty is an incident management and response platform for real-time operations. It integrates machine data with human intelligence to enhance an organization’s visibility. PagerDuty also provides on-call management, automated incident response, runbook automation, event management, and operational analysis. Sign up for a 14-day free trial.
3. Opsgenie
Opsgenie by Atlassian is a modern incident management platform. It integrates with third-party monitoring systems and applications to receive alerts. Opsgenie also comes with features like on-call scheduling, escalation policies, actionable and reliable alerting, and advanced reporting and analytics. Sign up for a 14-day free trial!
4. xMatters
xMatters is a service reliability platform that helps DevOps, SRE, and IT operations teams ensure their applications are up and running. The xMatters platform includes automated incident management, on-call management, alert intelligence, workflow automation, and analytics. Sign up for a free trial.
5. Freshservice
Freshservice is an intelligent, unified, cloud-based service desk and IT Service Management (ITSM) solution. It is designed around ITIL best practices to manage IT operations problems and service requests more efficiently. Freshservice is also easy to set up and use, offers a broad and diverse list of integrations, and comes with robust incident management features. Sign up for a 21-day free trial.
6. New Relic One
New Relic One is a robust cloud-based observability platform that helps DevOps teams collaborate and solve problems. New Relic focuses on performance and availability monitoring. It covers application performance monitoring (APM), network, infrastructure, log management, and more. New Relic also allows incident orchestration and response with AIOps. New Relic offers a perpetual, limited free license.
7. Moogsoft
Moogsoft is an AIOps and observability platform for DevOps, SRE, and IT operations teams. This tool provides a central platform for all data sources and monitoring tools. It collects raw data, such as metrics and notifications, then correlates and normalizes it into a list of incidents. Sign up to get a 14-day free trial.
8. BigPanda
BigPanda is an IT Ops event correlation and automation platform powered by AIOps. It aggregates, correlates, and normalizes data gathered from different monitoring and observability tools. Then it uses AI and ML to turn that data into actionable insights. The tool also provides automated root-cause analysis and level-0 automation.