BigPanda AIOps Adds Root-Cause Changes To Speed Incident and Outage Resolution

BigPanda’s latest AIOps update aims to help teams more quickly detect and resolve issues that can impact both cloud-native and hybrid environments.  IDN speaks with company vice president Mohan Kompella.

Tags: AI, AIOps, analytics, BigPanda, incident, IT ops, machine learning, MTTR, root cause, ServiceNow

Mohan Kompella
vice president for product marketing
BigPanda


"BigPanda in minutes can isolate the root cause change that resulted in an incident or outage. This can take hours - even days - with traditional tools and practices."


BigPanda is shipping its latest AIOps update, which aims to help teams more quickly detect and resolve issues that can impact both cloud-native and hybrid environments.

 

Specifically, BigPanda helps IT Ops, DevOps and NOC (network operations center) teams detect incidents and outages, visualize them, identify their probable root cause, understand their impact on users and customers, and route them to the right teams for rapid resolution, all in real time, Mohan Kompella, BigPanda’s vice president for product marketing, told IDN.

 

“BigPanda in minutes can isolate the root cause change that resulted in an incident or outage,” Kompella said. “This is a task that can take hours or even days with traditional siloed tools and NOC practices.”

 

Core to this outcome are BigPanda’s latest AIOps features -- Root Cause Changes (RCC), Real-time Topology Mesh and a new infusion of response-optimized AI/ML. With these updates, BigPanda can ingest three of the most critical datasets in IT operations: alerts, changes and topology.

 

[BigPanda’s RCC feature leverages the company’s Open Box Machine Learning and its Open Integration Hub technologies.]

 

“BigPanda’s new offering puts root-cause change behind an outage at the IT Ops teams’ fingertips, slashing mean-time-to-resolution and improving the performance of critical systems and applications,” said BigPanda CEO Assaf Resnick in a statement.

 

BigPanda’s improvements come as enterprises continue to feel the stress from using traditional legacy IT operations tools and root cause analysis techniques, Kompella added. “As companies move to hybrid or modern cloud architectures, legacy tools and practices are becoming ineffective,” he said. In the past, traditional technologies worked because many problems were related to failures with on-prem infrastructure, monolithic apps or hardware.

 

“Today’s enterprise architecture is much different. There is a lot of complexity with the IT stack, and it’s all moving very fast. So, for example, microservices, containers, et cetera all generate an enormous amount of data. We’re talking like 10, 20, a hundred, maybe a thousand times the volume of data that the old IT stack [generated],” Kompella said.

 

Equally challenging, Kompella added, is just how quickly the IT stack moves. “A retailer, for example, might be bringing up a hundred new web servers to handle load because there's a flash sale. And these 100 web servers may be alive for only 10 minutes, five minutes – sometimes even just 1 minute before they disappear into the ether,” he said.

 

“This combination of complexity and rapid change makes it much more difficult to detect and deliver quick times to resolution,” Kompella added. To meet this challenge, BigPanda’s update leverages RCC and mesh technology and uses AI/ML to speed detection, investigation and resolution -- what BigPanda calls an “incident management lifecycle.” With AI/ML, BigPanda can correlate and analyze vast volumes of data in real time to reduce mean time to resolution (MTTR).

 

“The very first part of this - the detection – that’s where AI/ML for us plays a huge role,” Kompella said. Here is where BigPanda aims to aggregate huge data volumes from multiple tools across the stack and deliver fast results with a clear picture of what is going on.

 

“If we hear one thing from customers over and over, it’s ‘I have this complex hybrid IT stack [on prem and cloud native], and I don't have a single AIOps tool that can help me detect, investigate and resolve outages and incidents quickly in like near real-time,’” Kompella said.

 

To address this point, BigPanda works with JIRA, Jenkins and Ansible, among other tools. It also has deep integration with ServiceNow. This attention to third-party integration lets BigPanda ingest and bring together a vast volume of data from many different tools.

 

To further ensure a ‘clear’ picture, BigPanda’s Real-time Topology Mesh ingests topology data from cloud & virtualization management, service discovery, APM and CMDB tools to create a full-stack, always up-to-date topology model.


Once BigPanda has captured this data, it runs its ML algorithms against the thousands or tens of thousands of monitoring alerts or events that may have come in, Kompella said. “And we then rapidly correlate them to see if there are patterns – and if they are related [to the same outage or incident], or not.”

 

Because of this approach, BigPanda “can become this one source of truth because we have this holistic view of every single change in your environment. So, we can say, ‘Out of these thousands of alerts that came in, all they’re really talking about is just these three incidents. And here they are. So go look at just these three areas,’” Kompella added.
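As a rough illustration of what reducing thousands of alerts to a handful of incidents can look like, here is a minimal Python sketch. It is not BigPanda's algorithm, which the article does not describe in detail. It assumes a simple rule: alerts for the same service that arrive within a fixed time window of each other belong to the same incident. The `Alert` and `Incident` types, the `correlate` function and the five-minute default window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical field)
    service: str      # service the alert refers to (hypothetical field)
    timestamp: float  # seconds since epoch


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        # Time of the most recent alert attached to this incident.
        return max(a.timestamp for a in self.alerts)


def correlate(alerts, window_seconds=300.0):
    """Group alerts into incidents.

    Assumed rule: an alert joins the latest incident for its service if it
    arrives within window_seconds of that incident's last alert; otherwise
    it opens a new incident.
    """
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_by_service.get(alert.service)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest_by_service[alert.service] = current
    return incidents
```

A production system would presumably also use topology and learned patterns rather than a single time window; the sketch only shows the basic shape of many alerts collapsing into a few incidents.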

 

From this initial detection or discovery phase, Kompella described how BigPanda lets users investigate and identify the cause, and ultimately resolve the issue.

 

“Once someone sees an incident, then they say, ‘Alright, what caused the incident? How do I investigate this?’ BigPanda automatically [presents] what we think is the probable root cause for that with a very high degree of accuracy,” Kompella said.

 

To do this, BigPanda marries data-intensive analytics and AI. “So, for example, we can look at say 3,000 changes that happened last week and then tell you which top five changes [you made] likely caused this issue,” he said. “All of this is done with AI automatically without human intervention.” It does not require someone to go into a change management tool and manually look at every single change record, he added.
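To make the idea of narrowing thousands of changes down to a short list of suspects concrete, the following Python sketch ranks changes with a simple heuristic. It is an assumption for illustration only, not BigPanda's ML: a change scores higher when it was made shortly before the incident and when it touched the same service. The `Change` type, `score_change`, `top_suspects`, the one-hour half-life and the 0.25 weight for other services are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    change_id: str
    service: str      # service the change was applied to (hypothetical field)
    timestamp: float  # seconds since epoch


def score_change(change, incident_service, incident_time, half_life_seconds=3600.0):
    """Return a heuristic suspicion score between 0 and 1.

    Changes made after the incident score 0. Otherwise the score halves for
    every half_life_seconds between the change and the incident, and changes
    to a different service are down-weighted.
    """
    age = incident_time - change.timestamp
    if age < 0:
        return 0.0
    recency = 0.5 ** (age / half_life_seconds)
    service_weight = 1.0 if change.service == incident_service else 0.25
    return recency * service_weight


def top_suspects(changes, incident_service, incident_time, limit=5):
    """Return up to `limit` changes, most suspicious first."""
    scored = [(score_change(c, incident_service, incident_time), c) for c in changes]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:limit]]
```

Calling `top_suspects(changes, "checkout", incident_time)` on a week of change records would return at most five candidates for a human to review, which is the workflow Kompella describes.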

 

This focus on data correlation, automation, AI and visualization means BigPanda users can avoid the dreaded “bridge call from hell,” which can last for hours, Kompella added.

Let’s say there is a security setting change that is needed. At first, everything looks green, but two hours in there is some unexpected behavior.

When such an incident occurs, a NOC user can go to a tab in the BigPanda console called ‘Related Changes’ and see all that BigPanda knows about this occurrence.

While the incident is happening (not later), BigPanda gives them a list of all the changes associated with that application or that service that they never had access to before.

Our ML also kicks in and says, ‘Alright, I've identified these three changes, including this one change that seems like a security config setting change. That change happened around the same time [as the incident]. I’ve seen this happen before, and therefore I'm now going to surface this.’

That’s a huge game-changer for these customers or these NOC people because earlier they were at the mercy of the change management teams or the help desk teams. But now they have all the changes right next to the incident.

With BigPanda’s RCC and Mesh updates, users can amass a deep, historical record of incidents – and how they correlate to one another, Kompella said.

 

“Let's say over three months, or let's say six months, you'll now have hundreds of thousands of changes and then hundreds of thousands of incidents that have come through BigPanda. They get correlated and matched and all of that. The result is users can ask questions like ‘What are the top five types of changes that caused my incidents?’”
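Once incidents have been matched to the changes believed to have caused them, a question like the one Kompella poses comes down to a frequency count. The Python sketch below assumes the historical record is available as pairs of incident ID and change type; that data shape and the `top_change_types` helper are hypothetical, not part of any BigPanda API.

```python
from collections import Counter


def top_change_types(matches, limit=5):
    """Count how often each change type appears among matched incidents.

    `matches` is an iterable of (incident_id, change_type) pairs, an assumed
    shape for the historical record. Returns (change_type, count) pairs,
    most frequent first.
    """
    counts = Counter(change_type for _, change_type in matches)
    return counts.most_common(limit)
```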

 

The updated BigPanda AIOps platform is available now.

 



