How IT Operations Teams Can Reduce Outages in the 2020s

Enterprises face a clash of two forces: growing IT complexity and shrinking tolerance for downtime. BigPanda’s Mohan Kompella says IT Ops teams are turning to AIOps, modern AI and machine learning solutions, for richer insights that can reduce or avoid outages.

Tags: ML, AIOps, BigPanda, downtime, outages

Mohan Kompella, VP of Product Marketing, BigPanda

"Implementing an AIOps platform should not be such new technology that it requires disrupting existing tools, processes and workflows."


In the digital era, customers tolerate no downtime. 

That said, keeping systems running with limited interruptions can be quite a challenge.


Organizations may struggle with cognitive overload and an overwhelming number of alerts (some real, some false positives). Combine this volume and complexity with manual processes, and it all conspires to keep many talented IT professionals from solving incidents quickly and efficiently.


The result? Unplanned downtime and disruption. Worse, these disruptions lead to significant losses in revenue. And worst of all, they come at the most critical times -- during periods of increased demand and activity, such as Black Friday for retailers or the Super Bowl for broadcasters and advertisers.


There’s more to these issues than cost, though. Disruptions can also lead to customer dissatisfaction and perhaps even fatal outcomes in this unforgiving business climate.


Increasingly, IT leaders can look to AIOps solutions for answers. These solutions leverage artificial intelligence (AI) and machine learning (ML) to intelligently streamline IT Ops and automate the tasks of identifying, diagnosing and resolving issues. With the intelligence and speed of action of AIOps, it’s also not unusual for companies to prevent disruptions and outages altogether.


Now that we’ve reviewed the benefits that AIOps can bring to both IT and business, the rest of the piece will evaluate what to look for when it comes to adopting AIOps solutions for your business -- and important things you should consider before doing so.  


First, Look to Machine Learning

More data is being generated every day as IT complexity increases due to more environments combining cloud-based applications and services with on-premises data centers. As a result, more skilled IT Ops professionals are needed to process this information to guarantee 24/7 uptime. 


The challenge is that adding teams and training them is costly, time-consuming, and not always possible: There simply aren’t enough qualified professionals to go around.


Therefore, companies need a new level of automation to address the uptime and performance requirements of today’s dynamic applications and services. AIOps supersedes manual processes and rules-based systems to help organizations achieve that level of automation.

Leverage Existing Tools and Apps

Most organizations have likely accumulated many diverse IT Ops tools. These can include various flavors of monitoring tools for all the layers of your stack -- network, infrastructure, applications and servers.


The large amounts of data generated by the many different monitoring tools, and the difficulty of processing that data in real time, often swamp operations teams. As a result, team members might miss critical alerts and incidents. The ultimate result is a greater likelihood of downtime.


The tools an organization already uses (including different helpdesk/service desk and ticketing tools, and a variety of collaboration and notification tools, among others) serve an essential purpose.


Every mid-to-large enterprise that has been in business for 10+ years has a collection of application monitoring tools and MIB-style (SNMP) monitoring tools in place. Most of these tools deliver events (or SNMP traps, if we're referring to network equipment or other legacy applications) that are meaningful to the owners of those applications and devices.


Implementing an AIOps platform should not be such new technology that it requires disrupting your existing tools, processes and workflows. This would be counterproductive – and could even cause more harm than good. 


So, the best approach is to implement an AIOps platform that can easily integrate and work with all existing IT Ops tools. 
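
To make the integration idea concrete, here is a minimal sketch of what "working with existing tools" can look like in practice: events from different tools are mapped into one common shape before anything else happens. Everything in it is an assumption for illustration. The Alert fields, the two payload layouts and the numeric severity codes are made up, and do not reflect any particular vendor's format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    """Hypothetical common shape that every tool's event is mapped into."""
    source: str          # which monitoring tool produced the event
    host: str            # affected host or device
    check: str           # what was being monitored
    severity: str        # "critical", "warning" or "info"
    timestamp: datetime


def from_apm(event: dict) -> Alert:
    """Map a made-up APM payload onto the common Alert shape."""
    return Alert(
        source="apm",
        host=event["hostname"],
        check=event["metric"],
        severity=event["level"].lower(),
        timestamp=datetime.fromtimestamp(event["epoch_seconds"], tz=timezone.utc),
    )


# Made-up mapping from a numeric trap severity code to a label.
TRAP_SEVERITY = {1: "critical", 2: "warning", 3: "info"}


def from_snmp_trap(trap: dict) -> Alert:
    """Map a made-up SNMP trap record onto the common Alert shape."""
    return Alert(
        source="snmp",
        host=trap["agent_address"],
        check=trap["trap_name"],
        severity=TRAP_SEVERITY.get(trap["severity_code"], "info"),
        timestamp=datetime.fromisoformat(trap["received_at"]),
    )


apm_alert = from_apm(
    {"hostname": "web-01", "metric": "latency_p99",
     "level": "CRITICAL", "epoch_seconds": 1700000000}
)
trap_alert = from_snmp_trap(
    {"agent_address": "core-sw-1", "trap_name": "linkDown",
     "severity_code": 2, "received_at": "2023-11-14T22:13:20+00:00"}
)
print(apm_alert)
print(trap_alert)
```

The point of the sketch is that each existing tool keeps emitting what it already emits; only a thin adapter per tool is added, so nothing upstream has to be replaced.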


Use Automation To Get Your Teams on the Same Page 

Today, the challenge for many large enterprises is that, given the proliferation of modern infrastructure and the attendant monitoring tools (often 20-50...or more), IT Ops and NOC teams have to sift through huge alert volumes. Detecting an incident or outage in real-time or near real-time is exceedingly hard - even impossible. 


So, amid this blur of existing data, having an application-specific or device-specific view means very little without the context of the surrounding networks, systems and applications.


This has introduced thousands of human processes. When outages occur, owners of these various monitoring tools often must come together in a situation room or on a bridge call. Such meetings, virtual or in-person, can last tens of hours -- or even days. Teams painstakingly compare their event streams, discuss who changed what, establish innocence (or guilt) and eventually get to the root cause manually, in a grueling, time-consuming process.


AIOps offers a better, modern approach that efficiently shortcuts this difficult route. It takes the painful manual processes out of correlation. Done properly, AIOps is also seamless – and not disruptive.


It won’t require a ‘rip and replace’ of your legacy, homegrown or existing monitoring tools. Neither will it require you to invest in AI-enabled application-specific tools that still fail to solve all the issues you face when systems perform poorly, or outages occur.


Instead, AIOps lets you bring together all these separate bits of data so you can paint a more holistic picture of what’s going on – especially when things go wrong. 


You can run the event feed from each of your monitoring tools (networks, applications, services) through an event correlation engine, which is powered by AI. This approach means the machine sees all of the monitoring data you collect. It takes in data from all your hundreds or thousands of monitoring alerts. Using this data, it then finds what they have in common and creates a correlated, enriched and priority-tagged event.
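
As a rough illustration of the correlation step described above, the sketch below groups alerts that share a service and arrive close together in time, then tags each group with the highest severity it contains. It is a deliberately simple stand-in: a fixed time-window rule takes the place of the AI-powered engine, and the alert fields (ts, service, source, severity) and the 300-second window are assumptions, not any product's actual behavior.

```python
from dataclasses import dataclass, field

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Incident:
    """A correlated group of alerts for one service."""
    service: str
    alerts: list = field(default_factory=list)

    @property
    def priority(self) -> str:
        # The incident inherits the highest severity among its alerts.
        return max((a["severity"] for a in self.alerts), key=SEVERITY_RANK.__getitem__)

    @property
    def sources(self) -> set:
        # Which monitoring tools contributed alerts to this incident.
        return {a["source"] for a in self.alerts}


def correlate(alerts, window_seconds=300):
    """Group alerts by service; start a new incident when the gap since the
    previous alert for that service exceeds window_seconds."""
    incidents = []
    latest_by_service = {}  # service -> most recent incident for that service
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        incident = latest_by_service.get(alert["service"])
        if incident is None or alert["ts"] - incident.alerts[-1]["ts"] > window_seconds:
            incident = Incident(service=alert["service"])
            incidents.append(incident)
            latest_by_service[alert["service"]] = incident
        incident.alerts.append(alert)
    return incidents


# Hypothetical alerts from three different monitoring tools (ts in seconds).
alerts = [
    {"ts": 0, "service": "checkout", "source": "apm", "severity": "warning"},
    {"ts": 40, "service": "checkout", "source": "network", "severity": "critical"},
    {"ts": 90, "service": "search", "source": "infra", "severity": "info"},
    {"ts": 1000, "service": "checkout", "source": "apm", "severity": "warning"},
]

incidents = correlate(alerts)
for incident in incidents:
    print(incident.service, incident.priority, len(incident.alerts), sorted(incident.sources))
```

Here four raw alerts collapse into three incidents, and the first one combines an APM alert with a network alert for the same service. A production engine would learn which alerts belong together rather than rely on a single hand-written rule.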


Finally, the output of AIOps spells out which systems and services are impacted and what caused the incident (either a root cause change or another root cause), and it suggests where to go to fix it.


Proper AIOps solutions should be able to work with your existing tools seamlessly. These include the long-relied-on APM and MIB-based ones. For legacy companies, the idea behind AIOps should be to modernize and future-proof your stack, without disruption.


The best AIOps for your enterprise should work with everything you have – letting you extract the most value from its data while future-proofing your stack. So, regardless of how your tool stack looks two, three or five years from now, you can still rely on the AIOps solution for effective real-time incident management.

Look To How You Measure Metrics and Reports

To achieve a timely incident response, IT Ops professionals must be able to access, view, and share information about KPIs, metrics and trends. The inability to pull together and report on historical IT Ops information from monitoring, ticketing, service desk and other tools leads to wasted time and inefficient incident responses. 


Well-designed IT Ops reporting software can identify infrastructure hotspots that create recurring issues. Such software can also help you understand the value created by different monitoring tools. Finally, because IT Ops, NOC, and DevOps teams’ productivity is essential, IT execs and business owners should utilize such tools to track, measure and improve that productivity.


Ask yourself the following questions when considering your reporting capabilities: 

How do you currently measure the effectiveness of IT Ops, NOC, and DevOps teams in detecting, investigating and resolving incidents and outages?


Which key IT Ops metrics and KPIs do you use to communicate the value created by your IT Ops and NOC teams to IT execs and business owners?
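
As one way to start answering these questions, the sketch below computes two common figures from incident history: mean time to resolve (MTTR) and the host that generates the most incidents. The incident records and their field names (host, detected, resolved) are hypothetical; in practice they would come from your ticketing or service desk tool.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical incident history exported from a ticketing tool.
incident_history = [
    {"host": "db-01", "detected": "2024-01-05T10:00", "resolved": "2024-01-05T11:30"},
    {"host": "web-02", "detected": "2024-01-09T14:00", "resolved": "2024-01-09T14:30"},
    {"host": "db-01", "detected": "2024-01-12T08:00", "resolved": "2024-01-12T09:00"},
]


def minutes_to_resolve(incident: dict) -> float:
    """Minutes elapsed between detection and resolution of one incident."""
    detected = datetime.fromisoformat(incident["detected"])
    resolved = datetime.fromisoformat(incident["resolved"])
    return (resolved - detected).total_seconds() / 60


# Mean time to resolve across all incidents, in minutes.
mttr_minutes = mean(minutes_to_resolve(i) for i in incident_history)

# Hosts ranked by how many incidents they were involved in.
hotspots = Counter(i["host"] for i in incident_history).most_common()

print(f"MTTR: {mttr_minutes:.0f} minutes")
print(f"Top hotspot: {hotspots[0][0]} ({hotspots[0][1]} incidents)")
```

Tracked over time, figures like these give IT execs and business owners a consistent way to see whether incident response is actually improving.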

Rapid Time To Value is a Crucial Metric

In the digital era, speed and agility are of the essence. In the quest to keep systems running, organizations can’t afford the time and expense of an AIOps overhaul that would require costly and lengthy implementation services. Organizations must achieve rapid time to value.


IT Ops leaders and managers know they must improve their ability to respond to incidents, but the choices are difficult. Some solutions require on-premises equipment and the expensive integration help of third parties. 


Other solutions are monolithic, and chances are good that some of their components or modules are substandard, delivering mediocre results overall.


The best approach is to implement AIOps as a (native) SaaS solution that smoothly integrates with existing best-of-breed tools. This ensures there won’t be a need for separate infrastructure investments or excessive external assistance.


The result is a rapid time to value and low TCO (total cost of ownership).


Final Thoughts

IT leaders need to act, but at many organizations, IT operations are mired in the past. Legacy systems are not geared to today’s climate and the requirements of complex hybrid cloud environments. 


That’s why IT leaders should look toward AIOps solutions that leverage AI and ML to intelligently streamline and automate IT Ops. Keep these four keys top of mind when considering which one to choose.


Mohan Kompella is the VP of Product Marketing at BigPanda. He has spent 15 years in IT Operations and about 20 years in Enterprise IT. During this time, he has worked closely with F1000 and Global 5000 enterprises in a variety of engineering, consulting, architecture, pre-sales engineering, product marketing and strategy roles. Most recently, Mohan led Product Marketing for Lightbend.