AIOps: Create a Closed Loop Support System to Streamline IT



In five short years, artificial intelligence for IT operations (AIOps) has grown from a futuristic concept to standard practice for companies that want to get ahead of the traditional IT support troubleshooting model.

AIOps offers a solution for several sources of stress that IT operations (ITOps) teams face today. IT environments are becoming too complex to be operated manually. The breadth of technology that ITOps must adopt is growing exponentially. Computing power is moving outside the data center to the edges of the network, and infrastructure problems must be solved at ever-increasing speeds. Rather than trying to fight these trends, companies are applying automation to the problem, using big data analytics, machine learning, and other AI technologies to help identify and solve problems. Enter AIOps.

AIOps typically has four basic steps: monitoring, analysis, recommendation, and remediation. While monitoring and remediation are important bookends, the middle two stages – analysis and recommendation – are key components that IT support providers must master to execute a successful AIOps strategy.
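The four steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the metric names, thresholds, and suggested actions are hypothetical, the analysis is a plain threshold check standing in for real analytics or machine learning, and "remediation" here just produces a message rather than changing any system.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    system_id: str
    metric: str
    value: float


# Hypothetical thresholds and actions, invented for this sketch.
THRESHOLDS = {"disk_used_pct": 90.0, "cpu_temp_c": 85.0}
ACTIONS = {
    "disk_used_pct": "expand or clean up storage",
    "cpu_temp_c": "check cooling and fan firmware",
}


def monitor(raw):
    """Step 1: turn raw (system, metric, value) tuples into readings."""
    return [Reading(*row) for row in raw]


def analyze(readings):
    """Step 2: keep readings that exceed a known threshold."""
    return [
        r for r in readings
        if r.metric in THRESHOLDS and r.value > THRESHOLDS[r.metric]
    ]


def recommend(findings):
    """Step 3: map each finding to a suggested corrective action."""
    return [(f.system_id, ACTIONS[f.metric]) for f in findings]


def remediate(recommendations):
    """Step 4: stand-in for remediation; only formats a message."""
    return [f"{system_id}: {action}" for system_id, action in recommendations]


def run_loop(raw):
    """Run one pass of monitor -> analyze -> recommend -> remediate."""
    return remediate(recommend(analyze(monitor(raw))))


if __name__ == "__main__":
    sample = [
        ("srv-1", "disk_used_pct", 95.0),
        ("srv-2", "disk_used_pct", 40.0),
    ]
    print(run_loop(sample))
```

In a real closed loop, the outcome of each remediation would feed back into the thresholds and actions, which is where the analysis and recommendation stages earn their importance.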

The objective is to identify emerging problems and deliver corrective actions to customers. This reduces the time between identifying a fault and resolving it, not only for the customers who have the problem, but also for any other customers who may be at risk of the same problem or who have not yet identified it themselves.

However, given the pace of change in a typical ITOps environment, IT support providers need a continuous improvement cycle that can adapt in real time, based on evidence-based experience, to meet customer challenges successfully.

To achieve this, service providers must identify best practices by tracking the adoption of recommendations, identifying gaps and ultimately defining “known good”. With a large customer base to draw on, potential issues can be identified, as well as operational behaviors that improve IT efficiency. These elements then become the basis for preventive recommendations.

Start the process

This cycle of continuous improvement begins at the intersection of product engineering and support, focused through the lens of case management. To be successful, it must identify, prioritize and eliminate issues that require human intervention.

To do this, service providers must have effective telemetry monitoring, dashboards, and data analytics capabilities to track these trends. Strong product engineering, support engineering, and data science teams are needed to analyze telemetry at scale to identify new threats, prioritize them, refine rapid diagnostic capabilities and isolate the causes. AI tools help handle the volume of data and improve accuracy, ultimately enabling predictive identification of issues before they can disrupt the customer. Customers can then receive corrective actions to resolve issues before they cause a significant disruption to their environment.

This begins to describe the components of a continuous improvement cycle. Successful service providers must constantly do three things: monitor the health and performance of their installed base, develop new detection models, and provide recommendations to customers. They must be able to solve the problems of “patient zero”, the original customer who had the problem. Since all customers send telemetry data, providers can use pattern recognition to proactively identify and assist customers who have the same risk profile before these issues impact their ITOps.
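As a rough illustration of matching a “patient zero” risk profile against the rest of the installed base, the sketch below treats a risk profile as a set of configuration attributes that must all match. The attribute names, values, and customer identifiers are hypothetical, and exact attribute matching is a deliberately simple stand-in for the pattern recognition a real detection model would perform.

```python
def at_risk_customers(signature, fleet, patient_zero=None):
    """Return the customers whose telemetry matches every attribute in
    the signature, excluding the customer who first hit the problem."""
    return sorted(
        customer
        for customer, profile in fleet.items()
        if customer != patient_zero
        and all(profile.get(key) == value for key, value in signature.items())
    )


if __name__ == "__main__":
    # Hypothetical signature derived from the original customer's case.
    signature = {"firmware": "2.1.0", "raid_mode": "RAID5"}

    # Hypothetical telemetry snapshot for the installed base.
    fleet = {
        "customer-a": {"firmware": "2.1.0", "raid_mode": "RAID5", "region": "DE"},
        "customer-b": {"firmware": "2.2.0", "raid_mode": "RAID5", "region": "AU"},
        "customer-c": {"firmware": "2.1.0", "raid_mode": "RAID5", "region": "US"},
    }

    print(at_risk_customers(signature, fleet, patient_zero="customer-a"))
```

Customers returned by a check like this are the ones who could receive a proactive recommendation before the issue affects them.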

Simple, common IT problems can happen 80% of the time and cause only 20% of the pain because IT knows how to handle them. These problems are best served by good analysis and automation alone. The benefit of AI is being able to identify and solve complex problems that occur less frequently (say, 20% of the time) but cause 80% of the pain, and that are hard to identify and correct quickly without it.

Turnaround time is a valuable consideration. What took months to diagnose and repair on a large scale can now be done in days, if not hours. For example, if a customer has a problem in Germany based on a specific setup, how long does it take for the organization to identify the problem? How long to confirm that this is a unique problem, and reactively quantify and identify this problem in other environments? How long to proactively apply patches or make recommendations to these environments to mitigate this impact? Last but not least, what does it take to avoid the risk in the first place? The use of large-scale AIOps telemetry provides a method to speed up identification and improve the accuracy of recommendations.

Think global, act local

Using telemetry in this way is a good example of thinking globally and acting locally. You can take advantage of all of these customer experiences, using their equipment and the services of their supplier, and create a bigger picture of what’s going on. You can observe what customers are doing and what issues are happening, and then use the data to support a number of those decisions. The supplier can then prioritize risk across its customers and take targeted measures.

The aim of the approach is to anticipate problems and give customers an overview of potential problems existing in their environment, as well as options to avoid those risks. If problems can be identified preemptively, customers can make informed decisions and control risks.

Much of the information uncovered through an AIOps process can help customers resolve issues directly. When problems can be avoided through changes in usage, preventative recommendations backed by factual reasoning empower IT to drive change. If the resolution requires product enhancements, these can be fed into product lifecycle development to resolve the issue, or at a minimum allow for better identification and prevention.

What does tomorrow look like?

Most of what we’ve discussed here can be viewed on a discrete system-by-system basis. You have a server, it sends its telemetry, and it receives recommendations back. However, commercial success is no longer tied to monolithic systems. Interoperability between multiple systems, virtualization, applications and user experience now define IT. To increase agility at all levels, analysis must occur not only at the discrete system level, but also at the IT environment level. Across the stack, telemetry is needed not only to identify new threats, but also to determine best practices. This is where AIOps becomes increasingly important, as it can operate with a speed and range that an engineering team could never match.

Applying AIOps to groups of machines and, by extension, to groups of systems, and ultimately to an entire customer base, offers multiple perspectives. Smart organizations can correlate this data and apply it to the whole concept of interoperability. Separately, moving up the stack and into the application provides information on how the application actually engages and interfaces with all products. This will enable new ways to optimize applications based not only on how one customer uses them, but also on how entire customer bases are using technology globally.


The path to best practices will be better defined. The use of evidence-based analytics, enabled at scale through AI, will create an opportunity to build resiliency into IT environments. As AIOps continues to mature, the breadth of perspective will create reliable “known good” paths for suppliers and customers. Improvements in tools and data security now mean that the benefits AIOps can bring weigh heavily in favor of streaming machine telemetry data, as many IT issues become “optional.”

Service providers will remove complexity from the customer and make better recommendations to increase predictability and ease of use. The development of successful AI-based solutions often relies on data collection. Service providers who have both access to telemetry data from a large installed product base and the reach of a strong support services organization will have a significant advantage.

Customers can already benefit from being part of a large-scale connected community with predictive AIOps. AIOps has come a long way in five years. Expect it to continue to develop in the years to come.



About Duncan Goode

Duncan Goode is a Global Services Product Manager for HPE Pointnext Services. His goal is to ensure a quality support experience that drives better business results for customers. Duncan has worked in technology and support services for 30 years, providing leadership and innovation in a variety of roles in global support, mission-critical and retail environments. Based in Australia, he enjoys spending time playing and coaching cricket.

About Jordan Lewy

Jordan Lewy is Global Head of HPE Pointnext Support Services. In this role, his goal is to transform the HPE customer support experience using HPE InfoSight, which in turn drives customers' business results and enables their digital transformation journey. Jordan brings to his position well-established experience in information technology and professional services, where he has worked for over 20 years. Prior to his current role, he held other positions at HPE, including leading Storage Support Services, HPE Installation and Technical Services, and the HPE Customer Technical Education business.

Copyright © 2021 IDG Communications, Inc.

