What to Look for in a Predictive IT Analytics Solution
By Stephen Dodson, Ph.D., CTO at Prelert
Keeping mission-critical software and infrastructure up, running, and performing has become increasingly difficult and complex in an IT environment of composite Web applications with scores of dependencies. Today, application performance and uptime depend on hundreds of delicate interactions among Web, application, and database servers, as well as legacy applications, any number of internal and external services, and all the hardware and infrastructure on which they run. When a performance issue occurs, it can be a tremendous challenge to isolate the root cause and repair the problem.
However, a new crop of predictive analytics solutions for IT operations promises not only to speed up root-cause analysis but also to predict performance issues before users even notice them, giving IT the information to correct them before there is any business impact.
Most predictive IT analytics solutions use advanced machine learning and big data analysis techniques to understand and monitor all those delicate interactions and relationships. Then they baseline typical behavior and alert IT only when an actual performance problem is likely.
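To make the baselining idea concrete, here is a minimal sketch of the principle (not any vendor's actual algorithm): learn a metric's normal range from history, then flag only statistically unusual values using a simple z-score test. The metric values are invented for illustration.

```python
from statistics import mean, stdev

def detect_anomalies(history, new_values, threshold=3.0):
    """Flag values that deviate from the learned baseline by more
    than `threshold` standard deviations (a simple z-score test)."""
    mu, sigma = mean(history), stdev(history)
    return [v for v in new_values if abs(v - mu) > threshold * sigma]

# Response times (ms) observed during normal operation
baseline = [120, 115, 130, 125, 118, 122, 128, 119, 124, 121]

# A 400 ms response stands out against the baseline; 126 ms does not
print(detect_anomalies(baseline, [126, 400]))  # -> [400]
```

Production solutions model seasonality, multivariate relationships, and non-Gaussian distributions rather than a single Gaussian per metric, but the underlying principle is the same: learn normal first, then alert only on deviations from it.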
Typical of an early market, however, there are many different solutions that do similar things in different ways. It can be difficult to understand what to look for or which solution will fill your needs best.
The Case for Predictive IT Analytics
As any IT person knows, the volume of log and other monitoring information produced by IT systems today can be overwhelming. Application performance management (APM) tools have helped make some headway in understanding and monitoring performance from an application point of view. However, when a slowdown or outage occurs today, tracing the root cause is still the most frustrating, overwhelming, and time-consuming part of the discovery and repair process. It usually requires assembling "war rooms" of expensive application and infrastructure experts who spend hours manually poring through APM and other management logs to discover and understand the root cause. Frequently the root cause is never found, so IT has to devise time-consuming workarounds to attack the problem every time it comes up again.
A recent survey by TRAC Research found that even with APM solutions, 60 percent of organizations report a success rate of less than half when it comes to preventing performance issues that have an impact on end users. It also found that on average 46.2 hours are spent each month on "war room" scenarios. When you consider the time spent and the lack of success, it's clear that new tools are required.
Aside from APM, there are hundreds of other infrastructure monitoring tools that can each provide information on one component of the Rube Goldberg contraption that makes up typical application functions and transactions. However, they’re not very useful unless all the interactions and relationships among the pieces are fully understood. IT may set up individual KPIs and thresholds, such as CPU utilization percentages, to provide alerts, but KPIs are usually subjective, narrowly focused, and less-than-perfect indicators of application performance issues. In essence, IT suffers from too little and too much performance information at the same time.
A relatively new entry in the Big Data APM market, Splunk Enterprise® provides a great tool for aggregating and indexing all that log information and running Google-like searches that can help isolate root cause. Splunk addresses part of the problem by extracting and indexing usable information from the massive amounts of machine data it scans, but it was not intended to perform predictive analytics on IT data, which leaves a huge opportunity in the market for innovation.
Predictive IT analytics take infrastructure monitoring, APM, and solutions such as Splunk to the next level by using advanced machine learning, pattern recognition algorithms, and big data analysis techniques to speed up root cause identification and take human subjectivity out of the equation.
Most IT folks are aware of predictive analytics tools that use advanced algorithms to understand consumer behavior for marketing purposes, credit scoring, or loan approvals. Predictive IT analytics apply similar techniques to the flood of IT-generated machine data to discover the hidden patterns and relationships that drive application performance. Instead of focusing on one or two subjective KPIs, the best predictive IT analytics solutions automatically discover and analyze all the hidden critical event chains and interactions among servers, applications, middleware, and other services and infrastructure that drive applications. They then baseline normal IT operations and alert IT to changes or anomalies that are likely to affect those interactions and relationships.
The best solutions can discover and address problems and their root causes many times faster than a room full of experts. But aside from their powerful forensic capabilities, effective predictive analytics solutions can also discover and diagnose small, hidden performance abnormalities that users may not notice, so small problems can be corrected before they become big ones that have business impact.
For example, a financial document management application suffered a performance collapse one morning as thousands of North American users were logging in. When the IT team managing the application applied a predictive IT analytics solution to the data gathered before, during, and after the incident, it found that the solution would have discovered the problem the night before the slowdown by detecting a small performance wobble as APAC and EU users logged in. It would have then traced the issue quickly to a database configuration change made a short time before. This means that IT would have had hours to fix the issue before there was any noticeable business impact.
What to Look For
What are some features you should consider when looking for a predictive analytics solution?
A Holistic Approach
This may sound counterintuitive, but a truly effective predictive analytics solution should not focus on what are generally thought to be the most important KPIs and thresholds driving application performance. The best predictive analytics solutions move beyond human expertise and biases to discover, learn, and address all the components, relationships, and interactions, hidden and obvious, that affect application performance.
Mapping to the IT Environment rather than Vice Versa
In the same vein, an effective predictive analytics solution should be flexible enough to learn and adjust itself to your IT environment quickly and painlessly. It shouldn't ask you what to monitor or what thresholds to track; it shouldn't require days, weeks, or months of user configuration; it shouldn't make assumptions based on vendor research and biases; and it shouldn't monitor at preset time intervals. It should harness its machine learning capabilities to monitor your environment continuously in order to understand how things interact at all times and what constitutes normal and abnormal behavior in your particular environment. When the inevitable changes happen, an effective predictive IT analytics solution should understand them and adjust on its own, rather than requiring time-consuming manual user readjustments and reconfigurations. In other words, really intelligent predictive analytics should tell you what to look for, not the other way around.
Emphasis on Relationships and Event Chains, Not Components
You've had enough of endless infrastructure, server, and other component logs and constant false alerts about this or that KPI. An effective predictive IT analytics solution should move beyond all that to understand the actual delicate relationships and event chains important to transaction and application performance. And it should do so across software, hardware, operating systems, network infrastructure, and internal and external services. Once it understands your environment, it should alert you to important changes in the relationships and event chains that really do affect application performance. Alerts should never be based solely on a single KPI.
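The relationship-focused idea above can be sketched in miniature (a hypothetical example; the metric names and numbers are invented): instead of thresholding either metric alone, watch the ratio between two normally coupled metrics and alert when that relationship breaks.

```python
from statistics import mean, stdev

def relationship_alert(pairs, new_pair, threshold=3.0):
    """Alert when the *ratio* between two normally coupled metrics
    (e.g., web requests vs. database queries) deviates from its
    learned baseline, even if neither metric alone looks extreme."""
    ratios = [a / b for a, b in pairs]
    mu, sigma = mean(ratios), stdev(ratios)
    a, b = new_pair
    return abs(a / b - mu) > threshold * sigma

# Historical (requests/sec, db_queries/sec): queries track requests ~2:1
history = [(100, 205), (120, 238), (90, 185), (110, 224), (95, 192)]

# 105 req/s with ~210 queries/s preserves the relationship
print(relationship_alert(history, (105, 210)))  # -> False
# 105 req/s with only 60 queries/s suggests a broken event chain,
# even though neither number would trip a static threshold
print(relationship_alert(history, (105, 60)))   # -> True
```

The point of the sketch is that the second case raises no per-metric alarm at all; only the learned coupling between the two metrics reveals that something in the chain has changed.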
Work to Value Ratio
If a predictive IT analytics solution follows the rules outlined above, then installing, configuring, and maintaining it in order to gain maximum value should require very little time and effort. It should take minutes to install and begin giving you usable information in hours, rather than taking days, weeks, or months to get up and running and provide some value. It should adjust itself to any changes in the IT infrastructure; you don't want a product that may become relatively useless in a year without major effort and reconfiguration. A perfect example is replacing legacy servers that cannot perform well at more than 60 percent utilization with more powerful servers that can perform effectively at 90 percent utilization. You shouldn't have to reconfigure the solution to reflect that new reality.
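The server-upgrade example above can be sketched with a rolling baseline that relearns "normal" on its own (an illustrative toy assuming a simple moving window; commercial products use far richer models):

```python
from collections import deque
from statistics import mean, stdev

class AdaptiveBaseline:
    """Rolling baseline that relearns 'normal' as the environment
    changes, so a hardware upgrade needs no manual reconfiguration.
    Illustrative sketch only."""

    def __init__(self, window=20, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous vs. the current baseline,
        then fold it into the baseline so the model keeps adapting."""
        anomalous = False
        if len(self.history) >= 10:  # need some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.history.append(value)
        return anomalous

# Legacy servers ran around 60% CPU; after an upgrade the new normal
# shifts to ~90%, and the baseline follows it without reconfiguration.
mon = AdaptiveBaseline(window=20)
for v in [60, 61, 59, 62, 60, 58, 61, 60, 59, 62, 61, 60]:
    mon.observe(v)  # learning the old normal
for v in [90, 91, 89, 90, 92, 88, 90, 91, 89, 90, 91, 90]:
    mon.observe(v)  # the shift is flagged once, then absorbed as normal
```

After the second loop, readings near 90 percent no longer raise alerts; the model has mapped itself to the new environment rather than demanding that you remap it.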
Eliminate the Unimportant
Rather than asking you what is important, an effective predictive IT analytics solution should allow you to configure it to filter out data, such as domain names, that you know for sure are not relevant to performance.
Does It Work?
Apply your prospective or existing solution to data from past IT incidents and new incidents that come along. Does it actually reduce the time it takes IT to diagnose and resolve IT incidents and issues? Is it detecting issues and helping to resolve them before they affect application performance? Has it provided insights into your IT environment and infrastructure relationships that you didn't have before? Or does it require so much work to use that whatever value it provides is offset by the time and effort it takes to get and stay there? Is IT saving money and staff time? Are there fewer incidents? Are you still relying on workarounds, or have you found and resolved root causes that you couldn't find before? How much have you reduced your mean time to diagnose?
Most IT shops will find predictive IT analytics very useful in their quest for 24/7 uptime and performance. The return on investment can be quick and dramatic. However, it’s important to choose a predictive IT analytics solution that does more work than it requires for installation, configuration, and maintenance. Choose a solution that does most of the work of understanding your IT environment for you and then gets to work predicting and solving performance issues today and tomorrow.
Stephen Dodson Bio:
Stephen Dodson, Ph.D., is Chief Technical Officer at Prelert, the first company to provide 100% self-learning predictive analytics solutions to address the volumes of data generated by today's IT systems. He was a founding member of the Riversoft engineering team, where he led the design of the topology-driven root-cause analysis technology used today within IBM Tivoli Netcool, HP OpenView, and Cisco management tools. Steve was also on the founding team of Njini, where he served as CTO before the company was acquired by Riverbed in 2008. Previously, Steve worked in the Computational Mechanics group at Imperial College in London, where he made key contributions to the field, resolving scalability issues with a novel approach to solving Maxwell's equations that helped it become a practical technique used today by major companies.
TRAC Research, "Improving the Usability of APM Data: Essential Capabilities and Benefits," April 2012.