
Root cause analysis

Introduction

Anomaly detection (AD) and root cause analysis (RCA) in multivariate time series refer to recognizing abnormalities in data and identifying their root causes. RCA without the anomaly detection underpinning it loses its meaning, just as anomaly detection without RCA to explain it loses much of its explainability and actionability. Detected anomalies can be read as suggestions of where to look in the data - suspicious data points to check - and the corresponding root causes as directions in which to ask questions - potential explanations to analyze. Together, AD and RCA help analysts make critical decisions and focus their limited attention on the most valuable and impactful answers their data can provide at each point in time.

For meaningful RCA, the system should capture relationships between time series, be robust to overfitting and noise, deliver a transparent, explainable model, and return to dispatchers anomaly scores that vary with the severity of each incident. TIM addresses these requirements jointly.

Output

The AD root cause analysis output section is located in the anomaly detection outputs overview.

In the example below, the columns _"Influencer 1", _"Influencer 2", _"Influencer 3", _diff_"Influencer 1", _diff_"Influencer 2" and _diff_"Influencer 3" are shown at the end of the table, representing the root cause analysis output:

| timestamp | model_index | ... | normal_behavior | diff_normal_behavior | _"Influencer 1" | _"Influencer 2" | _"Influencer 3" | _diff_"Influencer 1" | _diff_"Influencer 2" | _diff_"Influencer 3" |
|---|---|---|---|---|---|---|---|---|---|---|
| 2020-10-12T03:00:00.0 | 1 | ... | 10 | 3 | 4 | 3 | 3 | 2 | 1 | 1 |
| 2020-10-12T04:00:00.0 | 1 | ... | 15 | 5 | 6 | 2 | 7 | 2 | -1 | 4 |
| 2020-10-12T05:00:00.0 | 1 | ... | 8 | -7 | 3 | 2 | 3 | -3 | 0 | 0 |
| 2020-10-12T06:00:00.0 | 1 | ... | 12 | 4 | 5 | 4 | 3 | 2 | 2 | 0 |
| 2020-10-12T07:00:00.0 | 1 | ... | 9 | -3 | 2 | 4 | 3 | -3 | 0 | 0 |
| 2020-10-12T08:00:00.0 | 1 | ... | 10 | 1 | 5 | 3 | 2 | 3 | -1 | -1 |

The columns model_index, normal_behavior and diff_normal_behavior are also vital for RCA.

Benefits

RCA brings with it additional information concerning anomalies.

Without RCA, a user can inspect the actual values versus the normal behavior values, the detected anomalies, the influencers and the anomaly indicator(s). There would, however, be no information about the possible reasons for a detected anomaly. The RCA results supply exactly this information.

An example can be found in the picture below:

[Figure: image.png]

It is evident that something unusual happened on May 23rd. The anomaly indicator went above the threshold, the difference between normal behavior and actual values is more prominent than it is for surrounding observations, and the top line chart is marked with red dots signaling an anomaly. What was behind this increase in normal behavior remains unclear, however.

The primary motivation for RCA is thus to provide information about what drives the change in normal behavior, since an unusual difference between the actual value and the expected value is what results in a detected anomaly. Based on this information, a user can judge whether each influencer's contribution is plausible: the output draws attention to the suspicious influencer(s), whose data measurements can then be inspected for correctness. The conclusion may be that there is a problem in the influencer(s), or that the influencers are correct and the problem lies in the value of the KPI. The latter case means the KPI value did not follow the normal behavior value although it should have. Knowing this can help when making critical decisions, such as whether or not to send an inspection to the machine or component that a given KPI characterizes.

To summarize, RCA should bring a user:

  • transparency, explainability and confidence in the results when making critical decisions,
  • a deeper understanding of what drives normal behavior and changes in normal behavior,
  • the ability to explore the possible reason behind a detected anomaly, and
  • the confidence to make the final decision about anomaly candidates based on further analysis.

Interpreting root cause analysis results

Drivers of normal behavior

The RCA output reveals the involvement of each influencer in the normal behavior value for a given data point. This is a straightforward way to figure out what drives the normal behavior value.

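For a given timestamp t, the sum of the influencers equals the normal behavior value. With n influencers:

$$\mathrm{normal\_behavior}(t) = \sum_{i=1}^{n} \mathrm{Influencer}_i(t)$$

For clarity, the data point at 2020-10-12T08:00:00.0 from the output table earlier in this section is taken as an example:

$$\mathrm{normal\_behavior}(t) = \mathrm{Influencer}_1(t) + \mathrm{Influencer}_2(t) + \mathrm{Influencer}_3(t)$$

Substituting variables with their corresponding values amounts to:

$$10 = 5 + 3 + 2$$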
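The additive relationship between the contributions and the normal behavior value can also be checked mechanically. The Python sketch below copies the normal_behavior and influencer contribution columns from the output table earlier in this section and verifies the sum for every row. The dictionary keys and the helper name contributions_match are illustrative choices for this sketch, not part of the TIM output or API.

```python
# Values copied from the output table; only the columns needed here are kept.
# The key "influencers" holds the _"Influencer 1".._"Influencer 3" columns
# (the key name itself is a hypothetical choice for this sketch).
rows = [
    {"timestamp": "2020-10-12T03:00:00.0", "normal_behavior": 10, "influencers": [4, 3, 3]},
    {"timestamp": "2020-10-12T04:00:00.0", "normal_behavior": 15, "influencers": [6, 2, 7]},
    {"timestamp": "2020-10-12T05:00:00.0", "normal_behavior": 8, "influencers": [3, 2, 3]},
    {"timestamp": "2020-10-12T06:00:00.0", "normal_behavior": 12, "influencers": [5, 4, 3]},
    {"timestamp": "2020-10-12T07:00:00.0", "normal_behavior": 9, "influencers": [2, 4, 3]},
    {"timestamp": "2020-10-12T08:00:00.0", "normal_behavior": 10, "influencers": [5, 3, 2]},
]


def contributions_match(row, tol=1e-9):
    """Return True when the influencer contributions add up to normal_behavior."""
    return abs(sum(row["influencers"]) - row["normal_behavior"]) <= tol


for row in rows:
    total = sum(row["influencers"])
    print(row["timestamp"], total, contributions_match(row))
```

A tolerance is used because real outputs are typically floating-point values, where an exact equality test could fail on rounding alone.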

Drivers of normal behavior change

The RCA output also reveals how much the change in each influencer's contribution accounts for the change in the normal behavior value for a given data point. This is a straightforward way to figure out what drives the normal behavior change.

Visualizing the most recent data points before the data point of interest (typically an anomalous one) can give an idea of the usual contributions, which helps to assess what could be wrong in the analyzed point.

NOTE: Differences are calculated only between outputs generated by the same model (identified by the model_index); the visualization of recent points therefore makes sense only for points obtained by the same model. Also, for the best root cause analysis, it is recommended to visualize non-anomalous points, with the possible exception of the analyzed point itself.
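One way to put this advice into practice is to compare the contributions of the analyzed point with the average contributions of recent non-anomalous points produced by the same model. The Python sketch below does this for the rows of the output table earlier in this section. The is_anomaly flags, the window size and the helper name baseline_deviation are assumptions made for illustration: the table does not show an anomaly flag, and only the 08:00 point is assumed to be anomalous here.

```python
from statistics import mean

# (timestamp, model_index, is_anomaly, contributions of Influencer 1..3)
# Contributions and model_index are copied from the output table;
# the is_anomaly flags are hypothetical.
points = [
    ("2020-10-12T03:00:00.0", 1, False, [4, 3, 3]),
    ("2020-10-12T04:00:00.0", 1, False, [6, 2, 7]),
    ("2020-10-12T05:00:00.0", 1, False, [3, 2, 3]),
    ("2020-10-12T06:00:00.0", 1, False, [5, 4, 3]),
    ("2020-10-12T07:00:00.0", 1, False, [2, 4, 3]),
    ("2020-10-12T08:00:00.0", 1, True, [5, 3, 2]),
]


def baseline_deviation(points, analyzed_index, window=5):
    """Deviation of the analyzed point's contributions from the recent baseline.

    The baseline is the per-influencer mean over up to `window` preceding
    points that share the analyzed point's model_index and are not anomalous.
    Returns None when no such points exist.
    """
    _, model, _, contributions = points[analyzed_index]
    recent = [
        p[3]
        for p in points[:analyzed_index]
        if p[1] == model and not p[2]
    ][-window:]
    if not recent:
        return None
    typical = [mean(column) for column in zip(*recent)]
    return [c - t for c, t in zip(contributions, typical)]


deviation = baseline_deviation(points, analyzed_index=5)
for i, d in enumerate(deviation, start=1):
    print(f"Influencer {i}: {d:+.1f} versus recent baseline")
```

An influencer whose contribution deviates strongly from its recent baseline is a natural first candidate for inspection. The mean is only one possible baseline; a median or a plotted history would serve the same purpose.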

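For a given timestamp t, the sum of the influencers' changes equals the normal behavior change. With n influencers:

$$\mathrm{diff\_normal\_behavior}(t) = \sum_{i=1}^{n} \mathrm{diff\_Influencer}_i(t)$$

For clarity, one of the data points from the output table earlier in this section is taken as an example, supposing there is an anomaly on 2020-10-12 at 08:00:

$$\mathrm{diff\_normal\_behavior}(t) = \mathrm{diff\_Influencer}_1(t) + \mathrm{diff\_Influencer}_2(t) + \mathrm{diff\_Influencer}_3(t)$$

Substituting variables with their corresponding values amounts to:

$$1 = 3 + (-1) + (-1)$$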
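The same decomposition can be sketched in Python. The example below recomputes the changes between consecutive rows of the output table and ranks the influencers by the size of their change at 08:00. It assumes that each change is the current value minus the value of the previous data point from the same model, which is consistent with the rows used here; the helper names changes and rank_by_change are hypothetical.

```python
# Rows copied from the output table:
# (timestamp, model_index, normal_behavior, contributions of Influencer 1..3)
rows = [
    ("2020-10-12T06:00:00.0", 1, 12, [5, 4, 3]),
    ("2020-10-12T07:00:00.0", 1, 9, [2, 4, 3]),
    ("2020-10-12T08:00:00.0", 1, 10, [5, 3, 2]),
]


def changes(prev, curr):
    """Return (normal behavior change, per-influencer changes) for two rows.

    Returns None when the rows come from different models, because
    differences are only calculated within one model_index.
    """
    if prev[1] != curr[1]:
        return None
    diff_nb = curr[2] - prev[2]
    diff_influencers = [c - p for p, c in zip(prev[3], curr[3])]
    return diff_nb, diff_influencers


def rank_by_change(diff_influencers):
    """Order influencers (1-based index) by the absolute size of their change."""
    indexed = enumerate(diff_influencers, start=1)
    return sorted(indexed, key=lambda pair: abs(pair[1]), reverse=True)


# Analyze the 08:00 point against the preceding 07:00 point.
diff_nb, diff_influencers = changes(rows[1], rows[2])
assert sum(diff_influencers) == diff_nb  # 3 + (-1) + (-1) == 1
for index, change in rank_by_change(diff_influencers):
    print(f"Influencer {index}: {change:+d}")
```

For the 08:00 point, Influencer 1 shows the largest change (+3) while the other two partly offset it, so Influencer 1 would be the first one to inspect.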