Best practices

This section is a guide to using TIM for anomaly detection. It should help you get the best possible results and run your experiments smoothly, in the setup most appropriate for your problem. Results obtained this way are suitable for fair evaluation and give you confidence in what TIM can offer.

The following sections describe what can affect the quality of the results.

Data

Data preparation

Data are fundamental to machine learning, so they are the first thing to take care of. TIM for AD accepts time series data, that is, any numerical data recorded over time in sequential order. Include all variables (columns) that could help to solve your anomaly detection problem, and enough points (rows) to allow the model to learn what is normal. What counts as enough differs from domain to domain and depends on the granularity of the data. We recommend preparing as long a history as you have; it lets you run different experiments before picking the right model.

Selection of KPI and influencers

Why KPI must be selected

TIM for AD requires you to select a KPI. This section explains the reason, the advantages, and the problems for which this approach is a good fit. The reason is that time series data are dependency-oriented data with an assumption of temporal continuity (read more in Time Series Anomaly Detection). In short, there are often causal relationships between variables (columns), so you cannot treat them independently. You have to select one or more KPIs (depending on the complexity of your problem) and assign influencers that can describe the normal behavior of a given KPI. The question is then whether a given KPI behaves normally under the given circumstances (influencers). As a simple example, the energy produced by a wind turbine depends on wind speed; there is a causal relationship. The goal is to find out whether the produced energy is appropriate for the given wind speed. A scatter plot of the data looks as follows:

image.png

To keep it simple, we ask whether the green point is an anomaly or not. Let's look at how the two approaches, with and without a defined KPI, cope with this problem.

  1. The approach with a defined KPI describes the problem as follows:
    image.png
    It allows TIM to learn the real relationship between produced energy and wind speed. TIM is able to detect that the relationship is nonlinear: wind speed can reach its highest value ever without any further increase in produced energy.
    image.png
    This approach is applicable to multidimensional data with relationships between columns.

  2. The approach without a defined KPI describes the problem as follows:
    image.png
    In this case, produced energy and wind speed are treated independently (without asking how they affect each other). The problem is viewed from a distribution perspective.
    image.png
    This approach is applicable to multidimensional data where the columns do not affect each other. As an example, imagine a wind farm with 10 wind turbines of the same type, with the energy production of each recorded as a separate column. As they are located in the same place, we can assume that their energy production should be about the same. For such a case, the approach without a defined KPI would be the right one.
    image.png
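
The difference between the two approaches can be illustrated with a small, self-contained sketch. It is not how TIM works internally: the power curve, the nearest-neighbour estimate of normal behavior, and the z-score threshold of 3 are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical training data: a saturating power curve with small alternating noise.
speeds = [i * 0.5 for i in range(21)]  # wind speed 0.0 .. 10.0
energies = [min(100.0, 4.0 * s ** 2) + (2.0 if i % 2 else -2.0)
            for i, s in enumerate(speeds)]


def z_score(value, sample):
    """Distance of value from the sample mean, in standard deviations."""
    return (value - mean(sample)) / pstdev(sample)


def expected_energy(speed, k=3):
    """Estimate normal energy at a speed from the k nearest training speeds."""
    nearest = sorted(zip(speeds, energies), key=lambda p: abs(p[0] - speed))[:k]
    return mean(e for _, e in nearest)


# Residuals of the training data around the estimated normal behavior.
residuals = [e - expected_energy(s) for s, e in zip(speeds, energies)]


def anomalous_without_kpi(speed, energy, threshold=3.0):
    """Each column is judged only against its own distribution."""
    return (abs(z_score(speed, speeds)) > threshold
            or abs(z_score(energy, energies)) > threshold)


def anomalous_with_kpi(speed, energy, threshold=3.0):
    """Energy (KPI) is judged against what is normal for this wind speed."""
    residual = energy - expected_energy(speed)
    return abs(residual / pstdev(residuals)) > threshold


# A point like the green one: high output at low wind speed.
point = (2.0, 90.0)
print(anomalous_without_kpi(*point), anomalous_with_kpi(*point))
```

Both coordinates of the point lie inside the range seen in the training data, so the view without a KPI does not flag it. Only the view that conditions energy on wind speed sees that the combination is unusual.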

How to select KPI

The KPI should be the variable (column) on which you want to perform anomaly detection; it represents the output of a process (for instance, of a machine component). For a complex system, we recommend selecting more than one KPI, because the process is complicated. This lets you pinpoint the location of a problem more accurately (read more in Design of experiment).

image.png

How to select influencers

Influencers are chosen based on the selected KPI. They add context to the KPI, which is vital for finding reasonable anomalies: without the circumstances provided by influencers, it cannot be responsibly decided whether a KPI value is normal or abnormal. Even though TIM can choose influencers automatically, the best option is to add only influencers that causally affect the KPI. Correlation and causality cannot be told apart from data alone, so a domain expert is needed.

image.png

Building period and its length

The quality of your model also depends on which data points enter model building, and how many. To build a good model, it is vital to include enough data to portray what is normal. How much is enough differs from domain to domain and depends on the granularity, pattern, and stationarity of the data. The rule "the more data, the better the model" does not hold in general. If you know that a subset of your data describes normal behavior sufficiently, there is no need to include all of your data in model building. On the contrary, including everything makes model building take longer and may describe normal behavior worse, especially when normal behavior changes over time.

image.png

Furthermore, if you are aware of corrupted data, do not include them in model building; otherwise, the model may be seriously affected.

image.png

Number of anomalies in building period

Based on the number of anomalies, the first decision to make is whether anomaly detection is the right approach for your problem at all (see the table).

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples | Large number of positive and negative examples |
| Large number of negative examples | Anomalous and normal classes are balanced (say at least 1:5) |
| Many different types of anomalies | Enough positive examples for algorithm to get a sense of what anomalies are like |
| Future anomalies may look totally different than the anomalous examples we have seen so far | Future positive examples similar to ones in building data |

Of course, in most cases there are no labels, that is, no information about whether a point is abnormal (unsupervised anomaly detection); even then, the percentage of anomalies should not exceed 5%. If you know which points are normal (semi-supervised and supervised anomaly detection), the best option is to include only these points in model building, so that the model is not affected by abnormal points.
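
When labels are available, both checks are easy to script before building a model. The following is a minimal sketch; the data, the label encoding (True = anomaly), and the helper names are hypothetical and not part of TIM.

```python
def anomaly_share(labels):
    """Fraction of points labelled as anomalous (True = anomaly)."""
    return sum(labels) / len(labels)


def select_building_rows(rows, labels):
    """Keep only rows known to be normal for model building."""
    return [row for row, is_anomaly in zip(rows, labels) if not is_anomaly]


rows = list(range(100))                      # placeholder data points
labels = [i in (10, 55, 80) for i in rows]   # three labelled anomalies

share = anomaly_share(labels)
print(f"{share:.0%} anomalies, within the 5% guideline: {share <= 0.05}")

building_rows = select_building_rows(rows, labels)
print(len(building_rows), "rows left for model building")
```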

Quality of data

Missing data

The model is built only from data points (rows) with no missing values. In addition, it is built on blocks of regularly recorded data, which ensures that the offsets of variables always represent the same delay. Whether a model is affected by missing values depends on the percentage of points with missing values and on how they are distributed in the data. In general, if fewer than 10% of the points included in model building have missing values, the quality of the model should not suffer. However, if every tenth data point had a missing value, TIM would not even be able to build a model, as there would be no sufficiently long block of regularly recorded data. If missing values significantly affect the quality of the model, there are three ways to solve the problem:

  • Consider filling in the missing data based on its characteristics (for instance, substitute data points from an earlier period).
  • Use the API to define imputation (read more in Imputation):
    "normalBehaviorModel": {
      "imputation": {
        "type": "Linear",
        "maxLength": 1
      }
    }
  • Set the API parameter “allowOffsets” to false (read more in allowOffsets). The model then no longer has to be built on blocks of regularly recorded data:
    "normalBehaviorModel": {
      "allowOffsets": false
    }
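
To see why the distribution of missing values matters as much as their percentage, you can inspect your data before building a model. The sketch below assumes rows are already regularly sampled and that missing values are represented as None; the helper names and both data sets are hypothetical.

```python
def missing_share(rows):
    """Fraction of rows that contain at least one missing value (None)."""
    return sum(any(v is None for v in row) for row in rows) / len(rows)


def longest_complete_block(rows):
    """Length of the longest run of consecutive rows without missing values."""
    longest = current = 0
    for row in rows:
        current = 0 if any(v is None for v in row) else current + 1
        longest = max(longest, current)
    return longest


# Data set A: 100 rows, every tenth row has a missing value.
scattered = [(float(i), None if i % 10 == 9 else 1.0) for i in range(100)]
# Data set B: 100 rows, only the first ten rows have a missing value.
clustered = [(float(i), None if i < 10 else 1.0) for i in range(100)]

for name, rows in (("scattered", scattered), ("clustered", clustered)):
    print(name, missing_share(rows), longest_complete_block(rows))
```

Both data sets have the same share of incomplete rows, but the scattered one never offers more than nine consecutive complete rows, while the clustered one offers ninety.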

Irregularly recorded data

Suppose you have event-based data (the difference between two consecutive timestamps is arbitrary). In that case, the model will either not be built or be of low quality, because by default TIM expects regularly recorded data.
You can cope with that by:

  • Aggregating the data by a unit of time, for example per minute, hour, or day. After aggregation, you should have regularly recorded data.
  • Setting the API parameter “allowOffsets” to false (read more in allowOffsets). The model then no longer has to be built on blocks of regularly recorded data:
    "normalBehaviorModel": {
      "allowOffsets": false
    }
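
Aggregation by a unit of time can be done before the data are uploaded. The following sketch averages event values into one-minute buckets; the function name, the choice of the mean as the aggregate, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_per_minute(events):
    """Average event values into one-minute buckets keyed by the minute start."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[timestamp.replace(second=0, microsecond=0)].append(value)
    return {minute: sum(values) / len(values)
            for minute, values in sorted(buckets.items())}


events = [
    (datetime(2021, 1, 1, 12, 0, 7), 10.0),
    (datetime(2021, 1, 1, 12, 0, 41), 14.0),
    (datetime(2021, 1, 1, 12, 1, 3), 20.0),
    (datetime(2021, 1, 1, 12, 2, 59), 30.0),
]

regular = aggregate_per_minute(events)
for minute, value in regular.items():
    print(minute.isoformat(), value)
```

Minutes in which no event occurred produce no bucket, so they still appear as gaps afterwards and fall under the missing data considerations above.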

Detection perspectives

The perspective from which you look at anomalies is another way to adapt anomaly detection to your preferences. It tells the model what kind of behavior should be considered anomalous. Would you like to be alerted only when there is a large deviation between actual and normal behavior?

image.png

Or would you prefer to be warned when a deviation lasts for a longer time?

image.png

It is up to you; you can select one or more perspectives according to your preference (read more in Detection perspectives).
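
The two questions above can be made concrete with a toy sketch that works on residuals (actual minus normal behavior). It only illustrates the idea of looking at the same deviations from different perspectives; the function names, thresholds, and window length are hypothetical and do not reproduce TIM's perspectives.

```python
def large_deviation_flags(residuals, threshold):
    """Flag every point whose deviation from normal behavior is large on its own."""
    return [abs(r) > threshold for r in residuals]


def lasting_deviation_flags(residuals, threshold, window):
    """Flag a point when it and the preceding window - 1 points all deviate."""
    flags = []
    for i in range(len(residuals)):
        recent = residuals[max(0, i - window + 1): i + 1]
        flags.append(len(recent) == window
                     and all(abs(r) > threshold for r in recent))
    return flags


# One short spike (index 2) and one moderate deviation lasting four points (5-8).
residuals = [0.1, -0.2, 9.0, 0.0, 0.1, 2.0, 2.1, 1.9, 2.2, 0.0]

spikes = large_deviation_flags(residuals, threshold=5.0)
lasting = lasting_deviation_flags(residuals, threshold=1.5, window=4)

print([i for i, flag in enumerate(spikes) if flag])
print([i for i, flag in enumerate(lasting) if flag])
```

The first perspective reacts only to the spike, and the second only once the moderate deviation has persisted for the whole window.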

Sensitivity

This parameter is used when building a model. TIM can find a reasonable sensitivity automatically, but it is also customizable, so you can fine-tune it to potential anomalies based on your business risk profile. It defines the decision boundary separating anomalies from normal points. Each perspective has its own sensitivity (learn more in Sensitivity).
This section provides some tips on how to adjust sensitivity in specific situations, depending on label availability.

Unsupervised AD

The building data have no labeled instances; the implicit assumption is that normal instances are far more frequent than anomalies (at most 5% anomalies). We recommend letting TIM find the sensitivity automatically and analyzing the anomalies it reports. You can then still fine-tune this parameter so that it is optimal for your problem.

Semi-supervised AD

Building data has labeled instances only for the normal class.
We recommend:

  1. If the number and distribution of normal-class instances (no anomalies) allow a reasonable model to be built (see Quality of data -> Missing data), set sensitivity to zero and build your model using normal-class data only. This ensures that your model is not affected by any anomaly and that sensitivity is chosen optimally.

  2. If not, build the model using all data and follow the recommendation for unsupervised AD.

Supervised AD

Building data has labeled instances for normal as well as anomaly classes.
We recommend:

  1. Follow step 1 from Semi-supervised AD.

  2. If that is not possible, build the model using all data and set sensitivity to the percentage of anomalies in all data included in model building.
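
The recommendations for the three label situations can be summarized in a small decision helper. This is a sketch under stated assumptions: the function and parameter names are hypothetical, sensitivity is assumed to be expressed in percent, and a return value of None stands for leaving the choice to TIM.

```python
def choose_sensitivity(anomaly_labels=None, can_build_on_normal_only=False):
    """Return a sensitivity in percent, or None to let TIM choose automatically.

    anomaly_labels: optional list of booleans (True = anomaly) covering all
        points included in model building; only available in the supervised case.
    can_build_on_normal_only: True when the known-normal points alone allow
        a reasonable model to be built.
    """
    if can_build_on_normal_only:
        # Semi-supervised or supervised, step 1: build on normal points only.
        return 0.0
    if anomaly_labels is None:
        # Unsupervised, or the semi-supervised fallback: automatic sensitivity.
        return None
    # Supervised fallback: percentage of anomalies in the building data.
    return 100.0 * sum(anomaly_labels) / len(anomaly_labels)


labels = [False] * 196 + [True] * 4
print(choose_sensitivity())                               # unsupervised
print(choose_sensitivity(can_build_on_normal_only=True))  # normal data only
print(choose_sensitivity(anomaly_labels=labels))          # supervised fallback
```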

Math settings

TIM automates the mathematical side of the anomaly detection process. This covers feature expansion, feature reduction, selection of the normal behavior model and the anomalous behavior model together with their parameters, and creation of the anomaly indicator. Some of these parameters are available in the API and are set to reasonable values automatically. If you are interested in more detail and would like to influence the results by adjusting them, read Configuration.