(Notes from Outlier Analysis – Charu C Aggarwal – Chap 1.1)
An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. (This is Hawkins's definition, as quoted in the book.)
Data is created by generating processes, which reflect either activity in a system or observations collected by monitoring it. When a generating process behaves unusually, it creates outliers. An outlier therefore often carries useful information about abnormal characteristics of the systems and entities that affect the data generation process. In most applications, the data has a “normal” model, and anomalies are deviations from it. Often, an outlier corresponds to a sequence of multiple data points rather than to an individual data point.
The output of outlier detection algorithms can be of two types:
- Outlier Scores: Quantify the level of “outlierness” of each data point. The scores can be used to rank data points by their outlier tendency.
- Binary Labels: Indicate whether a data point is an outlier or not. Scores can be converted to binary labels by imposing a threshold: typically, a score above the specified threshold becomes a “Yes” label.
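The two output types can be illustrated with a minimal sketch. It assumes a z-score (absolute distance from the mean in standard deviations) as the outlier score. That choice, the function names, the sample readings, and the threshold of 2.0 are all illustrative assumptions, not something taken from the book.

```python
from statistics import mean, stdev


def zscore_outlier_scores(values):
    """Score each value by its absolute distance from the mean,
    measured in sample standard deviations (an assumed scoring rule)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values identical: nothing deviates, so every score is zero.
        return [0.0 for _ in values]
    return [abs(v - mu) / sigma for v in values]


def to_binary_labels(scores, threshold):
    """Convert scores to binary labels: True means 'outlier'."""
    return [s > threshold for s in scores]


# Hypothetical sensor readings; the last one is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]

scores = zscore_outlier_scores(readings)

# Output type 1: rank indices from most to least outlying.
ranking = sorted(range(len(readings)), key=lambda i: scores[i], reverse=True)

# Output type 2: binary labels from an assumed threshold.
labels = to_binary_labels(scores, threshold=2.0)

print(ranking[0], labels)
```

The ranking keeps more information than the labels: the threshold throws away how far above or below the cut each point was.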
Outliers vs Anomalies
In real applications, the data may be embedded in a significant amount of noise, and such noise may not be of interest to the analyst. “Outlier” and “anomaly” are generally used interchangeably, but there is one subtle distinction: an “outlier” is a data point that could be considered either an abnormality or noise, whereas an “anomaly” is a special kind of outlier that is of interest to an analyst.
Noise: Modeled as a weak form of outlier that does not always meet the strong criteria necessary for a data point to be considered interesting or anomalous enough.
Every data point lies on a continuous spectrum from normal data to noise, and finally to anomalies. The boundary between noise and anomalies is not sharp and is often chosen on an ad hoc basis. Anomalies typically have a higher outlier score than noise, but in general it is the interest of the analyst that determines the difference between noise and an anomaly.
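As a rough sketch of this spectrum, two cut points on the outlier score can split it into normal data, noise, and anomalies. The function name `categorize` and the threshold values 1.0 and 3.0 are hypothetical. In practice the cut points are exactly the ad hoc, analyst-driven choice described above.

```python
def categorize(score, noise_threshold, anomaly_threshold):
    """Place a score on the normal -> noise -> anomaly spectrum.

    Both thresholds are assumptions supplied by the analyst; nothing
    in the data itself fixes where noise ends and anomalies begin.
    """
    if noise_threshold > anomaly_threshold:
        raise ValueError("noise_threshold must not exceed anomaly_threshold")
    if score >= anomaly_threshold:
        return "anomaly"
    if score >= noise_threshold:
        return "noise"
    return "normal"


# Hypothetical outlier scores for five data points.
scores = [0.2, 1.4, 3.7, 0.9, 2.1]
categories = [categorize(s, noise_threshold=1.0, anomaly_threshold=3.0) for s in scores]
print(categories)
```

Moving `anomaly_threshold` up or down relabels points between noise and anomaly without changing their scores, which is why the separation depends on the analyst rather than on the score alone.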
So the best way to find anomalies and distinguish them from noise is to use feedback from previously known outlier examples of interest. Supervised outlier detection techniques are typically much more effective at steering the search towards the relevant outliers. Anomalies need to be unusual in an interesting way, and the supervision process redefines what one might find interesting.
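One very simple way to use such feedback, sketched here under assumptions, is to let labeled examples choose the score threshold. The code below picks the threshold with the best F1 score on a handful of points that an analyst has marked as interesting or not. This is a deliberately small stand-in for supervised detection, not a method from the book. The scores, the labels, and the function names are hypothetical.

```python
def f1_at_threshold(scores, is_interesting, threshold):
    """F1 of the rule 'score > threshold' against the analyst's labels."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, is_interesting))
    fp = sum(p and not y for p, y in zip(predicted, is_interesting))
    fn = sum(not p and y for p, y in zip(predicted, is_interesting))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(scores, is_interesting):
    """Try each observed score as a threshold and keep the best one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, is_interesting, t))


# Hypothetical outlier scores and analyst feedback (True = of interest).
scores = [0.3, 0.8, 1.9, 2.4, 3.1, 3.6]
is_interesting = [False, False, False, False, True, True]

best_threshold = pick_threshold(scores, is_interesting)
print(best_threshold)
```

Here the point scoring 2.4 is unusual but was not marked as interesting, so the chosen threshold treats it as noise. Different feedback would move the cut.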
Generally, unsupervised methods can be used either for noise removal or anomaly detection, and supervised methods are designed for application-specific anomaly detection. Unsupervised methods are often used in an exploratory setting, where the discovered outliers are provided to the analyst for further examination of their application-specific importance. The level of supervision in practical scenarios depends on how many examples of normal and anomalous data are available.