Outlier Analysis: The Data model is everything

(Notes from Outlier Analysis, Chapter 1, by Charu C. Aggarwal)

All outlier detection algorithms generally follow this approach:

  • Create a model of normal patterns in the data
  • For a given data point, compute an outlier score based on its deviation from this pattern. This is done by evaluating the quality of the fit between the data point and the model.

So clearly the choice of the data model is crucial. Unfortunately, outlier detection is largely an unsupervised problem in which examples of outliers are not available for learning the best model. The choice of model is therefore dictated by the analyst's understanding of the kinds of deviations relevant to the application.

Z-value test

For 1-dimensional data, the Z-value measures the number of standard deviations by which a data point lies from the mean, i.e. z_i = |x_i − μ| / σ, and can be used as a proxy for the outlier score of the point. If the mean and standard deviation can be estimated accurately, a Z-value greater than 3 is a good rule of thumb for flagging anomalies. The Z-value test implicitly assumes a normal distribution for the underlying data; if the data does not follow a normal distribution, the test loses much of its interpretability.
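
As a concrete illustration, here is a minimal sketch of the Z-value test in Python with NumPy; the threshold of 3 is the rule of thumb above, and the data is made up for illustration:

```python
import numpy as np

def z_scores(x):
    """Number of standard deviations each point lies from the sample mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) / x.std()

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=10.0, scale=1.0, size=500)  # roughly normal "background" data
data = np.concatenate([normal_data, [25.0]])             # inject one obvious anomaly
scores = z_scores(data)
print(data[scores > 3])  # points flagged by the rule of thumb; the injected 25.0 should be among them
```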

How to choose the data model

Mistakes made at the modeling stage can lead to an incorrect understanding of the data. Also, tests like the Z-value test above may be appropriate in some settings but not in others. The effectiveness of a model therefore depends both on the choice of the test and on how it is applied.

Choosing the best model requires an understanding of the underlying data. Therefore,

  • Make assumptions about the structure of the normal patterns in the data set.
  • The choice of “normal” depends heavily on the analyst’s understanding of natural data patterns in that particular domain.

There are many trade-offs associated with model choice:

  • Complex models with many parameters: May overfit the data and may fit outliers also.
  • A simple model constructed with an intuitive understanding of the data (and analyst input) may lead to better results, but an oversimplified model that fits the data poorly may declare normal patterns to be outliers.

The initial stage of selecting the data model is perhaps the most crucial one in outlier analysis.

Connections with Supervised Models

We can treat outlier detection as a variation of the classification problem in which the class label (“normal” or “anomaly”) is unobserved. Pretend that the entire data set contains the normal class and create a model of the normal data (a one-class model); deviations from this model are treated as outlier scores. Many classification methods therefore generalize to outlier detection. One-class models are trickier than multi-class models because it is easier to distinguish between examples of two classes than to decide whether a particular instance matches a single class.
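
To illustrate the one-class idea, here is a minimal sketch using scikit-learn's OneClassSVM as one possible one-class model; the data and the nu/gamma settings are arbitrary choices for illustration, not from the book:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),  # bulk of the data, treated as "normal"
               [[6.0, 6.0]]])                        # one point far from the bulk
model = OneClassSVM(nu=0.05, gamma="scale").fit(X)   # fit the one-class model on the full data set
outlier_scores = -model.decision_function(X)         # larger score = larger deviation from the model
print(np.argmax(outlier_scores))                     # the isolated point (index 300) should score highest
```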

Instance-based learning methods

  • A training model is not constructed up-front.
  • For a given test instance, find the most relevant (closest) instances in the training data set and make a prediction for the test instance from them.
  • Eg: Use the k-nearest-neighbor distance as the outlier score (see the sketch after this list).
  • Extremely popular – simple, effective and intuitive.
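
A minimal sketch of the k-nearest-neighbor idea, scoring each point by the distance to its k-th nearest neighbor; the value of k and the data are arbitrary illustration choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbor (excluding itself)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because every point is its own nearest neighbor
    distances, _ = nn.kneighbors(X)                  # distances are sorted in ascending order
    return distances[:, -1]                          # distance to the k-th true neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),  # dense "normal" cluster
               [[8.0, 8.0]]])                        # one isolated point
scores = knn_outlier_scores(X, k=5)
print(np.argmax(scores))                             # the isolated point (index 200) should score highest
```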

Explicit Generalization Methods

In principle, almost any classification method can be redesigned to create a one-class analog.

  • Create a one-class model of the normal data set. The model represents an explicit generalization of the data set.
  • Score each data point based on its deviation from this model of normal data.
  • Problem: the same data set is used for both training and testing, and specific test points are hard to exclude because labels are generally unavailable in unsupervised problems. This can cause overfitting.
  • Effective approach to reduce overfitting: repeatedly partition the data at random into training and test subsets and average each point's outlier score over the resulting models (see the sketch after this list).
  • Eg: Principal Component Analysis (PCA), clustering, SVMs, etc. can be generalized for outlier analysis.
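
A minimal sketch of the partition-and-average idea, using PCA reconstruction error as the one-class model; the number of rounds, the number of components, and the synthetic data are arbitrary illustration choices:

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_error(model, X):
    """Squared reconstruction error of each row under a fitted PCA model."""
    X_hat = model.inverse_transform(model.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
t = rng.normal(size=(300, 1))
X = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(300, 3))  # data near a 1-D subspace
X = np.vstack([X, [[0.0, 5.0, 5.0]]])                                  # one off-subspace point

n, rounds = X.shape[0], 20
score_sum, score_cnt = np.zeros(n), np.zeros(n)
for _ in range(rounds):
    train = rng.random(n) < 0.5                  # random 50/50 partition
    test = ~train
    model = PCA(n_components=1).fit(X[train])    # model of the "normal" linear structure
    score_sum[test] += reconstruction_error(model, X[test])
    score_cnt[test] += 1

avg_scores = score_sum / np.maximum(score_cnt, 1)  # average score over the rounds in which a point was held out
print(np.argmax(avg_scores))                       # the off-subspace point (index 300) should score highest
```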
