The Basic Outlier Detection Models

(Notes from Outlier Analysis – by Charu Aggarwal)

Factors influencing choice of outlier model:

  • Data Type
  • Data Size
  • Availability of labeled outliers
  • Need for interpretability (very desirable)

Interpretability of Model Results

A model that can describe why a particular data point is considered an outlier gives the analyst additional hints about the diagnosis and the next steps required. Different models have differing levels of interpretability. Typically, models that work with the original attributes and use fewer data transforms have higher interpretability. The trade-off is that some transformations separate outliers from normal data points more effectively.

Feature Selection in Outlier Detection

Feature selection is hard to perform because of the unsupervised nature of the problem (there is no ground truth to guide us). A common measure of the non-uniformity of univariate data is the Kurtosis measure. To compute it:

  • Standardize data to zero mean and unit variance
  • The Kurtosis measure is the mean value of the fourth power of the standardized data points: K(z_1, …, z_N) = (z_1^4 + … + z_N^4) / N

Non-uniform feature distributions have a high level of Kurtosis, so subspace outlier detection methods often pick the features with the highest Kurtosis. However, Kurtosis analyzes each feature individually and therefore ignores the interactions between attributes. One option is to compute the Kurtosis of the distance distribution in a lower-dimensional subspace and iteratively add the features that keep the Kurtosis highest.
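
A minimal sketch of per-feature Kurtosis scoring, assuming NumPy is available; the function names and the top_k parameter are illustrative choices, not from the book:

```python
import numpy as np

def kurtosis_score(x):
    """Mean fourth power of the standardized values of a 1-D feature."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4)

def select_features_by_kurtosis(X, top_k=1):
    """Rank the columns of an (N, d) array by Kurtosis and keep the top_k."""
    scores = np.array([kurtosis_score(X[:, j]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k], scores

# A heavy-tailed feature receives a higher score than a uniform one.
rng = np.random.default_rng(0)
X = np.column_stack([rng.standard_t(df=3, size=1000),  # heavy tails
                     rng.uniform(size=1000)])           # light tails
selected, scores = select_features_by_kurtosis(X)
print(selected, scores)  # feature 0 is selected
```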

A second approach is to use the connections between the outlier detection problem and supervised learning. Features that are uncorrelated with all other features should be considered irrelevant, because uncorrelated features cannot be used to model data dependencies (we are building a model of the normal dependencies in the data).

Extreme Value Analysis

The most basic form of outlier detection is extreme-value analysis of 1-dimensional data. Values that are either too large or too small are flagged as outliers. The key is to determine the statistical tails of the underlying distribution. Extreme values are distinct from outliers in general, which are defined by their generative probabilities rather than by the extremity of their values. We can generalize to multivariate data by determining the points at the multidimensional outskirts of the data. Note that such methods can only detect specific kinds of outliers.

Most outlier detection algorithms quantify the deviation of data points from the normal patterns in the form of a numerical score. Extreme-value analysis is then applied as a final step to classify the points with extreme scores as outliers.
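
For instance, a minimal sketch that applies a z-score tail test to a vector of outlier scores; the threshold of 3 is an arbitrary illustrative choice:

```python
import numpy as np

def extreme_value_labels(scores, z_threshold=3.0):
    """Flag scores in the upper statistical tail as outliers.

    `scores` is a 1-D array of outlier scores produced by some detector;
    the z-score threshold is an arbitrary choice for illustration.
    """
    z = (scores - scores.mean()) / scores.std()
    return z > z_threshold  # boolean outlier labels

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), [9.0]])  # one extreme score
labels = extreme_value_labels(scores)
print(labels[-1])  # True: the injected extreme score is flagged
```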

Probabilistic and Statistical Models

The data is modeled in the form of a closed-form probability distribution, and the parameters of the model are learned. The key assumption is the choice of the data distribution with which the modeling is performed. An Expectation-Maximization (EM) algorithm is applied to the observed data so that the likelihood of the process generating the data is as large as possible. The method outputs the density-based fit of each data point to the modeled distribution, and points with very low fit values may be considered outliers.
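
A minimal sketch, assuming scikit-learn is available; GaussianMixture fits its parameters with EM, and the number of components is an assumption made for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two Gaussian clusters plus one far-away point.
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(8, 1, (200, 2)),
               [[20.0, 20.0]]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # fitted via EM

# score_samples returns the per-point log-likelihood; a low fit means
# a high outlier score.
outlier_scores = -gmm.score_samples(X)
print(np.argmax(outlier_scores))  # 400, the injected point
```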

A major advantage is that probabilistic models can be applied to virtually any data type, as long as an appropriate generative model is available. A drawback is that they try to fit the data to a particular kind of distribution, and as the number of parameters increases, overfitting becomes more common. Parametric models are also harder to interpret, because the parameters cannot always be intuitively presented in terms of the underlying attributes.

Linear Models

Linear models use linear correlations to model data along lower-dimensional subspaces.

Principal Component Analysis (PCA) is closely related to dimensionality reduction. Here, we determine the hyperplane that minimizes the least-squares error (distance) of the data to the hyperplane. This hyperplane represents a lower-dimensional subspace with the least reconstruction error after projection. Outliers have a high reconstruction error, so we use the error as the outlier score.
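
A minimal sketch of reconstruction-error scoring, assuming scikit-learn's PCA; the number of retained components is an arbitrary choice for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reconstruction_scores(X, n_components=1):
    """Outlier score = squared reconstruction error after projecting the
    data onto the top principal components."""
    pca = PCA(n_components=n_components)
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    return np.sum((X - X_rec) ** 2, axis=1)

# Correlated 2-D data plus one point that breaks the linear pattern.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=300)])
X = np.vstack([X, [[0.0, 5.0]]])  # violates the y = 2x relationship

scores = pca_reconstruction_scores(X)
print(np.argmax(scores))  # 300, the injected point
```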

Dimensionality reduction and regression modeling are hard to interpret intuitively in terms of the original attributes of the data.

Proximity-Based Models

These models treat outliers as points that are isolated from the remaining data on the basis of similarity or distance functions. This is the most popular class of methods used in outlier analysis. There are three types of methods:

  • Clustering methods
  • Density-based methods
  • Nearest-neighbor methods

In clustering and density-based approaches, we find the dense regions in the data and define outliers as points that are located far away from these dense regions. Clustering methods segment the data points, whereas density-based approaches segment the data space.

In nearest-neighbor methods, the distance of each data point to its k-th nearest neighbor is reported as its outlier score. Choose k > 1 so that small clusters of points far away from the rest of the data can also be identified. This method can be computationally expensive, since the k-th nearest neighbor must be computed for every point in the dataset, which is O(N^2). Optimizations exist to prune the number of computations when binary labels are output instead of scores.
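
A minimal sketch of the k-th nearest-neighbor score, assuming scikit-learn; k = 5 is an arbitrary choice:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Outlier score = distance to the k-th nearest neighbor."""
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)  # shape (N, k + 1), sorted ascending
    return distances[:, -1]          # distance to the k-th true neighbor

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (300, 2)), [[10.0, 10.0]]])
print(np.argmax(knn_outlier_scores(X)))  # 300, the isolated point
```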

Clustering methods follow these steps:

  • Use a clustering algorithm to determine the dense regions of the data set
  • Score = measure of fit of the data point to the different clusters, e.g., the distance of the data point to the nearest centroid (a sketch follows this list).
  • It is advisable to cluster the data multiple times and average the scores from the different runs.
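
A minimal sketch of this procedure, assuming scikit-learn's KMeans; the number of clusters and runs are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_outlier_scores(X, n_clusters=2, n_runs=5):
    """Outlier score = distance to the nearest centroid, averaged over
    several clustering runs with different random seeds."""
    all_scores = []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
        # transform() gives the distance to every centroid; keep the closest.
        all_scores.append(km.transform(X).min(axis=1))
    return np.mean(all_scores, axis=0)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (150, 2)),
               rng.normal(6, 1, (150, 2)),
               [[15.0, -15.0]]])
print(np.argmax(clustering_outlier_scores(X)))  # 300, far from every dense region
```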

Density-based methods divide the data space into small regions, and the number of points in these regions is used to compute the outlier scores. The scores are interpretable if the sparse regions can be presented as combinations of the original attributes (segmentation from a semantic point of view). Other density methods, such as kernel density estimation, are not as directly interpretable.
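
A minimal sketch of a grid-based density score, assuming NumPy; the number of bins per dimension is an arbitrary choice:

```python
import numpy as np

def grid_density_scores(X, bins=10):
    """Outlier score = negative log of the point count in each point's
    grid cell, so points in sparse cells receive high scores."""
    counts, edges = np.histogramdd(X, bins=bins)
    # Map each point to its cell index along every dimension.
    idx = [np.clip(np.digitize(X[:, j], edges[j][1:-1]), 0, bins - 1)
           for j in range(X.shape[1])]
    return -np.log(counts[tuple(idx)])

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (500, 2)), [[6.0, 6.0]]])
scores = grid_density_scores(X)
print(scores[-1] == scores.max())  # True: the injected point sits in a near-empty cell
```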

Information-Theoretic Models

Construct a code book that can represent the data set. Outliers are defined as data points whose removal results in the largest decrease in description length. Determining the minimum-length coding is computationally intractable, so many heuristic models are used. This approach is complementary to other conventional models: it uses similar techniques to create the coding representation but computes the scores indirectly, expressed as the differential impact of removing the outlier from the data set. It is therefore better suited to settings where quantifying the coding cost is more convenient than directly measuring the deviations.
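
A toy sketch of this idea under a deliberately simple coding model (per-attribute value frequencies with cost -log2 P(value)); the coding model and function names are illustrative assumptions, not the book's construction:

```python
import numpy as np
from collections import Counter

def description_length(rows):
    """Total code length in bits of a categorical dataset under a simple
    per-attribute frequency model: cost(value) = -log2 P(value)."""
    rows = np.asarray(rows, dtype=object)
    n = rows.shape[0]
    total = 0.0
    for j in range(rows.shape[1]):
        counts = Counter(rows[:, j])
        total += sum(-np.log2(counts[v] / n) for v in rows[:, j])
    return total

def removal_scores(rows):
    """Outlier score = decrease in description length when the row is removed."""
    rows = np.asarray(rows, dtype=object)
    base = description_length(rows)
    return np.array([base - description_length(np.delete(rows, i, axis=0))
                     for i in range(rows.shape[0])])

data = [["red", "small"], ["red", "small"], ["red", "small"],
        ["red", "small"], ["blue", "large"]]  # the last row is the rare one
print(np.argmax(removal_scores(data)))        # 4
```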

High-dimensional Outlier Detection

High-dimensional data is particularly challenging for outlier detection because many dimensions may be noisy and irrelevant for anomaly detection. Irrelevant attributes reduce the accuracy of distance computations and negatively impact the outlier scores. In high-dimensional space the data becomes increasingly sparse and all pairs of data points become almost equidistant from one another, so outlier scores become indistinguishable.

The remedy is subspace outlier detection: work in a lower-dimensional local subspace of relevant attributes. The assumption is that outliers are hidden in the unusual local behavior of low-dimensional subspaces, and this behavior is masked by a full-dimensional analysis. The approach explicitly searches for subspaces in which anomalies can be found; finding the correct subspace is the challenging part. It is only meaningfully used with ensemble approaches that identify multiple subspaces and combine the predictions, as sketched below.
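
A minimal sketch of such an ensemble (a simple feature-bagging scheme over random subspaces combined with k-NN scores), assuming scikit-learn; all parameters are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def subspace_ensemble_scores(X, n_subspaces=20, k=5, seed=0):
    """Average the k-NN distance score over randomly sampled subspaces."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_subspaces):
        # Sample a random subset of between 2 and d features.
        size = rng.integers(2, d + 1)
        features = rng.choice(d, size=size, replace=False)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[:, features])
        dist, _ = nn.kneighbors(X[:, features])
        scores += dist[:, -1]
    return scores / n_subspaces

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (300, 10))
X[0, :2] = [8.0, 8.0]  # anomalous only in the first two dimensions
print(np.argmax(subspace_ensemble_scores(X)))  # 0
```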
