Anomaly detection

Anomaly detection (also called outlier detection) refers to the problem of finding patterns in data that do not conform to expected behavior. These non-conforming patterns are referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants, depending on the application domain.

Anomaly detection is a challenging problem often faced in data analysis. In almost any set of data it is quite common that some items or observations do not conform to an expected pattern or to the other items in the dataset.

There can be several causes for this. Very often anomalous data are the result of an error or noise. In other cases, the presence of anomalous data can indicate an irregularity that should be investigated in greater detail.

As mentioned above, there are two main causes of anomalous data:

  • error, noise
  • irregular behavior

In the first case it is more common to speak about an outlier. The second case is called an anomaly or a novelty, depending on the context. Outliers are usually dealt with differently than anomalies. Anomalies very often bear significant information, and their detection can be one of the primary reasons for interest in a dataset. Anomaly detection has a wide range of applications such as fraud detection, abuse and network intrusion detection, surveillance, and diagnosis. With the advent of IoT, anomaly detection will likely play a key role in IoT use cases such as monitoring and predictive maintenance.

Outliers, on the other hand, are considered instances that deviate from the rest of the data with no deeper, intrinsic meaning; their behavior can be explained as error or noise and they can thus be ignored. Nevertheless, whether we do clustering, classification or some other machine learning task, it is of great importance to identify outliers and handle them in some way in order to achieve optimal model performance.

A key aspect of any anomaly detection technique is the nature of the input data. Input data can be categorized based on the relationships present among data instances. Most existing anomaly detection techniques deal with point data, in which no relationship is assumed among the data instances. In general, however, data instances can be related to each other; examples are sequence data, spatial data, and graph data. In sequence data, for instance, the data instances are linearly ordered, e.g. time-series data.

Types of anomalies

Anomalies can be grouped into the following three classes:

  1. Point anomalies
  2. Contextual anomalies
  3. Collective anomalies

A point anomaly is an individual data instance that can be considered anomalous with respect to the rest of the data. This is the simplest type of anomaly and the focus of the majority of research on anomaly detection. For example, a purchase with an unusual transaction value is a point anomaly and can indicate potential fraud to the issuer of the related credit card.
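
To make this concrete, here is a minimal sketch of point anomaly detection with a simple z-score rule; the transaction values and the three-standard-deviations threshold are illustrative assumptions, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 "normal" transaction values around 20, plus one extreme purchase
values = np.append(rng.normal(loc=20, scale=3, size=200), 950.0)

# Flag any value more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 3])  # -> [950.]
```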

A contextual anomaly is a data instance that is anomalous in a specific context (but not otherwise). This means that observing the same point in different contexts will not always indicate anomalous behavior. A contextual anomaly is determined by combining contextual and behavioral features. The most frequently used contextual features are time and space, while the behavioral features depend on the domain being analyzed: the amount of money spent, the average temperature, or some other quantitative measure used as a feature.

Temperature measurement can serve as a simple example. In such a time series we can find two equally high temperatures t1 and t2 where the first value is considered an anomaly and the second one normal due to the context: the first value was measured in the week before Christmas and the second one in July.
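
A minimal sketch of this idea follows: each reading is scored against the distribution of its own context (its month), so a 25 °C value passes in July but is flagged in December. The synthetic seasonal data and the 3-sigma rule are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
months = np.tile(np.arange(1, 13), 30)  # 30 years of monthly readings
temps = 10 - 12 * np.cos(2 * np.pi * (months - 1) / 12) \
        + rng.normal(scale=2, size=months.size)

# Inject a 25-degree reading in December; the same value would be
# perfectly normal in July (behavioral feature: temperature,
# contextual feature: month)
months = np.append(months, 12)
temps = np.append(temps, 25.0)

df = pd.DataFrame({"month": months, "temp": temps})
# Score each reading against the mean/std of its own month
stats = df.groupby("month")["temp"].agg(["mean", "std"])
z = (df["temp"] - df["month"].map(stats["mean"])) / df["month"].map(stats["std"])
print(df[z.abs() > 3])  # the injected December reading is flagged
```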

A collective anomaly is a collection of related data instances that is anomalous with respect to the entire dataset. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous. The following picture displays a collective anomaly corresponding to an Atrial Premature Contraction in an EKG (Chandola et al., 2009).

[Figure: collective anomaly (Atrial Premature Contraction) in an EKG]

Remark: Collective anomalies can occur only in data in which data instances are related, that is, not independent and identically distributed.
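
As a rough illustration, consider the sketch below: in a noisy periodic signal, a long flat run is anomalous as a group even though every individual value lies in the normal range. The signal, window size and threshold are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
# A periodic signal with noise; values oscillate between roughly -1 and 1
signal = np.sin(np.linspace(0, 20 * np.pi, 1000)) + rng.normal(scale=0.1, size=1000)
signal[400:460] = -1.0  # an abnormally long low run, individually unremarkable

# A healthy oscillating window has high variance; a flat run has almost none
window = 50
var = np.array([signal[i:i + window].var() for i in range(len(signal) - window + 1)])
print(np.where(var < 0.05 * np.median(var))[0])  # window starts inside the flat run
```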

Anomaly detection techniques

Handling collective anomalies is quite challenging and domain specific, so only limited research has been done on them. Contextual anomaly detection is very often reduced to point anomaly detection, and because of this we will focus our attention on point anomalies.

One traditional and widespread technique is the static rules approach. The idea is to identify a list of known anomalies and then write rules to detect them. Rule identification is done by a domain expert, by using pattern mining techniques, or by a combination of both. The rules can then be implemented directly in the information system, or one of several existing expert systems can be used. Although relatively simple, static rules-based systems tend to be brittle and complex. Furthermore, identifying the rules is often a complex and subjective task. Therefore, statistical and machine learning based approaches, which learn the general rules automatically, have recently been preferred to static rules.
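
As a minimal sketch, a static-rules detector can be as simple as a handful of expert-written checks; the field names and thresholds below are hypothetical:

```python
def is_anomalous(transaction):
    """Flag a transaction if any expert-written rule fires."""
    rules = [
        transaction["amount"] > 10_000,              # unusually large purchase
        transaction["country"] not in {"US", "CA"},  # outside the usual region
        transaction["hour"] < 5,                     # activity at an odd hour
    ]
    return any(rules)

print(is_anomalous({"amount": 12_500, "country": "US", "hour": 14}))  # True
print(is_anomalous({"amount": 45.0, "country": "CA", "hour": 14}))    # False
```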

If labeled data exist (that is, training data are available), it is possible to use supervised machine learning. Though standard techniques are used, their application has some caveats caused by the specific character of the data. Anomalies are usually very sparse in training data, so standard classification methods such as SVM or Random Forest will classify almost all data as normal, because doing so already yields a very high accuracy score (e.g. 99.9 % if anomalies occur once in a thousand instances).
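
The caveat is easy to demonstrate with synthetic labels: when anomalies are one in a thousand, a degenerate classifier that marks everything as normal still reaches 99.9 % accuracy:

```python
import numpy as np

y_true = np.zeros(100_000, dtype=int)
y_true[:100] = 1                  # anomalies: one in a thousand
y_pred = np.zeros_like(y_true)    # a "classifier" that marks everything as normal

print((y_pred == y_true).mean())  # 0.999 -> high accuracy, zero anomalies caught
```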

Generally, this class imbalance is handled using an ensemble built by resampling the data many times. The idea is to create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. four times as many normal points as anomalous ones). A classifier is then built for each dataset using SVM or Random Forest, and those classifiers are combined using ensemble learning.
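
A minimal sketch of such a resampling ensemble follows. The function names are mine, and the 4:1 ratio, number of members and forest size are illustrative choices, not the only reasonable ones:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_resampled_ensemble(X, y, n_members=10, ratio=4, seed=0):
    """Train one Random Forest per resampled dataset (y == 1 marks anomalies)."""
    rng = np.random.default_rng(seed)
    anomalous, normal = np.where(y == 1)[0], np.where(y == 0)[0]
    members = []
    for _ in range(n_members):
        # Each member sees all anomalies plus a fresh random subset of
        # normal points, four times the number of anomalies
        subset = rng.choice(normal, size=ratio * len(anomalous), replace=False)
        idx = np.concatenate([anomalous, subset])
        members.append(RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx]))
    return members

def predict_ensemble(members, X):
    # Combine the members by majority vote
    votes = np.mean([m.predict(X) for m in members], axis=0)
    return (votes >= 0.5).astype(int)
```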

If labeled data do not exist (which is the more frequent case), unsupervised learning methods need to be applied. Methods like nearest neighbor, clustering, statistical analysis, and Kolmogorov complexity were traditionally used. Recently a few new powerful algorithms have emerged: Local Outlier Factor, Isolation Forest, and One-class SVM (Support Vector Machine). To get some feeling for the anomaly detection problem, let us examine a few analyses of artificial data, using an example from the scikit-learn library. We will use three datasets: we generate the data and then add some outliers. The first dataset has a Gaussian distribution; the second and third contain two well-separated clusters. We randomly add uniformly distributed outliers and compare four different detection methods: One-class SVM, Robust covariance, Isolation Forest and Local Outlier Factor.
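
The sketch below reproduces the spirit of that scikit-learn example on one of the datasets (two well-separated clusters with uniform outliers); the sample sizes and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
n_inliers, n_outliers = 300, 45

# Two well-separated clusters plus uniformly distributed outliers
X = np.concatenate([
    0.3 * rng.randn(n_inliers // 2, 2) + [2, 2],
    0.3 * rng.randn(n_inliers // 2, 2) + [-2, -2],
    rng.uniform(low=-6, high=6, size=(n_outliers, 2)),
])

contamination = n_outliers / (n_inliers + n_outliers)
detectors = {
    "One-class SVM": OneClassSVM(nu=contamination, gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=contamination),
    "Isolation Forest": IsolationForest(contamination=contamination, random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=35, contamination=contamination),
}

for name, det in detectors.items():
    # fit_predict returns +1 for inliers and -1 for outliers
    labels = det.fit_predict(X)
    print(f"{name}: {np.sum(labels == -1)} points flagged as outliers")
```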

[Figures: the three datasets with outliers as detected by One-class SVM, Robust covariance, Isolation Forest and Local Outlier Factor]

We can see that Local Outlier Factor and Isolation Forest outperform the other two methods. One-class SVM performs the worst, and Robust covariance performs well only in the case of Gaussian-distributed data. In my next, more technical blog post I will discuss the Isolation Forest algorithm [1] in greater detail.

[1] Liu, F. T., Ting, K. M. and Zhou, Z.-H. (2008) 'Isolation Forest', in Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM). Available at: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf