Overview of Outlier Detection Techniques in Statistics and Machine Learning

Today we will discuss outlier detection in statistics and machine learning, with a focus on the techniques used.

We will cover the following:

What is Outlier Detection?
Application Areas of Outlier Detection
Types of Outliers
Causes of Outliers
Outlier Detection Techniques

1. What is Outlier Detection?
An outlier, also called an anomaly, is a data point that has low probability under the model of the data and for which the model's predictions may be of low accuracy. The techniques applied to detect such data points are termed outlier detection or anomaly detection.

Detecting anomalies and removing them from a dataset often improves accuracy, although not always, since some outliers are genuine observations. If we examine the two plots shown in Figure 1, it is easy to see data points that do not appear to correspond with the rest of the observations. The question is, ‘how do we handle such data points?’ That is what we are going to examine under the applications and techniques of outlier detection.

Figure 1: Data Containing Outliers

2. Application Areas of Outlier Detection

Medical System Monitoring: An example would be a system that monitors a patient’s pulse rate and creates a plot. An outlier in the plot may indicate a serious problem that needs urgent attention.

Other applications include intrusion detection and sensor networks.

3. Types of Outliers

There are two categories of outliers:

Univariate Outliers: These are anomalies that can be found by examining the distribution of values in a single feature.

Multivariate Outliers: These are outliers/anomalies that may be found in a d-dimensional space of d features. Managing these types of outliers requires statistical models and can be handled by statistical applications.
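
To illustrate why multivariate outliers need the features to be examined jointly, the sketch below uses a small made-up 2-D dataset in which one point looks ordinary on each feature alone but breaks the relationship between the two. It scores each point by its distance from the first principal axis of the data, which for 2-D data is a simple PCA-style reconstruction error. The data, the function name off_axis_distance, and the choice of PCA for this illustration are assumptions made here, not a method prescribed by this article.

```python
import math
import statistics

def off_axis_distance(points):
    """Distance of each 2-D point from the first principal axis of the data."""
    mx = statistics.fmean([x for x, _ in points])
    my = statistics.fmean([y for _, y in points])
    centered = [(x - mx, y - my) for x, y in points]
    n = len(points)
    # Entries of the 2x2 covariance matrix.
    sxx = sum(dx * dx for dx, _ in centered) / n
    syy = sum(dy * dy for _, dy in centered) / n
    sxy = sum(dx * dy for dx, dy in centered) / n
    # Angle of the leading eigenvector (direction of largest variance).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Component of each centered point along the normal (-uy, ux) to that axis.
    return [abs(-uy * dx + ux * dy) for dx, dy in centered]

# Hypothetical data: y roughly equals x, except for the last point.
# The last point has x = 2 and y = 8, both within the range of the other values.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 4.0), (5.0, 5.2),
          (6.0, 5.8), (7.0, 7.1), (8.0, 8.0), (9.0, 9.1), (2.0, 8.0)]

errors = off_axis_distance(points)
worst = points[errors.index(max(errors))]
print(worst)
```

A check on x alone or y alone would not single out the last point, because both of its coordinates fall inside the range of the other values. Only the joint view does.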

Outliers may also be classified as point outliers, which are single data points separated from the rest, or contextual outliers, which are values that are anomalous only in a particular context.
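
The idea of a single data point being separated from the others can be made concrete with a distance score. The sketch below, which assumes made-up 2-D data and a hypothetical function name kth_neighbor_distance, scores each point by the distance to its k-th nearest neighbor, in the spirit of the k-nearest neighbor technique listed in section 5. It is one possible scoring rule, not the only one.

```python
import math

def kth_neighbor_distance(points, k=3):
    """For each point, return the distance to its k-th nearest other point.

    Assumes there are more than k points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: a tight group plus one isolated point.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (1.0, 0.8), (6.0, 6.5)]

scores = kth_neighbor_distance(points, k=3)
most_isolated = points[scores.index(max(scores))]
print(most_isolated)  # prints (6.0, 6.5)
```

Points inside the group have small scores because their neighbors are close, while the isolated point has a large score. How large a score must be before a point is called an outlier is a judgment that depends on the data.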

4. Causes of Outliers

Some of the common causes of outliers include:
Sampling errors that may result from the source of the data
Outliers added deliberately to achieve certain objectives
Human error during data entry
Measurement errors introduced by the data collection and measurement tools

5. Outlier Detection Techniques
Density-based techniques such as k-nearest neighbor
Cluster Analysis
z-Score: The z-score, or standard score, is a parametric measure that indicates how many standard deviations a data point is from the mean.
Linear models such as Principal Component Analysis
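
Of the techniques above, the z-score is the easiest to show in a few lines. The following is a minimal sketch rather than a definitive implementation. The pulse-rate readings are made up, the function names z_scores and flag_outliers are hypothetical, and the cutoff of 3 standard deviations is a common convention assumed here, not a value given in this article.

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so no value deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical pulse-rate readings (beats per minute) with one extreme value.
readings = [72, 75, 71, 74, 73, 76, 70, 72, 74, 73, 75, 71, 180]

print(flag_outliers(readings))  # prints [180]
```

Note that the extreme value itself inflates the mean and the standard deviation, so on small samples a z-score cutoff can miss outliers that look obvious on a plot.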

Other techniques exist that are not covered in this article.

Final Notes
Outlier detection is an important concept that every researcher needs to appreciate, because the accuracy of research results depends on the consistency of the samples used.