Dealing With Outliers In Machine Learning
No matter how careful you are during data collection, every data scientist has felt the frustration of finding outliers. An outlier is a data point that is noticeably different from the rest. They represent errors in measurement, bad data collection, or simply show variables not considered when collecting the data. Wikipedia defines it as ‘an observation point that is distant from other observations.’ Outliers threaten to skew your results and render inaccurate insights. How to find and handle outliers in machine learning and its impact on models. How to deal with outliers?
Why You Shouldn’t Just Delete Outliers
Many data analysts are tempted to delete outliers. However, this is sometimes the wrong choice. Occasionally, Like in conventional analytical models, in machine learning, too, you need to resist the urge to simply hit the delete button when you come across such an anomaly, to improve your model’s accuracy. So, rather than a knee-jerk reaction, one must tread with caution while handling outliers.
One cannot recognize outliers while collecting data; you won’t know what values are outliers until you begin analyzing the data.
Many statistical tests are sensitive to outliers and therefore, the ability to detect them is an important part of data analytics.
The interpretability of an outlier model is very important, and decisions seeking to tackle an outlier need some context or rationale.
In fact, outliers sometimes can be helpful indicators. For example, in some applications of data analytics like credit card fraud detection, outlier analysis becomes important because here, the exception rather than the rule may be of interest to the analyst.
Simplistically speaking, here are some options you have when you detect outliers: accept them, correct them or delete them. If there’s a chance that the outlier will not significantly alter the outcome, you may “accept” it. Otherwise, you can either ‘correct’ it or delete it. However, you should reserve deletion only for data points that are definitely wrong.
Impact On Machine Learning Models
Machine learning algorithms, too, are at risk to the statistics and distribution of the input variables. In supervised models, outliers can deceive the training process resulting in prolonged training times, or lead to the development of less precise models.
According to Alvira Swalin, a data scientist at Uber, machine learning models, like linear & logistic regression are easily influenced by the outliers in the training data. Some models even exist that hike the weights of misclassified points for every repetition of the training.
Detecting Outliers In Statistics Normal Situations
There is no one method to detect outliers because of the facts at the center of each dataset. One dataset is different from the other. A rule-of-the-thumb could be that you, the domain expert, can inspect the unfiltered, basic observations and decide whether a value is an outlier or not.
There are more scientific methods, though. You can carry out two types of analysis to find outliers – uni-variate, which involves just one variable, and multi-variate.
Box plot: In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers in machine learning will appear separate from the plot. (Source: Wikipedia)
Scatter plot: A scatter plot is a chart type that is normally used to observe and visually display the relationship between variables. The values of the variables are represented by dots. The positioning of the dots on the vertical and horizontal axis will inform the value of the respective data point; hence, scatter plots make use of Cartesian coordinates to display the values of the variables in a data set. Scatter plots are also known as scattergrams, scatter graphs, or scatter charts. (Source: CFI)
Z-score: A Z-score is a numerical measurement that describes a value’s relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean (Source: Investopedia).
Z-score is a measure of a point’s relationship to the average of all points in the dataset. When scored, the values receive a positive or negative number. This number is the number of standard deviations above or below the average value this value is. Thus, when an analyst calculates z-scores and finds data points with a value above 1, he has found the outliers.
Detecting Outliers in Machine Learning
In machine learning, however, there’s one way to tackle outliers: it’s called “one-class classification” (OCC). This involves fitting a model on the “normal” data, and then predicting whether the new data collected is normal or an anomaly.
However, one-class classifiers can only identify if the new data is ‘normal’ relative to the data it was initially fed. In other words, the OCC will give incorrect predictions if the training set has outliers.
Author Charu C Aggarwal, in his book “Outlier Analysis”, discusses many outlier detection methods. Some notable ones include:
Probabilistic and Statistical Models: You can use statistics to identify unlikely outcomes.
- Linear Models: In this model, which is interpreted as a linear combination of features. direct correlations are used to model the data into lower dimensions. As an example, principal component analysis and data with large residual errors may be outliers.
- High-Dimensional Outlier Detection: High-dimensional data present a major challenge. Many of the current algorithms cannot address the problems from a large number of features. A paper by Aggarwal and his colleague Philip S Yu states that, for effectiveness, high dimensional outlier detection algorithms must satisfy many properties, including the provision of interpretability in terms of the reasoning which creates the abnormality.
In machine learning, one cannot just “ignore” data outliers. They can impair the training process, create cascading errors.
Image by Ronile from Pixabay
Reference: Machine Learning Mastery
An Engine That Drives Customer Intelligence
Oyster is not just a customer data platform (CDP). It is the world’s first customer insights platform (CIP). Why? At its core is your customer. Oyster is a “data unifying software.”
Liked This Article?
Gain more insights, case studies, information on our product, customer data platform