Robustness in Regressions
- Juan Manuel Ortiz de Zarate
- Jul 12, 2024
- 10 min read
Updated: Aug 6, 2024
In this article, we will present techniques for performing regressions in a robust manner. Robust regressions aim to fit reliable estimates while avoiding the distortions caused by outliers.
To achieve this, we will explore two main approaches. The first approach involves modifying the loss function so that the residuals of outlying observations are penalized less heavily. By altering the loss function in this way, the influence of outliers on the regression model is minimized, leading to more accurate and reliable estimates.
The second approach employs algorithmic techniques, which involve making numerous estimations and selecting those that are not influenced by outliers. By iteratively evaluating different subsets of the data, it becomes possible to identify and rely on more stable estimates that reflect the underlying trend without being skewed by extreme values.
These methods help ensure that the regression model remains accurate and reliable, even in the presence of atypical data points. Through this article, we will delve into the specifics of these techniques and demonstrate how they can be applied to enhance the robustness of regression analyses.
But first, let’s review what an outlier is!
Outliers
An outlier is defined as "an observation that deviates so markedly from other observations as to arouse suspicions that it was generated by a different mechanism" (D. Hawkins, "Identification of Outliers," 1980). In the literature, outliers are often referred to as abnormalities, discordances, deviations, or anomalies.
Outliers can have a significant impact on regression analyses, distorting the results and leading to incorrect conclusions. They can affect the accuracy of predictive models, making it essential to identify and manage them effectively. By understanding what outliers are and how they can influence our data, we can apply robust regression techniques to mitigate their effects and improve the reliability of our models.
The concept of outliers is subjective and context-dependent. Deciding whether to remove, ignore, or investigate them further depends on the specific problem and the data set at hand. For example, consider a user indicating they are 2 meters tall and weigh 40 kg. This data point is highly unusual and likely erroneous, suggesting it could be excluded from the analysis to avoid skewing the results. In another case, suppose a travel agency records a purchase of a hotel and flight package for a trip scheduled five years in advance. This might be uncommon but not necessarily an error, indicating a different approach might be needed to handle such outliers.

Outliers must be carefully evaluated to determine their legitimacy and impact on the analysis. Removing them blindly could result in the loss of valuable information, while ignoring them could compromise the integrity of the model. The goal is to strike a balance, ensuring that the regression model remains robust without discarding meaningful data.
In some situations, our primary task is to detect and analyze these outliers. For instance, in credit card fraud detection, identifying unusual, frequent transactions in different locations can signal a stolen card. Similarly, abrupt changes in sensor readings on satellites can indicate significant weather events. In these cases, outliers are not merely noise to be filtered out but critical signals that provide valuable insights into the phenomena being studied. Therefore, robust techniques are crucial, not only for making accurate predictions but also for uncovering significant anomalies that warrant further investigation.
However, in other cases, outliers can be problematic because they lead the model to pay undue attention to highly improbable events. When a model can inherently handle such anomalies effectively, we describe it as robust. Robust models can distinguish between significant outliers that provide valuable information and those that are simply noise, thereby maintaining their predictive accuracy without being misled by unlikely data points. This capability is essential for building reliable models that perform well across diverse and noisy datasets.
Outliers in Regressions
Outliers in regressions are observations that are significantly distant from the rest of the data. Their identification and treatment are crucial because they can greatly influence the results of a regression analysis. This concept is relative to the specific model under consideration, meaning that an outlier in one model might not be an outlier in another.
The presence of outliers can be attributed to various factors. They may result from measurement errors, excessive noise, or randomness. In some cases, an outlier might belong to a different family or group than the rest of the data, indicating a fundamentally different underlying process.
Understanding the origin and nature of outliers is essential for determining the appropriate approach to handle them. By recognizing their potential causes—whether they stem from errors, noise, or genuine differences—we can make informed decisions about whether to exclude, correct, or otherwise address these anomalies in our regression analyses. This nuanced approach helps ensure that our models remain robust and reliable, providing accurate and meaningful insights despite the presence of outliers.
Types of Outliers
In regression analysis, outliers can be categorized into several types based on their leverage and influence on the model. Understanding these distinctions is crucial for addressing each type appropriately:
Low Leverage Outlier: These are data points whose predictor values do not deviate significantly from the predictors' range but that have a large residual, meaning their observed value is far from the value predicted by the model. While they can affect the accuracy of the model, their overall influence on the regression equation is limited due to their low leverage.
High Leverage Point: These points have extreme values in the predictor variables, placing them far from the center of the predictor space. However, if their observed value aligns closely with the predicted value, they might not have a large residual. Despite not being outliers in terms of the response variable, they have the potential to exert significant influence on the regression model because of their position in the predictor space.
High Leverage Outlier: This type of outlier combines the characteristics of the previous two types. These points are extreme in terms of both the predictor variables and the response variable. They have high leverage and large residuals, making them particularly influential and potentially disruptive to the regression model. Their presence can significantly skew the results, making it essential to detect and address them appropriately.

By identifying and understanding these different types of outliers, we can apply specific techniques to mitigate their impact, ensuring that our regression models remain robust and reliable.
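To make these categories concrete, here is a minimal NumPy sketch (with made-up toy data) that computes each point's leverage from the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ and pairs it with its OLS residual; together, leverage and residual size let you place a point in one of the categories above.

```python
import numpy as np

# Toy data: hypothetical values chosen for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])   # last point: extreme predictor value
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 14.8])   # ...but it lies close to the trend

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives each point's leverage.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# OLS fit and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# A common rule of thumb: leverage above 2p/n is "high" (p = number of coefficients).
threshold = 2 * X.shape[1] / len(x)
for i, (h, r) in enumerate(zip(leverage, residuals)):
    kind = "high leverage" if h > threshold else "low leverage"
    print(f"point {i}: leverage={h:.2f} ({kind}), residual={r:.2f}")
```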
Problem
Outliers can significantly modify the fit of a regression model, distorting conclusions and leading to misleading insights. Their influence can be so strong that they may completely alter the relationships inferred from the data, resulting in unreliable predictions and interpretations.
Identifying these influential points is a critical part of the regression analysis process. By detecting and understanding outliers, analysts can take appropriate measures, such as modifying the model, excluding the outliers, or using robust regression techniques, to mitigate their impact and ensure that the model accurately reflects the underlying data patterns. This careful consideration of outliers helps in maintaining the integrity and reliability of the regression analysis.
Ordinary Least Squares Regression
Ordinary Least Squares (OLS) regression is a widely used strategy for fitting a regression model by minimizing the sum of the squared differences between the observed and predicted values.
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2$$
However, OLS is highly sensitive to outliers. Outliers can disproportionately influence the fit of the regression line because OLS gives equal weight to all observations while minimizing the squared residuals. As a result, even a single outlier with a large residual can significantly alter the slope and intercept of the regression line, leading to distorted estimates, unreliable predictions, and incorrect conclusions about the relationships within the data.

To address this issue, we can replace the squared residuals with an alternative loss function. Fitting OLS amounts to finding the β coefficients that minimize the Residual Sum of Squares (RSS); by swapping the squared residuals for a function that grows more slowly for large errors, we achieve a more robust regression model. This approach helps reduce the undue influence of extreme data points, leading to more reliable and accurate estimates that better reflect the underlying data patterns.
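As a quick illustration of this sensitivity, the following sketch (with synthetic data and an arbitrary corruption of one point) fits OLS with np.polyfit before and after injecting a single outlier; the slope and intercept shift noticeably.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean linear data: y = 2x + 1 plus small noise.
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

def ols_fit(x, y):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    return np.polyfit(x, y, deg=1)

slope, intercept = ols_fit(x, y)
print(f"clean fit:    slope={slope:.2f}, intercept={intercept:.2f}")

# Corrupt a single observation with a large error.
y_out = y.copy()
y_out[-1] += 80.0

slope, intercept = ols_fit(x, y_out)
print(f"with outlier: slope={slope:.2f}, intercept={intercept:.2f}")
```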
L1 Estimator
The L1 estimator, also known as Least Absolute Deviations (LAD), is an alternative to the OLS regression. By removing the square and using the absolute value of the residuals, the influence of outliers is reduced. However, outliers can still have a considerable impact.
$$\hat{\beta}_{\mathrm{LAD}} = \arg\min_{\beta} \sum_{i=1}^{n} \left|y_i - x_i^\top \beta\right|$$
In the L1 estimator, the loss function is the sum of the absolute differences between observed and predicted values. This approach lessens the weight of outliers compared to the squared residuals used in OLS. While this makes the regression more robust to outliers, it does not eliminate their influence entirely.
The L1 norm also appears in a regularization technique called Lasso Regression, where it is applied to the coefficients rather than the residuals. Lasso has the advantage that some of the weights (particularly those of strongly correlated features) are driven to zero, so it can also be used as a feature selection method.
A significant drawback of the L1 estimator is that the absolute value function is not smooth and thus is not differentiable at zero. This lack of smoothness complicates the optimization process, making it more challenging to derive and solve for the β coefficients using traditional calculus-based methods. Despite this, the L1 estimator remains a valuable tool for robust regression, particularly in scenarios where reducing the influence of outliers is crucial.
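One practical way to fit a LAD model is to minimize the sum of absolute residuals directly with a derivative-free optimizer, precisely because the loss is not differentiable at zero. The sketch below (on synthetic data) uses SciPy's Nelder-Mead method; statsmodels' QuantReg with q=0.5 would be an equivalent route.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
y[-1] += 80.0  # inject one outlier

def l1_loss(beta):
    # Sum of absolute residuals |y - (b0 + b1 * x)|.
    return np.abs(y - (beta[0] + beta[1] * x)).sum()

# Nelder-Mead needs no gradients, which suits the
# non-differentiable absolute value loss.
result = minimize(l1_loss, x0=[0.0, 0.0], method="Nelder-Mead")
intercept, slope = result.x
print(f"LAD fit: slope={slope:.2f}, intercept={intercept:.2f}")
```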
LMS Estimator
The Least Median of Squares (LMS) estimator is another robust regression technique designed to reduce the influence of outliers. This method uses the median of the squared residuals instead of the mean, which significantly diminishes the impact of extreme values.
$$\hat{\beta}_{\mathrm{LMS}} = \arg\min_{\beta} \ \operatorname{median}_{i} \left(y_i - x_i^\top \beta\right)^2$$
The median, being a central tendency measure, is less affected by outliers compared to the mean. By focusing on the median of the squared residuals, the LMS estimator effectively down-weights the outliers, leading to a more robust fit.
However, the LMS estimator also has its challenges. Like the L1 estimator, it is difficult to handle computationally due to its non-differentiable nature. The optimization process for LMS can be complex and computationally intensive, often requiring iterative algorithms and heuristic methods to find a solution. Despite these challenges, the LMS estimator is a powerful tool for robust regression, particularly in datasets with significant outliers.
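Because the LMS objective is non-convex and non-differentiable, it is usually attacked with randomized search. The following is a minimal heuristic sketch, not an exact LMS solver: it proposes candidate lines through random pairs of points and keeps the one with the smallest median squared residual (the data and trial count are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
y[:5] += 40.0  # contaminate 10% of the points

def lms_fit(x, y, n_trials=1000, rng=rng):
    """Heuristic LMS: try lines through random point pairs and keep
    the one with the smallest median squared residual."""
    best, best_med = None, np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue  # vertical pair, no finite slope
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        med = np.median((y - (intercept + slope * x)) ** 2)
        if med < best_med:
            best_med, best = med, (slope, intercept)
    return best

slope, intercept = lms_fit(x, y)
print(f"LMS fit: slope={slope:.2f}, intercept={intercept:.2f}")
```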
M-Estimators for Regression
M-estimators[1] for regression represent a generalization of various loss functions used to enhance robustness in regression models. These estimators extend the concept of minimizing different loss functions beyond the simple squared residuals or absolute deviations, providing greater flexibility and robustness against outliers.
$$\hat{\beta}_{M} = \arg\min_{\beta} \sum_{i=1}^{n} \rho\left(y_i - x_i^\top \beta\right)$$
Common functions used for M-estimators include quadratic, absolute value (L1 norm), Huber[4], and bisquare. Each of these functions offers a different way to penalize residuals, with some providing a compromise between sensitivity to small errors and robustness to large deviations.
Once the loss function ρ is determined, the goal is to find the β coefficients that minimize it. This is achieved by differentiating the loss function and solving for the β values that set the derivative to zero. The first derivative of a function is zero at its maximum and minimum points, and given the shape of the candidate ρ functions, this will generally correspond to a minimum; only the bisquare function also has maxima. To better understand this concept, the following picture shows the shape of each function.
[Figure: shapes of the quadratic, absolute value (L1), Huber, and bisquare loss functions]
The following image shows the same dataset pictured before, along with how a regression using an M-estimator could fit it.
[Figure: the earlier dataset fitted with an M-estimator regression]
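In practice, the M-estimation equations are rarely solved by hand; iteratively reweighted least squares handles them. The sketch below (on synthetic contaminated data) uses statsmodels' RLM with the Huber and Tukey bisquare norms.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
y[-3:] += 40.0  # a few outliers

X = sm.add_constant(x)  # adds the intercept column

# Huber's rho penalizes small residuals quadratically and large ones
# linearly; bisquare (TukeyBiweight) caps their influence entirely.
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
bisquare_fit = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()

print("Huber:   ", huber_fit.params)     # [intercept, slope]
print("Bisquare:", bisquare_fit.params)
```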
RANSAC: RANdom SAmple Consensus
RANSAC is an iterative method for estimating a robust regression model in the presence of outliers[2]. It is particularly effective when the data contains a significant proportion of outliers. Below is the pseudocode for the RANSAC algorithm:
1. Initialize: Set Bestmodel to an empty array, Besterror to infinity, and specify the parameters t, n, d, and k.
2. Select Random Subset: Randomly select n values from the dataset.
3. Fit Model: Fit a regression model using these n values.
4. Identify Inliers: Evaluate which of the remaining points could belong to this model (i.e., are within a distance d from the model) and classify these as potential inliers.
5. Check Inliers: If the number of inliers is greater than t, consider this a potentially good model; otherwise, return to step 2.
6. Refit Model: Refit the regression model using all the identified inliers.
7. Evaluate Error: Calculate the error of this model on the inliers.
8. Update Best Model: If this error is less than Besterror, update Besterror with this error and set Bestmodel to the current model.
9. Repeat: Repeat steps 2-8 for k iterations.
This process allows the RANSAC algorithm to iteratively refine the model, improving its robustness by focusing on data points that consistently fit a potential model while discarding outliers.
Key Characteristics:
Non-Deterministic and Iterative: It iterates by randomly selecting data subsets, fitting a model, and evaluating it. Each run can yield different results due to the randomness involved.
Robust Estimation: RANSAC provides robust estimates even when there is a high proportion of outliers in the data.
Potential Drawbacks: The algorithm might not reach a consensus within the given number of iterations, potentially failing to identify a good model if the number of iterations is insufficient.
Hyperparameters:
n: Number of points initially taken to fit the model.
t: Number of inliers required for a model to be considered potentially good.
d: Maximum distance from the model at which a point is considered an inlier.
k: Number of iterations to run.
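Below is a minimal sketch of the algorithm above for a one-dimensional line fit, with the hyperparameters n, t, d, and k exposed explicitly (the data and parameter values are made up for illustration); for real work, scikit-learn's RANSACRegressor provides a production-quality implementation.

```python
import numpy as np

def ransac_line(x, y, n=2, t=10, d=1.0, k=200, rng=None):
    """Minimal RANSAC for a 1-D line fit, mirroring the steps above.
    n: points per random sample, t: inliers needed to accept a model,
    d: inlier distance threshold, k: number of iterations."""
    if rng is None:
        rng = np.random.default_rng()
    best_model, best_error = None, np.inf
    for _ in range(k):
        idx = rng.choice(len(x), size=n, replace=False)
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        # Points within vertical distance d are potential inliers.
        inliers = np.abs(y - (slope * x + intercept)) < d
        if inliers.sum() > t:
            # Refit on all inliers and score the refined model.
            slope, intercept = np.polyfit(x[inliers], y[inliers], deg=1)
            error = np.mean((y[inliers] - (slope * x[inliers] + intercept)) ** 2)
            if error < best_error:
                best_error, best_model = error, (slope, intercept)
    return best_model

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
y[:20] += 30.0  # 20% outliers
print(ransac_line(x, y, n=2, t=50, d=2.0, k=500, rng=rng))
```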
Theil-Sen Estimator
This algorithm[3] calculates the median of the slopes determined from all pairs of points in the dataset and uses this median to estimate the intercept. The Theil-Sen estimator does not rely on hyperparameters, making it straightforward to apply.
The algorithm is as follows:
1. Calculate Slopes: For each pair of points in the dataset, calculate the slope of the line that passes through these two points.
2. Median Slope: Sort the calculated slopes and select the median slope.
3. Calculate Intercepts: Using the median slope $\hat{m}$, calculate the intercept for each point $i$ in the dataset using the formula
$$b_i = y_i - \hat{m}\, x_i$$
4. Median Intercept: Sort the intercepts and select the median intercept.
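The four steps translate almost directly into NumPy; the sketch below (with synthetic data) computes all pairwise slopes, takes their median, and then takes the median of the per-point intercepts.

```python
import numpy as np
from itertools import combinations

def theil_sen(x, y):
    """Median-of-slopes Theil-Sen fit, following the steps above."""
    # Step 1: slope of the line through every pair of points.
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[i] != x[j]]
    # Step 2: the median slope is the robust slope estimate.
    slope = np.median(slopes)
    # Steps 3-4: per-point intercepts b_i = y_i - slope * x_i, then their median.
    intercept = np.median(y - slope * x)
    return slope, intercept

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
y[:5] += 25.0  # contaminate a few points
print(theil_sen(x, y))  # slope and intercept close to (2, 1)
```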
Key Characteristics:
Robustness to Outliers: The Theil-Sen estimator is highly robust to outliers, as it relies on medians rather than means, reducing the influence of extreme values.
Time Complexity: Because the algorithm computes the slope of every pair of points, it has a time complexity of $O(n^2)$, where n is the number of data points.
No Hyperparameters: Unlike many robust regression algorithms, the Theil-Sen estimator does not depend on any hyperparameters, making it easier to implement and apply.
In certain cases, Theil-Sen can outperform RANSAC, as shown in the following example where x-axis outliers perturb RANSAC. Tuning RANSAC's residual_threshold parameter can help, but prior knowledge about the data and outliers is generally required. Due to Theil-Sen's computational complexity, it is recommended for small problems with few samples and features.
[Figure: RANSAC and Theil-Sen fits on a dataset with x-axis outliers]
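A comparison along these lines can be sketched with scikit-learn's RANSACRegressor and TheilSenRegressor on synthetic data with x-axis outliers (all values below are illustrative); depending on the random draw and the default residual_threshold, RANSAC's fit may be pulled toward the corrupted points while Theil-Sen stays close to the true line.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, TheilSenRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# x-axis (high leverage) outliers: extreme predictor values off the trend.
x[:10] = rng.uniform(20, 25, size=10)
y[:10] = rng.uniform(0, 5, size=10)
X = x.reshape(-1, 1)

ransac = RANSACRegressor(random_state=0).fit(X, y)
theil = TheilSenRegressor(random_state=0).fit(X, y)

print("RANSAC:   ", ransac.estimator_.coef_, ransac.estimator_.intercept_)
print("Theil-Sen:", theil.coef_, theil.intercept_)
```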
Conclusions
In this article, we explored various techniques for performing robust regressions, ensuring estimates are not distorted by outliers. Outliers, which significantly deviate from other data points, can greatly impact regression models, especially with Ordinary Least Squares (OLS) methods. Identifying and managing these outliers is crucial to avoid erroneous conclusions.
Robust approaches include modifying loss functions, such as using the absolute value (L1) or Huber functions, and employing generalized estimators like the M-estimators, which reduce the influence of outliers. Algorithmic methods like RANSAC and Theil-Sen provide robust models by applying computational techniques instead of purely statistical ones.
In summary, robust regression techniques improve model accuracy and reliability, even with many outliers. By applying these methods, analysts can derive more valid and generalizable conclusions, ensuring regression models remain resistant to anomalies.
References
[1] Maronna, R. A., Martin, R. D., Yohai, V. J., & Salibián-Barrera, M. (2019). Robust statistics: theory and methods (with R). John Wiley & Sons.
[2] Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395.
[3] Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis. Indagationes mathematicae, 12(85), 173.
[4] Huber, P. J. (1992). Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution (pp. 492-518). New York, NY: Springer New York.