Handling missing data

Missing data, often referred to as missing values, occur when one or more observations in a dataset are incomplete or not recorded. These gaps can arise for various reasons, such as errors in data collection, non-responses in surveys, or technical issues during data processing. Missing values complicate the analytical process, making it difficult to derive accurate insights or conclusions. Importantly, missing values are not always explicitly labeled as such. They can be represented in different ways, such as `null` values in databases, which signify the absence of information, or special characters like a dash ('-') or blank spaces (' '). Because these symbols often serve as placeholders for missing data, the analyst must identify and interpret them correctly to avoid misrepresenting the data.
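
Before any treatment, these placeholders have to be surfaced. As a minimal sketch (assuming pandas and a small hypothetical survey table), the snippet below normalizes dash and blank placeholders to `NaN` and counts the gaps per column:

```python
import numpy as np
import pandas as pd

# Hypothetical raw survey export where missing values hide behind
# placeholder symbols instead of explicit nulls.
df = pd.DataFrame({
    "age": ["34", "-", "29", " "],
    "city": ["Lyon", "Porto", "-", "Graz"],
})

# Normalize the placeholders to NaN so pandas recognizes them as missing,
# then count the gaps per column.
df = df.replace({"-": np.nan, " ": np.nan})
print(df.isna().sum())
```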


Example of a dataset with missing data.

The presence of missing values can have a substantial effect on the conclusions drawn from any subsequent analysis. Ignoring missing data or handling it improperly can skew results, lead to biased outcomes, or even invalidate the entire analysis. For instance, if certain groups or variables are more prone to missing data, failing to account for this imbalance can introduce bias. As such, understanding and addressing the issue of missing data is not merely a technical necessity but a critical step in maintaining the integrity and validity of any analysis or decision-making process that relies on the dataset.


Rubin's classification


Rubin's classification[1] provides a structured way to understand and categorize missing data based on the underlying reasons for the absence of certain values. This classification is crucial because it helps analysts choose the most appropriate methods for handling missing data, ensuring that any analysis based on incomplete datasets remains reliable and accurate. The three categories defined by Rubin are: Missing Completely at Random, Missing at Random, and Missing Not at Random. Each of these categories has distinct characteristics that can influence how we address the gaps in the data.


Illustrations of the classification for the mechanism of missing data. Blue points are observations, whereas red points are missing observations in the y-variable. Source [5]

Missing Completely at Random 


When data is classified as Missing Completely at Random (MCAR), the absence of values is entirely independent of any observed or unobserved data. This means that the probability of missing data is the same for all observations, regardless of their characteristics. In this case, there is no relationship between the missing data and any other variable in the dataset, making the missingness purely coincidental.


  • Uniform Probability: The likelihood that data is missing is equal across all observations. For example, a respondent might forget to answer a survey question purely by accident, and this missingness is not influenced by any other variables or responses.

  • Example: An example of MCAR could be data lost due to a technical glitch in a survey system, where a random subset of responses was not recorded, regardless of the respondents' demographics or other characteristics.

This category is the easiest to manage, as the randomness of the missing data allows analysts to proceed without concern for bias introduced by the missingness. Simple techniques such as deleting the incomplete observations or imputing with the mean of the observed values are sufficient and reliable in these cases.


Missing at Random 


Missing at Random (MAR) describes a situation where the missing data is not entirely random, but its absence can be explained by other variables in the dataset. While the missingness itself is unrelated to the missing data, it correlates with values in other columns or variables that are already observed. In other words, the data isn't missing just by chance, but the missingness is systematically related to other available information.


  • Conditional Missingness: The probability of missing data is linked to other observed variables but not to the missing data itself. For example, missing data in one variable might be related to another variable that is fully observed. In this way, analysts can potentially use other data points to predict the likelihood of missingness.

  • Example: A typical example of MAR occurs in clinical trials. If study participants drop out after experiencing side effects from a new treatment, the outcome data will be missing for those patients, but the missingness is explained by the observed side effects rather than by the unobserved treatment outcomes themselves.

Because the missingness in MAR can be related to other variables, the missing data can often be estimated or modeled. Handling MAR appropriately is crucial because ignoring this type of missing data or applying inappropriate techniques could result in biased outcomes.


Missing Not at Random 


In the Missing Not at Random (MNAR) category, the missing data is related to the value of the data itself. This is the most challenging type of missing data to address because the missingness is likely caused by the very information that is missing, creating a feedback loop of sorts. As a result, simply ignoring the missing data or using traditional imputation methods can introduce significant bias.


  • Non-Random Missingness: The reason for missing data is not arbitrary but often directly tied to the value of the missing data. For instance, if people with higher incomes are less likely to report their earnings, the missing income data is systematically linked to the value itself, creating a pattern of missingness that cannot be ignored.

  • Example: In surveys about personal finance, respondents with higher incomes might be reluctant to disclose their earnings, leading to disproportionately missing data for wealthier individuals. The missingness is not random but tied to the actual value of income.

Failing to properly account for MNAR can lead to highly misleading results, as the missingness is driven by systematic factors related to the variable of interest. This makes it crucial to carefully investigate and handle MNAR cases with specialized approaches, potentially involving expert knowledge or external data sources to inform the analysis.
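
To make the three mechanisms concrete, here is a minimal simulation sketch; the age and income variables and the missingness rates are hypothetical. Because income is simulated independently of age, MCAR and MAR leave the mean of the observed values roughly intact, while MNAR pulls it down:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(18, 80, size=n)           # fully observed covariate
income = rng.normal(50_000, 12_000, size=n)  # variable that will have gaps

# MCAR: a flat 10% chance of missingness, unrelated to anything.
mcar = rng.random(n) < 0.10

# MAR: missingness depends only on the observed age (young people skip).
mar = rng.random(n) < np.where(age < 30, 0.30, 0.05)

# MNAR: missingness depends on the unobserved income value itself.
mnar = rng.random(n) < np.where(income > 65_000, 0.40, 0.05)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: mean of observed income = {income[~mask].mean():,.0f}")
```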


Data Imputation


Data imputation is a crucial technique used to address the issue of missing data by replacing the absent values with statistical estimates. The core goal of any imputation technique is to create a complete dataset, ensuring that it can be effectively used in various analytical tasks, particularly in training machine learning models. When data is missing, it poses a significant problem for many algorithms that require a complete dataset to function properly. Imputation allows us to maintain the integrity of the dataset by filling in the gaps with reasonable estimates, ensuring that the analysis can proceed smoothly without being distorted by missing values.


The main purpose of imputation is to minimize the biases and inaccuracies that might arise from ignoring or improperly handling missing data. A dataset with missing values can lead to biased models, reduced statistical power, and misleading conclusions. By replacing the missing values with estimated ones, we can ensure that the dataset remains comprehensive and suitable for analysis. The imputed dataset enables more accurate training of machine learning models, which in turn leads to better predictions and insights.


Imputation is not just about filling in gaps arbitrarily; the key is to ensure that the imputed values are as close to the true values as possible. Poorly imputed data can introduce new biases or distort relationships within the dataset. Thus, selecting the right imputation method is essential for maintaining the quality and reliability of the analysis.


Univariate Imputation


Univariate imputation methods are applied independently to each variable, filling in missing values without considering the relationships between different variables. These methods are typically simpler and computationally less intensive than multivariate techniques, and they can be very effective when the missing data is minimal or randomly distributed. However, since they do not account for correlations between variables, univariate methods may introduce bias or reduce the predictive accuracy of models in more complex datasets. Below are several univariate imputation techniques, categorized by the type of variable they apply to.


Numerical Variables


When dealing with missing values in numerical data, there are several common approaches to univariate imputation; a short code sketch follows the list:


  • Imputation by Mean/Median: This is one of the most straightforward imputation methods. For any missing value in a numerical variable, the missing data is replaced by the mean (or median) of the observed values in that variable. This method works well when the data is Missing Completely at Random (MCAR) and when the distribution of the data is relatively symmetrical. However, it can distort the variance of the dataset by artificially clustering values around the mean or median.


Example of a dataset with imputed values through the mean.
  • Arbitrary Value Imputation: In this method, missing values are replaced with a fixed, arbitrary value. This might be a specific number, such as zero, or a designated placeholder value that is outside the normal range of the observed data. While simple, arbitrary value imputation can sometimes create unrealistic data points, so it should be used with caution.

  • End-of-Tail Imputation: This technique involves replacing missing values with a value from the extreme end (or "tail") of the data distribution. For example, if a dataset contains income data and some values are missing, an end-of-tail imputation might replace those missing values with very high or very low income values. This method is useful in some scenarios to preserve variability, but it can create extreme outliers that may affect the results.
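
A minimal pandas sketch of the three techniques above; the income values are hypothetical, and mean plus three standard deviations is just one common convention for the end of the tail:

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with gaps.
income = pd.Series([52_000, np.nan, 61_000, 48_000, np.nan, 75_000])

# Mean/median imputation: fill gaps with a central value.
income_mean = income.fillna(income.mean())
income_median = income.fillna(income.median())

# Arbitrary value imputation: a fixed placeholder outside the usual range.
income_arbitrary = income.fillna(-1)

# End-of-tail imputation: a value from the far end of the distribution,
# here mean + 3 standard deviations (one common convention among several).
income_tail = income.fillna(income.mean() + 3 * income.std())
```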

Categorical Variables


For categorical data, where variables represent categories or labels rather than numerical values, univariate imputation techniques take on a different form; a short sketch follows the list:


  • Frequent Category Imputation: A common method for categorical variables is to replace missing values with the most frequently occurring category (also known as mode imputation). For instance, if a column contains gender data with 'Male' and 'Female' as categories, and 'Male' is the most frequent category, missing values can be replaced with 'Male'. This method is easy to apply but can introduce bias by over-representing the most frequent category.

  • Add a “Missing” Category: Another approach is to treat missing values as a distinct category in itself. A new category, such as "Missing," is added to represent observations where data is absent. This method preserves the information about missingness and can be helpful when the fact that data is missing carries its own significance.
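
Both options are one-liners in pandas; a minimal sketch with a hypothetical gender column:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with missing entries.
gender = pd.Series(["Male", "Female", np.nan, "Male", np.nan, "Male"])

# Frequent category (mode) imputation.
gender_mode = gender.fillna(gender.mode()[0])

# Treat missingness as its own category.
gender_flagged = gender.fillna("Missing")
```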

Methods Applicable to Both Numerical and Categorical Data


Some univariate imputation techniques can be applied to both numerical and categorical variables, offering flexible solutions for various types of data; a combined sketch follows the list:


  • Complete Case Analysis (Listwise Deletion): This is not a true imputation technique, but it is a common method used when dealing with missing data. In complete case analysis, any observation that contains missing data is entirely excluded from the analysis. While this can simplify the analysis, it may lead to a loss of valuable data and reduced statistical power, especially if many observations are removed.

  • Add a Missingness Indicator: In this method, a new binary variable is created for each column with missing data, indicating whether a value is missing or not. For example, if some income values are missing, a new variable could be added with "1" for missing values and "0" for observed values. This approach allows the analysis to incorporate the fact that data was missing, which can sometimes be informative in itself.

  • Random Sampling Imputation: This method involves randomly selecting an observed value from the same variable to replace each missing value. The random sampling ensures that the distribution of the imputed values mirrors the distribution of the observed values. This can work well when the data is MAR (Missing at Random) and helps preserve the natural variability of the dataset. However, it does not capture relationships between variables.
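
The following sketch combines the three approaches on a small hypothetical DataFrame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 75_000],
    "city": ["Lyon", "Porto", np.nan, "Graz", "Lyon"],
})

# Complete case analysis: drop every row that contains a missing value.
complete_cases = df.dropna()

# Missingness indicator: a binary flag for each column with gaps.
df["income_missing"] = df["income"].isna().astype(int)

# Random sampling imputation: draw replacements from the observed values
# so the imputed column keeps the original distribution's shape.
rng = np.random.default_rng(0)
observed = df["income"].dropna().to_numpy()
missing = df["income"].isna()
df.loc[missing, "income"] = rng.choice(observed, size=missing.sum())
```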


Multivariate Imputation


While univariate imputation techniques replace missing values by focusing solely on individual variables, multivariate imputation methods take a more comprehensive approach. These methods utilize the information from multiple variables in the dataset to make more accurate and informed estimates of the missing values. By considering the relationships and correlations between variables, multivariate techniques can often produce better, less biased imputations than univariate methods. This is particularly useful in datasets where variables are interdependent, as the absence of data in one column may be related to the values in other columns.


The key advantage of multivariate imputation is that it allows analysts to use the full breadth of information available in a dataset to fill in the gaps. This reduces the likelihood of introducing bias and helps preserve the relationships between variables, making the imputed data more reflective of the true underlying patterns. Multivariate imputation methods are especially important in datasets with non-random missingness or when dealing with complex datasets where variables are highly correlated.


Among the various multivariate imputation techniques, two popular methods are MICE (Multiple Imputation by Chained Equations) and K-Nearest Neighbors (KNN) Imputation. Both of these techniques leverage the relationships between variables to estimate missing values, but they do so in different ways.


MICE 


MICE[2] is one of the most widely used and flexible multivariate imputation techniques. It works by creating multiple imputed datasets through a series of iterations, where missing values in each variable are imputed using regression models that incorporate the other variables in the dataset. These models are updated iteratively to refine the estimates.


  • Chained Equations: MICE works by imputing missing values in a sequence of steps, where each missing value is replaced using a predictive model based on the other variables. For instance, if a dataset has missing values in both age and income, MICE might first estimate the missing age values based on other variables like education and occupation, then use the newly imputed age values to help impute the missing income values.

  • Multiple Imputations: Instead of creating a single imputed dataset, MICE generates multiple datasets with slightly different imputed values for each missing observation. This process reflects the uncertainty around the true value of the missing data. After imputation, these datasets are analyzed separately, and the results are combined to produce a more robust, reliable final analysis.

  • Flexibility: MICE is highly flexible and can handle a variety of data types (numerical, categorical) and patterns of missingness. It allows for different models to be used for different variables, which can improve accuracy in datasets with diverse types of data.

MICE is particularly effective when dealing with datasets where the relationships between variables are complex, and it helps to account for the uncertainty associated with missing values by creating multiple plausible imputed datasets.
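
In practice, scikit-learn's IterativeImputer provides a chained-equations imputer inspired by MICE. A minimal sketch on a hypothetical age/income matrix; rerunning with different random_state values yields the multiple plausible datasets described above:

```python
import numpy as np
# IterativeImputer is still flagged as experimental in scikit-learn,
# so it has to be enabled explicitly before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric matrix (columns: age, income) with gaps in both.
X = np.array([
    [25.0, 48_000.0],
    [32.0, np.nan],
    [np.nan, 61_000.0],
    [41.0, 72_000.0],
    [55.0, np.nan],
])

# sample_posterior=True draws each imputation from a predictive
# distribution, reflecting the uncertainty around the missing values.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```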

KNN Imputation


KNN[3] imputation is another popular multivariate technique that uses a proximity-based approach to estimate missing values. Unlike MICE, which relies on regression models, KNN imputation estimates missing values based on the values of "neighboring" observations in the dataset. A short code sketch follows the list below.


Example of KNN imputation. Source [4]
  • Neighbor-Based Imputation: KNN imputation works by identifying the K observations (or "neighbors") that are most similar to the observation with the missing value. The missing value is then imputed by averaging (or taking the most frequent category, in the case of categorical data) the values of these neighbors. For numerical data, the neighbors are chosen based on the Euclidean distance or another distance metric that reflects similarity between observations.

  • Using Relationships Between Variables: KNN imputation takes into account the relationships between all variables in the dataset when calculating proximity between observations. For example, if income data is missing for a certain observation, KNN would search for other observations with similar characteristics (such as age, education, and occupation) and use their income values to estimate the missing one.

  • Non-Parametric: One of the main advantages of KNN imputation is that it is non-parametric, meaning it does not make assumptions about the underlying distribution of the data. This makes it particularly useful for datasets with non-linear relationships or variables that do not follow normal distributions.

  • Flexibility Across Data Types: KNN imputation can be applied to both numerical and categorical data, making it versatile across different types of datasets. However, it can become computationally intensive for large datasets, especially when the number of neighbors (K) or the number of variables is large.
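
A minimal sketch with scikit-learn's KNNImputer, which by default measures proximity with a NaN-aware Euclidean distance (the feature matrix is hypothetical):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix; rows are observations.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled with the mean of that feature across the
# two nearest neighbors, found with a NaN-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```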

Conclusion


Effectively handling missing data is essential for ensuring the accuracy and reliability of any analysis, particularly in machine learning and statistical modeling. Univariate imputation methods offer simplicity and efficiency for small, independent datasets, while multivariate approaches like MICE and KNN provide more sophisticated solutions by leveraging the relationships between variables. Choosing the right imputation method depends on the nature of the dataset and the patterns of missingness. Ultimately, by addressing missing data thoughtfully and appropriately, analysts can preserve the integrity of their models and make better-informed decisions based on complete, reliable datasets.


References


[1] Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581-592.


[2] White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine, 30(4), 377-399.


[3] Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4(2), 1883.


[4] K-nearest neighbor, Scholarpedia


[5] Nakagawa, S., & Freckleton, R. P. (2008). Missing inaction: the dangers of ignoring missing data. Trends in Ecology & Evolution, 23(11), 592-596.
