Dimensionality Reduction: Linear methods

Juan Manuel Ortiz de Zarate
Jun 14, 2024
10 min read

Updated: Jul 23, 2024

In the realm of data analysis and machine learning, dealing with high-dimensional data can often be challenging. High-dimensional datasets can be difficult to visualize, process, and analyze effectively. To address this issue, we use dimensionality reduction techniques. These techniques aim to project high-dimensional data onto a lower-dimensional space while retaining as much important information as possible. This process simplifies the data, making it easier to work with and understand.

When performing dimensionality reduction, our objective is not only to reduce the number of dimensions but to do so while preserving some form of optimality. This optimality refers to maintaining critical properties of the original dataset, such as variance, distances between data points, or class separability, depending on the specific method used.

There are primarily two main approaches to dimensionality reduction:

Feature Selection: This approach involves selecting a subset of the original features (dimensions) of the data based on certain criteria. The goal is to retain the most informative features while discarding the less important ones. Feature selection methods include techniques such as filter methods, wrapper methods, and embedded methods. Each of these methods has its own way of evaluating the importance of features and selecting the optimal subset.

Feature Extraction: Unlike feature selection, feature extraction creates new features by transforming the original data. These new features are combinations or projections of the original features and are designed to capture the most significant variations in the data.

This second approach is what we will be diving into in this article.

Why would I want to reduce the dimensionality?

These techniques serve several important purposes in data analysis and machine learning. The main objectives of employing these techniques include:

Visualization: High-dimensional data is challenging to visualize directly. By reducing the number of dimensions, we can project the data into 2D or 3D spaces, making it easier to visualize and understand the underlying patterns and relationships. Effective visualization helps in exploratory data analysis and gaining intuitive insights.

Noise Reduction: Huge datasets often contain noise, which can obscure the underlying structure of the data. Dimensionality reduction can help filter out this noise, focusing on the most significant features and thus improving the quality of the data.

Data Regularization: Reducing the dimensionality can act as a form of regularization, preventing overfitting in machine learning models. By simplifying the data, we reduce the risk of models capturing noise and irrelevant details, leading to better generalization of new, unseen data.

Information Compression: They allow for compressing the information content of a dataset. This is particularly useful for storage efficiency and speeding up data processing. Compressed data retains essential information while occupying less space, making it easier to handle and transmit.

Computational Efficiency: Complex datasets data require significant computational resources for processing and analysis. By reducing the number of dimensions, we decrease the computational load, making it more feasible to apply complex algorithms and models. This efficiency is crucial for handling large datasets and real-time applications.

Beyond these primary objectives, they can aid in other tasks such as feature engineering, anomaly detection, and improving the performance of clustering algorithms. Each application might emphasize different aspects of dimensionality reduction based on the specific requirements and goals of the analysis.

World Maps

A familiar example of dimensionality reduction is the creation of world maps. The Earth is a three-dimensional sphere, and projecting this 3D surface onto a 2D plane inevitably involves some form of dimensionality reduction. In this process, certain distortions or loss of information are unavoidable. These distortions manifest in various ways, such as the sizes and shapes of countries and continents.

A world map is a dimensional reduction of Earth.

For instance, in many commonly used world maps, countries closer to the poles, such as Greenland and Russia, appear significantly larger than they are in reality [1]. This distortion occurs because the projection methods used to flatten the globe into a 2D map cannot perfectly preserve all geographic properties. As world maps are created through dimensionality reduction, they strive to retain important characteristics while reducing dimensions. So, projections aim to preserve essential geographical features but must balance compromises in area, shape, distance, and direction.

We will now delve into some of the most popular techniques used to achieve this. These methods each offer unique approaches to reducing dimensions, balancing the trade-offs between preserving important data characteristics and simplifying the dataset. Let’s explore these techniques in detail, highlighting their principles, applications, and advantages.

Principal Component Analysis

Principal Component Analysis (PCA) is one of the most widely used techniques for dimensionality reduction. Given a variable X = (x1, x2,…,xp), the goal is to find a set of orthogonal linear projections of X such that these projections are ordered in descending order according to their variance. This means that the first projection captures the maximum variance in the data, the second projection captures the next highest variance, and so on.

Disclosures

Variance as a Measure of Information: Here, variance is used to quantify the amount of information in the data. The higher the variance of a projection, the more information it retains. Therefore, variance serves as a criterion for determining the optimality of the projections.
Orthogonal Projections: The output involves a rotation or change of the coordinate system such that the projections (principal components) are uncorrelated. This orthogonality ensures that each principal component captures a unique aspect of the data's variance.

The core idea behind this technique is to rotate the dataset in a way that maximizes the variance of the data in orthogonal projections. By doing so, PCA identifies the directions (principal components) along which the data varies the most. These directions are then used to re-express the data in a lower-dimensional space while retaining as much information as possible.

Goals of PCA at inferring dimension to project on. Source [2]

Assumption

PCA operates under the assumption that the data predominantly lies in a lower-dimensional linear subspace. This means that while the original data may exist in a high-dimensional space, the essential structure and variability of the data can be captured in a space with fewer dimensions.

In the following sections, we will explore the mathematical foundations of PCA, the process of computing the principal components, and its practical applications.

Calculation

The math behind this involves identifying the directions in which the variance of the data is maximized. These directions are found by analyzing the covariance matrix of the data. First, compute the covariance matrix, Σ, of the dataset. It captures the pairwise relationships between the variables in the data. Then, to find the directions that maximize variance, solve the eigenvalue problem for Σ. This involves finding the eigenvectors and corresponding eigenvalues of Σ.

The direction in which the variance is maximized is given by the eigenvector u of Σ with the largest associated eigenvalue λ. This is because the eigenvalue λ represents the amount of variance in the data when projected onto the direction of its corresponding eigenvector u. Therefore, the largest eigenvalue indicates the direction where the data has the highest variance, making its eigenvector the optimal direction for capturing the most significant variance in the data.

So, u represents the first principal component, and λ quantifies the amount of variance explained by this principal component.

Eigenvectors and eigenvalues represents the direction and magnitude of the principal components

Once the eigenvectors and eigenvalues are obtained, the eigenvectors (principal components) are ordered by their corresponding eigenvalues in descending order. The first principal component has the highest eigenvalue, thus explaining the most variance, followed by the second principal component, and so on.

After obtaining all principal components, select the top N components that explain the most variance. The number N is chosen based on how much total variance needs to be retained. Typically, this involves retaining enough components to capture a significant percentage of the total variance, such as 95% or 99%.

Finally, project the original data onto the selected N principal components. This projection transforms the data into a lower-dimensional space, preserving the maximum variance and thus retaining as much of the original information as possible.

Projection and Reconstruction

After performing Principal Component Analysis (PCA) and identifying the principal components, the next steps involve projecting the original data onto these components and potentially reconstructing the original data from the reduced dimensions.

Consider our data matrix X, which has dimensions n x p (where n is the number of samples and p is the number of features). Apply PCA to obtain the k principal components. Let V be the resulting matrix of eigenvectors, with dimensions p x k.

To project the original data X onto the k-dimensional subspace spanned by the principal components, the projected data matrix Z is given by Z = XV. Here, Z has dimensions n x k, representing the data in the reduced k-dimensional space.

To reconstruct the original data matrix X from the reduced data matrix Z, use the transpose of the eigenvector matrix V. The reconstructed data matrix X’ is given by

Here, transposed V has dimensions k x p, and X’ will have dimensions n x p, matching the original data matrix.

If the total variance captured by the selected k components is less than 100% of the original variance (which is usually the case), the reconstruction X’ will not be perfect. Some information is inevitably lost during dimensionality reduction, leading to an approximation rather than an exact reconstruction. The quality of the reconstruction depends on how much of the total variance is captured by the chosen principal components.

Singular Value Decomposition

Singular Value Decomposition (SVD) is another powerful technique for dimensionality reduction that is widely used in data analysis and machine learning. It decomposes a matrix into three other matrices, capturing the essential patterns and structures within the original data. While PCA is a statistical technique focused on maximizing variance, SVD is a linear algebra technique that directly decomposes the original data matrix. It does not require the calculation of the covariance matrix.

The goal of SVD is to maximize the "information" captured by the singular values, where the “information” is defined as the sum of the squared singular values. The singular values represent the magnitude of the data along the corresponding singular vectors. Therefore, selecting the top k singular values and vectors ensures that the reconstruction retains the maximum possible energy of the original matrix within the kk-dimensional subspace.

Calculation

First, we perform matrix decomposition. Given the data matrix X, the goal is to find the matrices U,Σ, and transposed V. This can be done through a factorization process, the most famous algorithm to do it is the Golub-Reinsch[3], which is a direct method that doesn’t do any iteration. Once the decomposition is complete, matrix Σ contains the singular values on its diagonal, sorted in descending order and the matrices U and

contain the left and right singular vectors, respectively.

Next, we select the top k singular values from Σ and their corresponding singular vectors from U and V, similar to how we choose eigenvalues and vectors in PCA. The value of k is chosen based on the desired level of dimensionality reduction and the amount of information to retain. This involves retaining the largest singular values, which capture the most significant patterns in the data. So, to form the reduced representation of the data, we use the top k singular values and vectors. The reduced data matrix X_k can be constructed as:

where U_k contains the first k columns of U and Σ_k is a k x k diagonal matrix with the top k singular values.

Projection and Reconstruction

Once the data is reduced using SVD, it can be reconstructed by multiplying the calculated matrices. The reconstruction process is like projection but we also multiply it by the third (k-truncated) matrix:

As with PCA, if k is less than the original number of dimensions, the reconstruction will not be perfect but will retain the most significant information captured by the top k components. In the following figure you can see an example of an image reduced and reconstructed with SVD where we only keep the first 50 singular components.

SVD reduction and reconstruction over an image

Properties

Both methods are linear dimensionality reduction, do they share many aspects. Indeed, PCA can be viewed as a special case of SVD. Specifically, PCA is typically performed on the covariance matrix of the data, which is a symmetric matrix. When you center the data matrix X (subtract the mean of each feature) and then perform SVD on this centered data matrix, the left singular vectors U correspond to the principal component directions, and the singular values in Σ are related to the eigenvalues of the covariance matrix.So, these methods have several strengths and weaknesses in common that make them suitable for specific applications while limiting its utility in others.

Strengths

No Need to Set Hyperparameters: they do not require the user to set hyperparameters, which simplifies its application and reduces the risk of suboptimal parameter choices.
No Iterations: Unlike many optimization algorithms, these techniques do not involve iterative procedures. This means they can be computationally efficient and straightforward to implement.
No Local Optima: They guarantees a global optimum since it relies on solving an eigenvalue problem. This avoids the pitfalls of algorithms that may converge to local optima, ensuring consistent and reliable results.

Weakness

Restriction to Linear Projections:
- The primary limitation of PCA and SVD is their restriction to linear projections. They can only capture linear relationships in the data. If the underlying structure of the data is non-linear, these methods might fail to effectively reduce the dimensionality while preserving important features.
- For instance, if the data lies on a curved manifold in the high-dimensional space, linear methods like PCA and SVD will not be able to uncover this structure. They would only find the best linear approximation, which might not be sufficient for capturing the essential characteristics of the data.

Conclusions

We have examined the two most widely used linear methods for dimensionality reduction: PCA and SVD. PCA is primarily used when the goal is to maximize the variance captured by the principal components, making it ideal for data with linear relationships where a straightforward interpretation of the importance of each component is desired. It is commonly applied in fields such as financial analysis [4] for risk management and portfolio optimization, as well as in data visualization to simplify the complexity of high-dimensional data.

On the other hand, SVD is a more general approach that applies directly to the data matrix and is particularly useful in natural language processing (NLP). For instance, SVD is used in topic analysis and latent semantic analysis (LSA)[5] to uncover the underlying structure in text data by identifying patterns and relationships between terms and documents.

Although both methods are powerful, they have limitations when dealing with data that exhibit non-linear relationships. In our next article, we will explore non-linear dimensionality reduction methods that allow us to overcome these limitations and capture more complex structures in the data. These methods will provide additional tools for effectively and robustly working with high-dimensional data.