In the quest to understand and interpret high-dimensional data, dimensionality reduction plays a pivotal role. High-dimensional data presents a unique set of challenges; each additional dimension can exponentially increase the complexity, making visualization and analysis daunting. As we saw in our previous article[__2__], traditional linear methods, such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), project data into lower dimensions while preserving as much variance or information as possible. However, these methods often fall short when dealing with the intricate, non-linear structures inherent in many real-world datasets.

This is where non-linear dimensionality reduction techniques come into play. Unlike linear methods, non-linear techniques can capture and represent complex relationships within high-dimensional spaces. These methods are adept at preserving the true geometric structure of the data, making them invaluable for tasks where the underlying manifold structure is non-linear and intricate. By leveraging these advanced techniques, we can achieve a more accurate and insightful representation of high-dimensional datasets.

In this article, we delve into three prominent non-linear dimensionality reduction techniques: Multidimensional Scaling (MDS), Isometric Mapping (Isomap), and t-Distributed Stochastic Neighbor Embedding (t-SNE). Each of these methods brings a unique approach to uncovering the underlying patterns and structures within data, offering powerful tools for data scientists and machine learning practitioners.

MDS provides a way to maintain distances between data points in the reduced dimension, making it an excellent choice for exploratory data analysis. Isomap extends this approach by incorporating the geodesic distance between data points, which is particularly effective for unfolding non-linear manifolds into lower dimensions. Meanwhile, t-SNE is specifically designed for creating visually interpretable maps from high-dimensional data, capturing local structures within the data effectively.

Through a detailed exploration of these methods, we will uncover how they work, their applications, and the advantages they offer in handling complex datasets. By understanding these non-linear methods, we can better appreciate their ability to reveal hidden structures in high-dimensional data, leading to more insightful analyses and informed decision-making.

Join us as we navigate the fascinating landscape of non-linear dimensionality reduction, exploring the mathematical foundations, practical implementations, and the transformative potential these techniques hold for data science and beyond. By the end of this article, you will gain a comprehensive understanding of how non-linear dimensionality reduction can enhance your ability to analyze and interpret complex datasets, paving the way for new discoveries and advancements in various fields.

## Multidimensional scaling

It is another prominent technique for dimensionality reduction, with a different core objective compared to PCA. The primary idea behind MDS is to preserve the distances between points in the high-dimensional space when projecting them onto a lower-dimensional space.

The goal of this technique is to place the points in a lower-dimensional space such that the pairwise distances between points in this space are as close as possible to the pairwise distances in the original high-dimensional space. This preservation of distances helps in maintaining the structure of the data, making MDS particularly useful for visualization and understanding the inherent relationships in the dataset.

### Calculation

Given a data matrix X with *r* rows (samples) and *p* columns (features), the goal is to find a lower-dimensional representation that preserves the distances between the data points as similar as possible concerning the original dimension.

First, calculate the distance matrix D, where each element Dij represents the distance between the data points Xi and Xj. Typically, the Euclidean distance is used, but other metrics can also be applied depending on the specific needs of the analysis.

Next, using the distance matrix D, we aim to find a new matrix M with *r* rows and *n* columns, where *n* is less than *p*. The objective is to ensure that the distances between points in this new n-dimensional space are as close as possible to the original distances in D.

The process of finding the lower-dimensional matrix M is iterative, as it involves optimizing the positions of the points in the reduced space to best match the original distances. This optimization often uses algorithms like stress majorization[__1__], which iteratively adjusts the positions to reduce the difference between the distances in the high-dimensional and low-dimensional spaces. The iterative nature of this optimization ensures that the final configuration of points in M closely preserves the original distance relationships.

### Example

To illustrate the capabilities of MDS, let's consider a practical example: reconstructing the geographic layout of cities in the United States using only the distances between them. This example demonstrates how MDS can transform a matrix of distances into a meaningful 2-dimensional representation, revealing the spatial relationships between cities.

Using the distance matrix as input, MDS algorithm places each city in a 2-dimensional space in such a way that the pairwise distances between the cities in this 2D space approximate the original distances in the matrix. The algorithm works iteratively to minimize the difference between the original distances and the distances in the reduced space.

The resulting 2-dimensional plot shows New York and Boston close together, reflecting their actual geographic proximity, while Los Angeles would be positioned far from these cities, accurately representing the distance. The plot provides a visual approximation of the cities’ layout, offering insights into their relative distances and geographic relationships.

### Properties

MDS is a versatile and robust technique for dimensionality reduction, characterized by several notable strengths and some inherent weaknesses. Understanding these properties is crucial for effectively applying MDS to various datasets and interpreting the results accurately.

#### Strengths

__Supports Various Types of Distances__: MDS is flexible in its ability to handle different types of distance measures. Whether working with Euclidean distances, Manhattan distances, or more complex dissimilarity measures, MDS can process and transform them into a lower-dimensional space. This adaptability makes it suitable for a wide range of applications, from geographic data to genetic sequences, where the nature of the distance measure can vary significantly.__Enables Non-Linear Transformations__: Unlike linear dimensionality reduction techniques, MDS is capable of capturing non-linear relationships within the data. By focusing on preserving the pairwise distances between data points, MDS can reveal the underlying structure of the data that may not be apparent through linear transformations. This ability is particularly valuable when dealing with complex datasets where the relationships between variables are inherently non-linear.

#### Weaknesses

__Iterative Optimization with Local Minima__: The process of finding the optimal configuration in MDS involves iterative optimization, which can sometimes converge to local minima rather than the global minimum. This means that the solution found by MDS might not be the best possible representation of the data in lower dimensions. The quality of the result can be influenced by the initial configuration and the specific optimization algorithm used.__Difficulty in Determining the Best Distance Measure__: One of the challenges in applying MDS is selecting the most appropriate distance measure for a given dataset. Different distance measures can lead to different configurations, and there is often no clear criterion for choosing the best one. This can make the application of MDS somewhat subjective and dependent on domain knowledge and experimentation.

## Isometric Mapping

It extends the principles of MDS by preserving the intrinsic geometry of the data. It does this by utilizing the geodesic distance rather than the straight-line Euclidean distance between points. This approach allows Isomap to effectively handle complex, non-linear structures in the data by mapping them into a lower-dimensional space while maintaining their true geometric properties.

The core idea behind Isomap is to preserve the geometry of the underlying manifold on which the data lies. This is achieved by calculating the geodesic distances between data points, which are the shortest paths along the manifold, rather than the direct Euclidean distances. By doing so, Isomap captures the true structure of the data, allowing for more accurate dimensionality reduction.

### Calculation

The algorithm is a widely used technique for manifold learning, which aims to reduce the dimensionality of data while preserving the geometric structure. Here's the step-by-step pseudocode for the Isomap algorithm:

__Determine the nearest neighbors for each point__: For each data point, find its*k*nearest neighbors based on Euclidean distance.__Construct a neighborhood graph__: Create a graph where each point is connected to its nearest neighbors with edges weighted by the Euclidean distance.__Compute the shortest path between all pairs of points__: Use the Floyd-Warshall[__5__] algorithm to find the shortest paths between all pairs of points in the graph.__Generate a distance matrix and apply MDS__: Construct a distance matrix from the shortest paths and apply MDS to the distance matrix to obtain the low-dimensional embedding.

#### Pseudocode

`# Step 1: Determine the nearest neighbors for each point`

`for each point p in points:`

` neighbors[p] = find_nearest_neighbors(p, points, k) # k is the number of neighbors`

`# Step 2: Construct a neighborhood graph`

`graph = initialize_empty_graph()`

`for each point p in points:`

` for each neighbor in neighbors[p]:`

` distance = euclidean_distance(p, neighbor)`

` graph.add_edge(p, neighbor, distance)`

`# Step 3: Compute the shortest path between all pairs of points using Floyd-Warshall algorithm`

`shortest_paths = floyd_warshall(graph)`

`# Step 4: Generate a distance matrix and apply Multidimensional Scaling (MDS)`

`distance_matrix = initialize_empty_matrix(num_points, num_points)`

`for each point p1 in points:`

` for each point p2 in points:`

` distance_matrix[p1][p2] = shortest_paths[p1][p2]`

`mds_result = apply_mds(distance_matrix)`

### Example

To illustrate how Isomap works, let's refer to the following figure:

__Panel A__: This panel shows a 3-dimensional dataset with a spiral structure. The distance between two points can be measured either directly through the space (dotted line) or along the manifold (solid line). The direct Euclidean distance (dotted line) fails to capture the actual relationship between points on the manifold.__Panel B__: Here, the red line represents the geodesic distance, which follows the structure of the manifold. This distance is computed by hopping from one neighboring node to another, effectively tracing the shortest path along the curved surface of the manifold.__Panel C__: After applying Isomap, the data is reduced to two dimensions. This 2D representation maintains the geometric relationships captured by the geodesic distances. The result is a flattened but accurate depiction of the original manifold, preserving the essential structure and distances between points.

By preserving the manifold's geometry, Isomap provides a meaningful low-dimensional representation that reflects the true nature of the data.

### Properties

The Isomap algorithm is a powerful technique for manifold learning that comes with its own set of strengths and weaknesses. Here we explore the properties of the Isomap algorithm in terms of its strengths and weaknesses.

#### Strengths

__Preserves Manifold Structure__: Isomap excels at maintaining the intrinsic geometric structure of the manifold. By using geodesic distances instead of straight-line Euclidean distances, Isomap can capture the true underlying shape of the data. This is particularly beneficial for data that lies on a curved manifold, allowing for a more accurate low-dimensional representation.__Enables Non-Linear Transformations__: As MDS, it can handle non-linear transformations.

#### Weaknesses

__Requires Determination of____k____Nearest Neighbors__: One of the main challenges with Isomap is the need to specify the number of nearest neighbors,*k*, for constructing the neighborhood graph. The choice of*k*can significantly impact the results. A value too small might result in a disconnected graph, while a value too large might ignore the local structure.__Sensitive to Outliers and Noise__: Isomap is quite sensitive to outliers and noise in the data. Outliers can distort the neighborhood graph and subsequently the computed geodesic distances, leading to an inaccurate embedding. Similarly, noise can affect the distance calculations, compromising the quality of the manifold representation.

## t-SNE

It is a powerful technique for dimensionality reduction, particularly well-suited for visualizing high-dimensional data. Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE has gained significant popularity due to its ability to reveal intricate structures in data that are often hidden in high-dimensional spaces. Like Isomap and Multidimensional Scaling (MDS), t-SNE is a manifold learning technique designed to preserve the structure of high-dimensional data when mapped to a lower-dimensional space. These methods share several core similarities.

First, all three techniques aim to maintain the relationships between data points in the high-dimensional space when projecting them into a lower-dimensional space. This makes them effective for visualizing complex datasets, revealing clusters, and identifying patterns. Second, t-SNE, Isomap, and MDS are capable of capturing non-linear relationships within the data. Third, while MDS focuses on preserving global distances and Isomap uses geodesic distances to capture the global manifold structure, t-SNE emphasizes preserving local relationships.

Despite these similarities, t-SNE differs significantly from Isomap and MDS in its approach and application. It uses a probabilistic approach to minimize the divergence between two distributions: one representing pairwise similarities in the high-dimensional space and the other in the low-dimensional space. This focus on local similarities allows t-SNE to create more interpretable and visually appealing maps, especially when dealing with large and complex datasets.

### Calculation

A crucial aspect of t-SNE is the concept of perplexity, which plays a significant role in determining the balance between local and global aspects of the data during the dimensionality reduction process.

To begin, t-SNE calculates pairwise similarities between data points in the high-dimensional space. This is done by converting the Euclidean distances into conditional probabilities that represent the likelihood of one point being a neighbor of another. Here, the perplexity parameter is introduced, which can be thought of as a smooth measure of the effective number of neighbors for each point. Mathematically, perplexity is defined as

𝐻(𝑃) is the Shannon entropy [__6__] of the conditional probabilities. Perplexity is chosen by the user and typically ranges between 5 and 50 [__7__]. It influences the bandwidth of the Gaussian kernels used to calculate these probabilities, thereby controlling how local or global the neighborhood structure will be.

Once the high-dimensional pairwise similarities are calculated, it initializes the points in the lower-dimensional space, typically using random initialization. The next step involves defining a similar probability distribution in this lower-dimensional space. This step is similar to what MDS and Isomap do, create a new matrix for the reduced space and make it similar to the original one.

The core of t-SNE is an optimization process that minimizes the Kullback-Leibler (KL) divergence between the high-dimensional and low-dimensional probability distributions. This is achieved using gradient descent. The algorithm iteratively adjusts the positions of points in the lower-dimensional space to reduce the divergence, ensuring that points that are close neighbors in the high-dimensional space remain close in the low-dimensional space, and vice versa. The perplexity parameter significantly affects this process by determining the effective neighborhood size, striking a balance between focusing on very local structures and incorporating broader global information.

### Example

The following image illustrates the application of t-SNE to the MNIST dataset [__8__], a widely used benchmark in machine learning consisting of 70,000 images of handwritten digits (0-9). Each data point in the original high-dimensional space represents the pixel values of a 28x28 grayscale image. The goal of applying t-SNE to this dataset is to reduce its dimensionality while preserving the relationships between data points, allowing us to visualize the inherent structure of the dataset in a two-dimensional space.

In the visualization, each point represents a single image from the MNIST dataset, and the color of each point corresponds to the digit label (0-9). The distinct clusters observed in the image indicate that t-SNE has successfully captured the local and global structures of the data. Points representing the same digit are grouped together, forming clear and separate clusters for each digit. This clustering occurs because t-SNE focuses on preserving the local similarities and the neighborhood relationships of the data points.

### Properties

As the previous methods, it has its own strengths and weaknesses.

#### Strengths

__Excellent for Data Visualization__: t-SNE is renowned for its ability to create clear and interpretable visualizations of complex datasets.__Preserves Global and Local Non-Linear Structures__: One of the key advantages of t-SNE is its ability to maintain both global and local non-linear structures within the data.

#### Weaknesses

__Stochastic Nature__: t-SNE is inherently stochastic, meaning that different runs of the algorithm with the same data and parameters can produce different results. This variability can make it challenging to achieve consistent visualizations and requires careful consideration when interpreting the results.__Scalability Issues__: The algorithm's computational complexity increases significantly with the number of data points and the dimensionality of the data. As a result, t-SNE can be time-consuming and resource-intensive for very large datasets, making it less practical for real-time applications or very large-scale data.__Inapplicability to New Data Points__: A notable limitation of t-SNE is that it does not provide a straightforward way to embed new data points into an existing visualization. Each new point would require running the algorithm again with the entire dataset, which is computationally expensive and impractical for dynamic datasets.

## Conclusions

In this article, we explored various non-linear dimensionality reduction techniques: MDS, Isomap, and t-SNE. Each of these methods offers unique strengths and addresses different challenges associated with visualizing and analyzing high-dimensional data.

MDS was our starting point, providing a foundation for understanding how distances between data points can be preserved when projecting them into a lower-dimensional space. MDS is particularly useful for its straightforward approach to preserving global distance relationships, making it a versatile tool for a variety of applications. However, its linear nature can limit its effectiveness for data that lies on a non-linear manifold.

Isomap extends the capabilities of MDS by incorporating geodesic distances, thus preserving the intrinsic geometry of data that resides on a non-linear manifold. Through its use of nearest neighbors and shortest path algorithms, Isomap captures both local and global structures more effectively than MDS. This makes it a valuable technique for datasets where the underlying structure is complex and non-linear. However, Isomap's sensitivity to the choice of neighbors and to outliers can pose challenges.

t-SNE takes a different approach by focusing on preserving local similarities while allowing for non-linear transformations. It excels in creating visually interpretable maps that reveal clusters and patterns in the data, making it one of the best techniques for visualizing high-dimensional data. The use of perplexity to balance local and global relationships is a notable feature of t-SNE. Despite its strengths, t-SNE's stochastic nature, scalability issues, and inability to handle new data points without re-computing the entire embedding are significant limitations.

In conclusion, non-linear dimensionality reduction techniques like these ones provide powerful tools for visualizing and understanding high-dimensional data. Each method has its own strengths and weaknesses, making them suitable for different types of data and specific analytical needs. Understanding these techniques and their properties enables data scientists and researchers to select the most appropriate method for their data, ultimately enhancing the insights and interpretations derived from complex datasets. As the field of data visualization continues to evolve, these techniques will remain essential components of the data analyst's toolkit.

## References

[1] Stress majorization __https://en.wikipedia.org/wiki/Stress_majorization__

[2] Dimensionality Reduction: Linear methods __https://www.transcendent-ai.com/post/dimensionality-reduction-linear-methods__

[3] MITx: Statistics, Computation & Applications

[4] Wang, H., Cai, T., Cheng, D., Li, K., & Zhou, Y. (2024). A Classification Algorithm Based on Improved Locally Linear Embedding. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 18(1), 1-9.

[5] Floyd, R. W. (1962). Algorithm 97: shortest path. Communications of the ACM, 5(6), 345-345.

[6] Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review, 5(1), 3-55.

[7] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).

[8] Recognizing hand-written digits __https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html__

[9] Wu, J. (2023). Music Recommendation Index Evaluation Based on Logistic Distribution Fitting Transition Probability Function. Applied Mathematics and Nonlinear Sciences, 8(1), 1769-1776.

## Comments