Introduction to Data Visualization

Juan Manuel Ortiz de Zarate
Sep 3, 2024

Data visualization is a critical tool in both professional and academic settings, essential for simplifying complex information and effectively communicating insights. In today's data-driven world, we often face vast amounts of data that can be difficult to interpret. Visualization techniques transform this data into visual formats, making it easier to understand and allowing us to quickly identify patterns, trends, and relationships that might be overlooked through traditional analysis.

One of the primary advantages of data visualization is its ability to enhance comprehension. When data is presented visually, it becomes much easier to detect outliers, trends, and correlations. For example, a scatter plot can illustrate the relationship between two variables, while a heatmap can highlight areas of high or low activity within a dataset. This visual approach enables analysts and decision-makers to interpret data more quickly and accurately.

Additionally, visualization plays a vital role in communication. In many cases, the results of data analysis need to be shared with others, whether they are colleagues, stakeholders, or a broader audience. Visuals such as charts and graphs can convey complex information in a clear and concise manner, making it easier for others to grasp the key points and make informed decisions.

Beyond comprehension and communication, data visualization is also essential for discovering hidden patterns within data. Often, the most valuable insights emerge when data is visualized. For instance, trends over time become clearer in a line graph, and differences between groups may be more apparent in a bar chart. These visual insights can lead to new hypotheses and guide further analysis.

In the realm of data science and research, descriptive analysis is a cornerstone, relying heavily on the aggregation, summarization, and visualization of data. These processes allow for a clearer understanding of large datasets, enabling more effective decision-making and communication.

As datasets become increasingly complex, particularly those involving multiple variables, the role of visualization becomes even more crucial. Traditional methods of data analysis may struggle with such complexity, but advanced visualization techniques can help us explore and understand multidimensional data, even within the limitations of two-dimensional representation.

It's important to recognize that charts and graphs are not just "pretty pictures"; they are powerful tools for conveying information. They can reveal insights that might be missed in raw data or statistical analysis alone. For example, an infographic can simplify a complex report into a more accessible format, making the information easier to understand for a wider audience.

Datasaurus

Alberto Cairo[2], a renowned figure in the field of data visualization, created the "Datasaurus"[1] dataset to emphasize a critical lesson in data analysis: "never trust summary statistics alone; always visualize your data." This dataset is a compelling example that illustrates the potential pitfalls of relying solely on numerical summaries without considering the visual representation of data.

The Datasaurus is usually presented as part of a collection of two-dimensional datasets (the "Datasaurus Dozen", which Justin Matejka and George Fitzmaurice built around Cairo's original). Intriguingly, every dataset in the collection has nearly identical summary statistics: the same means, variances, and correlation. Based on these numbers alone, one might assume that the datasets are nearly identical in nature. The true differences only become apparent when each dataset is drawn as a scatter plot.

All the Datasaurus datasets have the same summary statistics when rounded to the first decimal place

Upon visualizing these datasets, it becomes clear that despite having similar statistical properties, their visual forms are dramatically different. This striking contrast highlights the importance of data visualization in uncovering patterns, structures, and anomalies that summary statistics alone cannot reveal.

The name "Datasaurus" is derived from one of the scatter plots in the dataset, which amusingly forms the shape of a dinosaur. This visual serves as a memorable reminder of the dataset’s core message: even when data appears similar based on statistical measures, its visual representation can tell a very different story. The Datasaurus example underscores the value of combining statistical analysis with visual exploration to gain a complete and accurate understanding of data.
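The same lesson can be checked numerically without the Datasaurus files. The sketch below uses Anscombe's quartet, an older and much smaller collection built on the same idea, as a stand-in (it is not the Datasaurus data). It computes the means, variances, and Pearson correlation of each set using only Python's standard library. The helper names `pearson` and `summarize` are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: four small x/y datasets, used here as a stand-in
# for the Datasaurus collection (this is NOT the Datasaurus data).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Mean x, mean y, variance x, variance y, correlation, rounded to 1 decimal."""
    stats = (mean(xs), mean(ys), variance(xs), variance(ys), pearson(xs, ys))
    return tuple(round(value, 1) for value in stats)


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, stats)
```

All four sets round to the same five numbers, yet a scatter plot of each one looks completely different: a noisy line, a curve, a line with one outlier, and a vertical stack of points with one far-away observation.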

Types of charts

When working with data visualization, it's crucial to select the appropriate type of graph based on the nature of the data and the insights you wish to convey. There are various types of graphs, each suited to different kinds of data and analytical goals. Understanding these categories will help you choose the most effective visualization for your data.

Continuous Distribution Graphs: These graphs are used to represent data that can take on any value within a range, meaning the data points are part of a continuous scale. Common examples include histograms and density plots. These types of graphs are particularly useful for visualizing the distribution of variables like height, weight, or temperature, where the data is not restricted to distinct categories but can vary smoothly across a spectrum.
Discrete Distribution Graphs: In contrast to continuous data, discrete data consists of distinct and separate values, often representing counts or categories. Bar charts and pie charts are typical examples of graphs used to visualize discrete distributions. These graphs are ideal for data that can be counted in whole numbers, such as the number of students in different classes or the frequency of specific categories in a dataset.
Relationship Graphs: Relationship graphs are designed to explore the connections between two or more variables. Scatter plots, bubble charts, and correlation matrices are commonly used in this category. These graphs are particularly valuable when analyzing how one variable might influence another, allowing you to identify correlations, trends, or clusters within the data.
Time Series Graphs: When your data involves variables that change over time, time series graphs are the go-to option. Line charts, area charts, and candlestick charts are typical examples. These graphs are excellent for tracking trends, cycles, and patterns across time periods, whether you're monitoring stock prices, sales figures, or temperature changes over time.

Choosing the right type of graph is essential for effective data visualization, ensuring that the information is presented in a way that is both accurate and easily interpretable by the intended audience. These categories are just one way to classify plots. There are more categories and graph types, but this article covers only the most common ones. You can find more types of graphs in [4].

In the following sections, we will see some examples from each of these four categories, so you will have a set of possible graphs to use on your next data analysis task.

Continuous Distribution Graphs

Histograms

Histograms are a type of bar graph used to represent the distribution of continuous data. Unlike typical bar charts, which display categorical data, histograms group continuous data points into ranges, or "bins," and display the frequency of data points within each bin. This approach allows for a clear visual representation of how data is distributed across different intervals, making it easier to identify patterns such as skewness, modality, and the presence of outliers.

Histogram example over a synthetic dataset of salaries

Through histograms, we can quickly grasp the overall shape of the data distribution. They are particularly useful for spotting whether data is normally distributed, skewed, or if there are any significant gaps or anomalies. Bins in a histogram are the intervals into which the continuous data is grouped. The width of each bin plays a critical role in determining the histogram's appearance and the insights it provides. Narrow bins reveal more detailed information but can result in a cluttered visualization, while wider bins offer a broader view of the data but may mask important details. Choosing the right bin width is essential to accurately representing the data.
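To make the effect of bin width concrete, here is a minimal sketch that bins the same values twice, once with narrow bins and once with wide ones. The salary figures are synthetic (drawn from a log-normal distribution with arbitrary parameters), and the `histogram` helper is an illustrative name, not a library function.

```python
import random
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


random.seed(0)
# Synthetic, made-up salaries: log-normal, so the distribution is right-skewed.
salaries = [round(random.lognormvariate(10.8, 0.4)) for _ in range(500)]

narrow = histogram(salaries, 5_000)   # many bins: detailed but noisy
wide = histogram(salaries, 25_000)    # few bins: smooth but coarse

print(f"{len(narrow)} narrow bins, {len(wide)} wide bins")
for start, count in wide.items():
    print(f"{start:>7,} - {start + 25_000:>7,} | {'#' * (count // 5)}")
```

Both dictionaries describe exactly the same 500 values. Only the grouping changes, and with it how much detail of the skewed shape remains visible.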

Density plots

Density plots serve a similar purpose to histograms but offer a smoothed, continuous curve that estimates the probability density function of the data. This curve provides a fluid visualization of the data’s distribution, making it easier to discern the overall shape and patterns. The smoothing can make the data easier to interpret, but it also comes with the risk of oversimplification. Over-smoothing can hide significant details or create misleading impressions of the data. Therefore, it's important to carefully adjust the amount of smoothing to balance clarity with accuracy.
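One common way to produce such a curve is kernel density estimation with a Gaussian kernel, where a "bandwidth" parameter controls the amount of smoothing. The sketch below is a hand-rolled version for illustration only (plotting libraries ship their own implementations), and the sample data is made up. It counts how many peaks the estimated curve has for a small and a large bandwidth.

```python
from math import exp, pi, sqrt


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(data) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


def count_peaks(density, lo, hi, steps=400):
    """Count local maxima of the density on an evenly spaced grid."""
    ys = [density(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for a, b, c in zip(ys, ys[1:], ys[2:]) if a < b > c)


# Made-up bimodal sample: one cluster around 2 and another around 8.
data = [1.6, 1.8, 2.0, 2.1, 2.3, 2.4, 7.5, 7.8, 8.0, 8.1, 8.3, 8.6]

for bandwidth in (0.5, 4.0):
    peaks = count_peaks(gaussian_kde(data, bandwidth), -5, 15)
    print(f"bandwidth={bandwidth}: {peaks} peak(s)")
```

With the small bandwidth the two clusters should stay visible as separate peaks, while the large bandwidth merges them into a single bump. That is the over-smoothing risk described above.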

Angular density plots are a specialized form of density plots used for circular or directional data, such as time-of-day patterns or compass directions. These plots can help to reveal cyclical patterns but require a good understanding of the data's circular nature for accurate interpretation.

Google searches in the United States on each emotion. Areas show the proportion of total searches that happened each hour. Source: Google Trends [3]

Boxplots

Boxplots, or box-and-whisker plots, offer a succinct visual summary of a dataset’s distribution by highlighting key summary statistics. The central feature of a boxplot is the "box," which spans from the first quartile (Q1) to the third quartile (Q3), capturing the interquartile range (IQR). A line inside the box indicates the median. The "whiskers" extend from the box to the most extreme data points that still lie within a set distance of the quartiles, typically 1.5 times the IQR, although this convention can vary. Any data points beyond the whiskers are plotted individually as outliers, which can indicate anomalies or interesting deviations in the data. Boxplots are especially useful for comparing the distribution of multiple datasets, as they clearly show differences in central tendency, variability, and outliers across groups.

A Boxplot example over a synthetic dataset of salaries
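The numbers behind a boxplot are simple to compute by hand. The following sketch derives the quartiles, the IQR, the whisker ends, and the outliers under the common 1.5 × IQR convention. The salary values are made up, `boxplot_stats` is an illustrative helper, and quartile definitions differ slightly between tools, so a plotting library may report marginally different values.

```python
from statistics import quantiles


def boxplot_stats(values, whisker=1.5):
    """Compute the summary statistics that a boxplot draws."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up salaries in thousands; the last value is deliberately extreme.
salaries = [32, 35, 38, 40, 41, 43, 45, 47, 50, 52, 120]

stats = boxplot_stats(salaries)
for name, value in stats.items():
    print(f"{name:>12}: {value}")
```

Here the value 120 falls above the upper fence, so it would be drawn as an individual outlier point while the upper whisker stops at 52.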

Discrete Distribution Graphs

Pie charts

Pie charts are among the most common types of graphs used to represent discrete data, yet they are also frequently misused. They are designed to show the proportions of a small number of categories within a whole. They are best suited for discrete variables with low cardinality, meaning they should only be used when there are a few distinct categories to represent. This allows the viewer to easily compare the size of each slice and understand the proportion that each category contributes to the total.

Pie chart example of the gender distribution in the US, based on public information from Statista.com [5]

However, pie charts become problematic when used with too many categories. As the number of slices increases, the differences between them become harder to distinguish, leading to confusion and misinterpretation. For this reason, it’s important to avoid using pie charts in cases where the data involves numerous categories or when precise comparison between categories is required.

Bar charts

Bar charts, on the other hand, are more versatile and better suited for representing variables with higher cardinality. Unlike pie charts, bar charts can handle a larger number of categories without sacrificing clarity. Each bar represents a category, and the length or height of the bar corresponds to the value or frequency of that category. This makes it easy to compare the size of different categories directly.

Bar chart example of how GPT affects the productivity, speed, and quality of work. Source: [6]

Moreover, bar charts can be adapted to display the relationship between two variables at once through the use of stacked or grouped bar plots. Stacked bar charts allow you to show the composition of each category in terms of sub-categories, while grouped bar charts place bars side by side to facilitate comparison between groups across different categories. These variations of bar charts provide a flexible and powerful way to visualize more complex relationships within discrete data, making them a preferred choice when dealing with larger and more detailed datasets.

Stacked bar chart example of how AI adoption affects revenue. Source: [6]

Relationship Graphs

Scatter plots

Scatter plots are a fundamental tool for visualizing relationships between two continuous variables. They are used when you want to explore how one variable may affect or relate to another. In a scatter plot, each point on the graph represents an observation in the dataset, with its position determined by the values of the two variables being compared. The x-axis represents one variable, and the y-axis represents the other. Scatter plots are particularly useful for identifying correlations, trends, clusters, and outliers within the data. For example, a scatter plot can show a positive or negative correlation between two variables, such as the relationship between temperature and ice cream sales.

Scatter plot example of LLM prices vs Elo score. Source: [7]

Scatter plots are typically graphed by plotting individual data points on a two-dimensional grid. If a pattern emerges, such as a linear trend or a clear grouping of points, it can suggest a relationship between the variables. Conversely, a scatter plot with no discernible pattern might indicate that the variables are not related. This type of plot is essential for exploratory data analysis, allowing for the quick identification of relationships and guiding further statistical analysis.

Heatmaps

Heatmaps are another powerful tool for exploring relationships, but they are particularly suited for comparing distributions where both axes represent discrete variables, and a third dimension, often represented by color intensity, shows a numerical value. Heatmaps are commonly used to visualize the magnitude of a value within each combination of categories on the x and y axes. This "depth" dimension usually arises from calculating an aggregate statistic, such as a sum, average, or count, for the group corresponding to each cell in the grid.

Heatmap chart example of AI-related roles and hiring by industry sector. Source: [6]

In a heatmap, the grid cells are colored based on the value they represent, with a gradient or distinct colors indicating higher or lower values. This makes it easy to spot areas of high concentration, trends, or anomalies across the two categorical variables. For example, a heatmap might be used to display the frequency of interactions between different categories of users and products, or to show how activity varies across two categorical variables such as day of the week and time of day, with the intensity of the color indicating the strength or frequency of occurrences. Heatmaps are especially effective when you need to analyze complex relationships across multiple dimensions in a visually intuitive way.
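The aggregation step that produces the value behind each cell can be sketched in a few lines. The example below groups hypothetical order records by day and part of day and averages them per cell. The records and the `pivot_mean` helper are invented for illustration. The resulting grid is what a plotting library would then map to colors.

```python
from collections import defaultdict

# Hypothetical records: (day of week, part of day, number of orders).
records = [
    ("Mon", "morning", 12), ("Mon", "evening", 30),
    ("Tue", "morning", 8), ("Tue", "evening", 25),
    ("Mon", "morning", 18), ("Tue", "evening", 35),
    ("Wed", "morning", 10),
]


def pivot_mean(rows):
    """Group rows by (row_key, col_key) and average the values in each cell."""
    cells = defaultdict(list)
    for row_key, col_key, value in rows:
        cells[(row_key, col_key)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}


grid = pivot_mean(records)
days = ["Mon", "Tue", "Wed"]
parts = ["morning", "evening"]

# Print the grid as text; cells with no data show up as nan.
print(f"{'':>5}" + "".join(f"{p:>10}" for p in parts))
for d in days:
    row = "".join(f"{grid.get((d, p), float('nan')):>10.1f}" for p in parts)
    print(f"{d:>5}{row}")
```

Swapping the average for a sum or a count only changes the last line of `pivot_mean`. Cells with no observations (Wednesday evening here) need an explicit choice, such as leaving them blank, rather than silently showing zero.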

Time Series Graphs

Time series data is collected at successive points in time, often at regular intervals, and is crucial for analyzing trends, patterns, and changes over time. Typically visualized with line charts, time series analysis is widely used in fields like finance, economics, and weather forecasting to track and predict variables such as stock prices, sales, or temperature changes.

Seasonal time series are a subset that shows regular, repeating patterns or cycles over specific periods, like days, months, or years. These patterns are driven by factors like weather or holidays, and understanding them is key for accurate forecasting. Recognizing seasonality helps distinguish between true trends and cyclical variations, enabling better decision-making in industries affected by seasonal changes.
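A simple way to separate a trend from a seasonal cycle is to smooth the series with a moving average that spans exactly one season, then average what is left over for each position in the cycle. The sketch below applies this to a synthetic monthly series. The data, the 12-month period, and the helper name are assumptions made for illustration, and dedicated time series libraries offer more robust decompositions.

```python
from math import pi, sin
from statistics import mean

PERIOD = 12  # assumed season length: 12 monthly observations per year

# Synthetic monthly series (made-up): linear trend plus a yearly cycle.
series = [100 + 2 * t + 10 * sin(2 * pi * t / PERIOD) for t in range(48)]


def centered_moving_average(values, period):
    """Centered moving average for an even period; None where the window does not fit."""
    half = period // 2
    out = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half : t + half + 1]
        # The two end points get half weight so the window spans exactly one period.
        out[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    return out


trend = centered_moving_average(series, PERIOD)

# Average the detrended values per month to estimate the seasonal component.
seasonal = []
for month in range(PERIOD):
    leftovers = [
        series[t] - trend[t]
        for t in range(month, len(series), PERIOD)
        if trend[t] is not None
    ]
    seasonal.append(mean(leftovers))

print("trend at t=10:", round(trend[10], 2))
print("seasonal effect by month:", [round(s, 1) for s in seasonal])
```

Because the synthetic series is an exact line plus an exact yearly sine wave, the moving average recovers the line and the monthly averages recover the wave. Real data adds noise, so both estimates would only be approximate.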

Conclusions

In conclusion, data visualization is an essential tool for effectively analyzing and communicating complex information. From understanding distributions with histograms and density plots to exploring relationships with scatter plots and heatmaps, the choice of the right visualization technique is crucial for uncovering meaningful insights. Time series analysis, particularly with seasonal data, highlights the importance of considering temporal patterns in forecasting and decision-making.

Each type of visualization serves a specific purpose, helping to clarify different aspects of data. Whether it’s simplifying large datasets, identifying trends, or comparing categories, the ability to visualize data accurately is key to making informed decisions. By integrating visual tools into data analysis, we not only enhance our understanding but also improve our ability to communicate findings clearly and effectively. As data continues to grow in both volume and complexity, mastering these visualization techniques becomes increasingly important for anyone working with data.