A Data Exploration and Visualization
The following code covers the important part of data exploration. For space reasons, this chapter was moved from the printed textbook to this Data Exploration Web Chapter.
Packages Used in this Chapter
pkgs <- c("arules", "GGally",
"ggcorrplot", "hexbin", "palmerpenguins", "plotly", "seriation", "tidyverse")
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)
The packages used for this chapter are:
- arules (Hahsler et al. 2024)
- GGally (Schloerke et al. 2024)
- ggcorrplot (Kassambara 2023)
- hexbin (Carr, Lewin-Koh, and Maechler 2024)
- palmerpenguins (Horst, Hill, and Gorman 2022)
- plotly (Sievert et al. 2024)
- seriation (Hahsler, Buchta, and Hornik 2024)
- tidyverse (Wickham 2023c)
We will use again the iris dataset.
library(tidyverse)
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
A.1 Exploring Data
A.1.1 Basic statistics
Get summary statistics (using base R)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Get mean and standard deviation for sepal length.
iris |>
summarize(avg_Sepal.Length = mean(Sepal.Length),
sd_Sepal.Length = sd(Sepal.Length))
## # A tibble: 1 × 2
## avg_Sepal.Length sd_Sepal.Length
## <dbl> <dbl>
## 1 5.84 0.828
Data with missing values will result in statistics of NA
. Adding the
parameter na.rm = TRUE
can be used in most statistics functions to
ignore missing values.
Outliers are typically the smallest or the largest values of a feature. To make the mean more robust against outliers, we can trim 10% of observations from each end of the distribution.
iris |>
summarize(
avg_Sepal.Length = mean(Sepal.Length),
trimmed_avg_Sepal.Length = mean(Sepal.Length, trim = .1)
)
## # A tibble: 1 × 2
## avg_Sepal.Length trimmed_avg_Sepal.Length
## <dbl> <dbl>
## 1 5.84 5.81
Sepal length does not have outliers, so the trimmed mean is almost identical.
To calculate a summary for a set of features (e.g., all numeric
features), tidyverse provides across(where(is.numeric), fun)
.
iris |> summarize(across(where(is.numeric), mean))
## # A tibble: 1 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.84 3.06 3.76 1.20
iris |> summarize(across(where(is.numeric), sd))
## # A tibble: 1 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 0.828 0.436 1.77 0.762
iris |> summarize(across(where(is.numeric),
list(min = min,
median = median,
max = max)))
## # A tibble: 1 × 12
## Sepal.Length_min Sepal.Length_median Sepal.Length_max
## <dbl> <dbl> <dbl>
## 1 4.3 5.8 7.9
## # ℹ 9 more variables: Sepal.Width_min <dbl>,
## # Sepal.Width_median <dbl>, Sepal.Width_max <dbl>,
## # Petal.Length_min <dbl>, Petal.Length_median <dbl>,
## # Petal.Length_max <dbl>, Petal.Width_min <dbl>,
## # Petal.Width_median <dbl>, Petal.Width_max <dbl>
The median absolute deviation (MAD) is another measure of dispersion.
A.1.2 Grouped Operations and Calculations
We can use the nominal feature to form groups and then calculate group-wise statistics for the continuous features. We often use group-wise averages to see if they differ between groups.
iris |>
group_by(Species) |>
summarize(across(Sepal.Length, mean))
## # A tibble: 3 × 2
## Species Sepal.Length
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
iris |>
group_by(Species) |>
summarize(across(where(is.numeric), mean))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versico… 5.94 2.77 4.26 1.33
## 3 virgini… 6.59 2.97 5.55 2.03
We see that the species Virginica has the highest average for all, but Sepal.Width.
The statistical difference between the groups can be tested using ANOVA (analysis of variance).
res.aov <- aov(Sepal.Length ~ Species, data = iris)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.2 31.61 119 <2e-16 ***
## Residuals 147 39.0 0.27
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862 1.1738 0
## virginica-setosa 1.582 1.3382 1.8258 0
## virginica-versicolor 0.652 0.4082 0.8958 0
The summary shows that there is a significant difference for
Sepal.Length between the groups. TukeyHDS
evaluates differences
between pairs of groups. In this case, all are significantly different.
If the data only contains two groups, the t.test
can be used.
A.1.3 Tabulate data
We can count the number of flowers for each species.
iris |>
group_by(Species) |>
summarize(n())
## # A tibble: 3 × 2
## Species `n()`
## <fct> <int>
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
In base R, this can be also done using count(iris$Species)
.
For the following examples, we discretize the data using cut.
iris_ord <- iris |>
mutate(across(where(is.numeric),
function(x) cut(x, 3, labels = c("short", "medium", "long"),
ordered = TRUE)))
iris_ord
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <ord> <ord> <ord> <ord> <fct>
## 1 short medium short short setosa
## 2 short medium short short setosa
## 3 short medium short short setosa
## 4 short medium short short setosa
## 5 short medium short short setosa
## 6 short long short short setosa
## 7 short medium short short setosa
## 8 short medium short short setosa
## 9 short medium short short setosa
## 10 short medium short short setosa
## # ℹ 140 more rows
summary(iris_ord)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## short :59 short :47 short :50 short :50
## medium:71 medium:88 medium:54 medium:54
## long :20 long :15 long :46 long :46
## Species
## setosa :50
## versicolor:50
## virginica :50
Cross tabulation is used to find out if two discrete features are related.
tbl <- iris_ord |>
select(Sepal.Length, Species) |>
table()
tbl
## Species
## Sepal.Length setosa versicolor virginica
## short 47 11 1
## medium 3 36 32
## long 0 3 17
The table contains the number of rows that contain the combination of values (e.g., the number of flowers with a short Sepal.Length and species Setosa is 47). If a few cells have very large counts and most others have very low counts, then there might be a relationship. For the iris data, we see that species Setosa has mostly a short Sepal.Length, while Versicolor and Virginica have longer sepals.
Creating a cross table with tidyverse is a little more involved and uses pivot operations and grouping.
iris_ord |>
select(Species, Sepal.Length) |>
### Relationship Between Nominal and Ordinal Features
pivot_longer(cols = Sepal.Length) |>
group_by(Species, value) |>
count() |>
ungroup() |>
pivot_wider(names_from = Species, values_from = n)
## # A tibble: 3 × 4
## value setosa versicolor virginica
## <ord> <int> <int> <int>
## 1 short 47 11 1
## 2 medium 3 36 32
## 3 long NA 3 17
We can use a statistical test to determine if there is a significant relationship between the two features. Pearson’s chi-squared test for independence is performed with the null hypothesis that the joint distribution of the cell counts in a 2-dimensional contingency table is the product of the row and column marginals. The null hypothesis h0 is independence between rows and columns.
tbl |>
chisq.test()
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 112, df = 4, p-value <2e-16
The small p-value indicates that the null hypothesis of independence needs to be rejected. For small counts (cells with counts <5), Fisher’s exact test is better.
fisher.test(tbl)
##
## Fisher's Exact Test for Count Data
##
## data: tbl
## p-value <2e-16
## alternative hypothesis: two.sided
A.1.4 Percentiles (Quantiles)
Quantiles are cutting points dividing the range of a probability distribution into continuous intervals with equal probability. For example, the median is the empirical 50% quantile dividing the observations into 50% of the observations being smaller than the median and the other 50% being larger than the median.
By default quartiles are calculated. 25% is typically called Q1, 50% is called Q2 or the median and 75% is called Q3.
The interquartile range is a measure for variability that is robust against outliers. It is defined the length Q3 - Q2 which covers the 50% of the data in the middle.
A.1.5 Correlation
A.1.5.1 Pearson Correlation
Correlation can be used for ratio/interval scaled features. We typically think of the Pearson correlation coefficient between features (columns).
cc <- iris |>
select(-Species) |>
cor()
cc
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1176 0.8718
## Sepal.Width -0.1176 1.0000 -0.4284
## Petal.Length 0.8718 -0.4284 1.0000
## Petal.Width 0.8179 -0.3661 0.9629
## Petal.Width
## Sepal.Length 0.8179
## Sepal.Width -0.3661
## Petal.Length 0.9629
## Petal.Width 1.0000
cor
calculates a correlation matrix with pairwise correlations between
features. Correlation matrices are symmetric, but different to
distances, the whole matrix is stored.
The correlation between Petal.Length and Petal.Width can be visualized using a scatter plot.
ggplot(iris, aes(Petal.Length, Petal.Width)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
data:image/s3,"s3://crabby-images/17cca/17ccadaa2fce40ccdc3b24dc1438473c985ca630" alt=""
geom_smooth
adds a regression line by fitting a linear model (lm
).
Most points are close to this line indicating strong linear dependence
(i.e., high correlation).
We can calculate individual correlations by specifying two vectors.
Note: with
lets you use columns using just their names and
with(iris, cor(Petal.Length, Petal.Width))
is the same as
cor(iris$Petal.Length, iris$Petal.Width)
.
Finally, we can test if a correlation is significantly different from zero.
with(iris, cor.test(Petal.Length, Petal.Width))
##
## Pearson's product-moment correlation
##
## data: Petal.Length and Petal.Width
## t = 43, df = 148, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9491 0.9730
## sample estimates:
## cor
## 0.9629
A small p-value (less than 0.05) indicates that the observed correlation is significantly different from zero. This can also be seen by the fact that the 95% confidence interval does not span zero.
Sepal.Length and Sepal.Width show little correlation:
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
data:image/s3,"s3://crabby-images/9bfbf/9bfbff8c68698246d085cdf8f84ff6823bdc2eb7" alt=""
with(iris, cor(Sepal.Length, Sepal.Width))
## [1] -0.1176
with(iris, cor.test(Sepal.Length, Sepal.Width))
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Sepal.Width
## t = -1.4, df = 148, p-value = 0.2
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269 0.04351
## sample estimates:
## cor
## -0.1176
A.1.5.2 Rank Correlation
Rank correlation is used for ordinal features or if the correlation is
not linear. To show this, we first convert the continuous features in
the Iris dataset into ordered factors (ordinal) with three levels using
the function cut
.
iris_ord <- iris |>
mutate(across(where(is.numeric),
function(x) cut(x, 3,
labels = c("short", "medium", "long"),
ordered = TRUE)))
iris_ord
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <ord> <ord> <ord> <ord> <fct>
## 1 short medium short short setosa
## 2 short medium short short setosa
## 3 short medium short short setosa
## 4 short medium short short setosa
## 5 short medium short short setosa
## 6 short long short short setosa
## 7 short medium short short setosa
## 8 short medium short short setosa
## 9 short medium short short setosa
## 10 short medium short short setosa
## # ℹ 140 more rows
summary(iris_ord)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## short :59 short :47 short :50 short :50
## medium:71 medium:88 medium:54 medium:54
## long :20 long :15 long :46 long :46
## Species
## setosa :50
## versicolor:50
## virginica :50
iris_ord |>
pull(Sepal.Length)
## [1] short short short short short short short
## [8] short short short short short short short
## [15] medium medium short short medium short short
## [22] short short short short short short short
## [29] short short short short short short short
## [36] short short short short short short short
## [43] short short short short short short short
## [50] short long medium long short medium medium
## [57] medium short medium short short medium medium
## [64] medium medium medium medium medium medium medium
## [71] medium medium medium medium medium medium long
## [78] medium medium medium short short medium medium
## [85] short medium medium medium medium short short
## [92] medium medium short medium medium medium medium
## [99] short medium medium medium long medium medium
## [106] long short long medium long medium medium
## [113] long medium medium medium medium long long
## [120] medium long medium long medium medium long
## [127] medium medium medium long long long medium
## [134] medium medium long medium medium medium long
## [141] medium long medium long medium medium medium
## [148] medium medium medium
## Levels: short < medium < long
Two measures for rank correlation are Kendall’s Tau and Spearman’s Rho.
Kendall’s Tau Rank Correlation Coefficient measures the agreement between two rankings (i.e., ordinal features).
iris_ord |>
select(-Species) |>
sapply(xtfrm) |>
cor(method = "kendall")
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1438 0.7419
## Sepal.Width -0.1438 1.0000 -0.3299
## Petal.Length 0.7419 -0.3299 1.0000
## Petal.Width 0.7295 -0.3154 0.9198
## Petal.Width
## Sepal.Length 0.7295
## Sepal.Width -0.3154
## Petal.Length 0.9198
## Petal.Width 1.0000
Note: We have to use xtfrm
to transform the ordered factors into
ranks, i.e., numbers representing the order.
Spearman’s Rho is equal to the Pearson correlation between the rank values of those two features.
iris_ord |>
select(-Species) |>
sapply(xtfrm) |>
cor(method = "spearman")
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1570 0.7938
## Sepal.Width -0.1570 1.0000 -0.3663
## Petal.Length 0.7938 -0.3663 1.0000
## Petal.Width 0.7843 -0.3517 0.9399
## Petal.Width
## Sepal.Length 0.7843
## Sepal.Width -0.3517
## Petal.Length 0.9399
## Petal.Width 1.0000
Spearman’s Rho is much faster to compute on large datasets then Kendall’s Tau.
Comparing the rank correlation results with the Pearson correlation on the original data shows that they are very similar. This indicates that discretizing data does not result in the loss of too much information.
iris |>
select(-Species) |>
cor()
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1176 0.8718
## Sepal.Width -0.1176 1.0000 -0.4284
## Petal.Length 0.8718 -0.4284 1.0000
## Petal.Width 0.8179 -0.3661 0.9629
## Petal.Width
## Sepal.Length 0.8179
## Sepal.Width -0.3661
## Petal.Length 0.9629
## Petal.Width 1.0000
A.1.6 Density
Density estimation estimate the probability density function (distribution) of a continuous variable from observed data.
Just plotting the data using points is not very helpful for a single feature.
ggplot(iris, aes(x = Petal.Length, y = 0)) + geom_point()
data:image/s3,"s3://crabby-images/78d79/78d792f68b8ab4f636922b6cf5c40e638dcb3920" alt=""
A.1.6.1 Histograms
A histograms shows more about
the distribution by counting how many values fall within a bin and
visualizing the counts as a bar chart. We use geom_rug
to place marks
for the original data points at the bottom of the histogram.
ggplot(iris, aes(x = Petal.Length)) +
geom_histogram() +
geom_rug(alpha = 1/2)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
data:image/s3,"s3://crabby-images/9ad49/9ad49d1304bd60f59454df45c869b835c25c879e" alt=""
Two-dimensional distributions can be visualized using 2-d binning or hexagonal bins.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_bin2d(bins = 10) +
geom_jitter(color = "red")
data:image/s3,"s3://crabby-images/98362/98362e5915398ae0e8d30a922ab07ba5a1dd92a0" alt=""
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_hex(bins = 10) +
geom_jitter(color = "red")
data:image/s3,"s3://crabby-images/b9239/b92394f72c009980fe3b45b86b3f7b0a6ee7773b" alt=""
A.1.6.2 Kernel Density Estimate (KDE)
Kernel density
estimation is
used to estimate the probability density function (distribution) of a
feature. It works by replacing each value with a kernel function (often
a Gaussian) and then adding them up. The result is an estimated
probability density function that looks like a smoothed version of the
histogram. The bandwidth (bw
) of the kernel controls the amount of
smoothing.
ggplot(iris, aes(Petal.Length)) +
geom_density(bw = .2) +
geom_rug(alpha = 1/2)
data:image/s3,"s3://crabby-images/87df6/87df67d6b74bd833947c7776dd5f42370f4bacbe" alt=""
Kernel density estimates can also be done in two dimensions.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_density_2d_filled() +
geom_jitter()
data:image/s3,"s3://crabby-images/2e4c8/2e4c885273512629004cd7c58eac0d68baa8296a" alt=""
A.2 Visualization
Visualization uses several components to convey information:
- Symbols: Circles, dots, lines, bars, …
- Position: Axes (labels and units) and the origin (0/0) are important.
- Length, Size and Area: Should only be used if it faithfully represents information.
- Color: A color should not overpower another. Limit to 3 or 4 colors.
- Angle: Human eyes are not good at comparing angles!
All components of the plot need to convey information. E.g., do not use color just to make it more colorful. All good visualizations show the important patterns clearly.
For space reasons, the chapter on data exploration and visualization was moved from the printed textbook and can now be found in the Data Exploration Web Chapter.
For visualization in R, we will use mainly ggplot2
. The gg in ggplot2
stands for Grammar of Graphics introduced by Wilkinson (2005). The
main idea is that every graph is built from the same basic components:
- the data,
- a coordinate system, and
- visual marks representing the data (geoms).
In ggplot2
, the components are combined using the +
operator.
ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) +
geom_point()
Since we typically use a Cartesian coordinate system, ggplot
uses that
by default. Each geom_
function uses a stat_
function to calculate
what is visualizes. For example, geom_bar
uses stat_count
to create
a bar chart by counting how often each value appears in the data (see
? geom_bar
). geom_point
just uses the stat "identity"
to display
the points using the coordinates as they are.
Additional components like the main title, axis labels, and different scales can be also added. A great introduction can be found in the Section on Data Visualization (Wickham, Çetinkaya-Rundel, and Grolemund 2023), and very useful is RStudio’s Data Visualization Cheatsheet.
Next, we go through the basic visualizations used when working with data.
A.2.1 Histogram
Histograms show the distribution of a single continuous feature.
ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)
data:image/s3,"s3://crabby-images/ae62f/ae62fd2170121e10c86a58acfb7d1c4350eba7d3" alt=""
If the data contains groups, then the group information can be easily added as an aesthetic to the histogram.
ggplot(iris, aes(Petal.Width)) +
geom_histogram(bins = 20, aes(fill = Species))
data:image/s3,"s3://crabby-images/c0bc5/c0bc55d70cb15894131949af3de368b4e39ac853" alt=""
The bars appear stacked with a green blocks on top pf blue blocks.
To display the three distributions behind each other, we change position
for placement and make the bars slightly translucent using alpha
.
ggplot(iris, aes(Petal.Width)) +
geom_histogram(bins = 20, aes(fill = Species), alpha = .5, position = 'identity')
data:image/s3,"s3://crabby-images/4eb5f/4eb5f06eb0835a302d4a6f6d2fc095787d4f8729" alt=""
A.2.2 Boxplot
Boxplots are used to compare the distribution of a feature between different groups. The horizontal line in the middle of the boxes are the group-wise medians, the boxes span the interquartile range. The whiskers (vertical lines) span typically 1.4 times the interquartile range. Points that fall outside that range are typically outliers shown as dots.
ggplot(iris, aes(Species, Sepal.Length)) +
geom_boxplot()
data:image/s3,"s3://crabby-images/b853a/b853a126acb31a37a5cee5d1749080f9ad0d0f09" alt=""
The group-wise medians can also be calculated directly.
iris |> group_by(Species) |>
summarize(across(where(is.numeric), median))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5 3.4 1.5 0.2
## 2 versico… 5.9 2.8 4.35 1.3
## 3 virgini… 6.5 3 5.55 2
To compare the distribution of the four features using a ggplot boxplot, we first have to transform the data into long format (i.e., all feature values are combined into a single column).
library(tidyr)
iris_long <- iris |>
mutate(id = row_number()) |>
pivot_longer(1:4)
ggplot(iris_long, aes(name, value)) +
geom_boxplot() +
labs(y = "Original value")
data:image/s3,"s3://crabby-images/5e2b4/5e2b44badacafc0ebfb2af9a2f809b49a12e2b2a" alt=""
This visualization is only useful if all features have roughly the same range. The data can be scaled first to compare the distributions.
library(tidyr)
iris_long_scaled <- iris |>
scale_numeric() |>
mutate(id = row_number()) |> pivot_longer(1:4)
ggplot(iris_long_scaled, aes(name, value)) +
geom_boxplot() +
labs(y = "Scaled value")
data:image/s3,"s3://crabby-images/903de/903dead1fdef87df4ae0eff3f70db57e3688cd77" alt=""
A.2.3 Scatter plot
Scatter plots show the relationship between two continuous features.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species))
data:image/s3,"s3://crabby-images/042d1/042d17ec3dc15a8a6d04fed1cea4e300fd698d15" alt=""
We can add a regression using geom_smooth
with the
linear model method to show that there is a
linear relationship between the two variables. A confidence interval
for the regression is also shown.
This can be suppressed using se = FALSE
.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species)) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
data:image/s3,"s3://crabby-images/37959/37959512abc1667e776c6b0b6956038a1e411abd" alt=""
We can also perform group-wise linear regression by adding the color
aesthetic also to geom_smooth
.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species)) +
geom_smooth(method = "lm", aes(color = Species))
## `geom_smooth()` using formula = 'y ~ x'
data:image/s3,"s3://crabby-images/e4b18/e4b182af9692ca3735392e416dda6daa631d3ced" alt=""
The same can be achieved by using the color aesthetic in the qqplot
call,
then it applies to all geoms.
A.2.4 Scatter Plot Matrix
A scatter plot matrix show the relationship between all pairs of features by arranging panels in a matrix. First, lets look at a regular R-base plot.
pairs(iris, col = iris$Species)
data:image/s3,"s3://crabby-images/dcee8/dcee83fba53d85574b24ad2c055836ea7a79aec7" alt=""
The package GGally
provides a way more sophisticated visualization.
library("GGally")
ggpairs(iris, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
data:image/s3,"s3://crabby-images/c6ad0/c6ad0ed141c2ecec33246d29628e20fa999d57f6" alt=""
Additional plots (histograms, density estimates and box plots) and correlation coefficients are shown in different panels. See the Data Quality section for a description of how to interpret the different panels.
A.2.5 Matrix Visualization
Matrix visualization shows the values in the matrix using a color scale.
We need the long format for tidyverse.
iris_long <- as_tibble(iris_matrix) |>
mutate(id = row_number()) |>
pivot_longer(1:4)
head(iris_long)
## # A tibble: 6 × 3
## id name value
## <int> <chr> <dbl>
## 1 1 Sepal.Length 5.1
## 2 1 Sepal.Width 3.5
## 3 1 Petal.Length 1.4
## 4 1 Petal.Width 0.2
## 5 2 Sepal.Length 4.9
## 6 2 Sepal.Width 3
ggplot(iris_long, aes(x = name, y = id)) +
geom_tile(aes(fill = value))
data:image/s3,"s3://crabby-images/36b10/36b10de9fbced2268be79a9882d23b0ef86e55aa" alt=""
Smaller values are darker. Package seriation
provides a simpler
plotting function.
data:image/s3,"s3://crabby-images/a2c76/a2c76ed2a65a266db07c9ecbe59f5ed0a40e1c3b" alt=""
We can scale the features to z-scores to make them better comparable.
data:image/s3,"s3://crabby-images/43377/43377e51a3379baf3a7fbad26ab67bd590847adb" alt=""
This reveals red and blue blocks. Each row is a flower and the flowers in the Iris dataset are sorted by species. The blue blocks for the top 50 flowers show that these flowers are smaller than average for all but Sepal.Width and the red blocks show that the bottom 50 flowers are larger for most features.
Often, reordering data matrices help with visualization. A reordering technique is called seriation. Ir reorders rows and columns to place more similar points closer together.
data:image/s3,"s3://crabby-images/ab791/ab791e2fab21b3fa02d66b96d9b1fc1430cffc06" alt=""
We see that the rows (flowers) are organized from very blue to very red and the features are reordered to move Sepal.Width all the way to the right because it is very different from the other features.
A.2.6 Correlation Matrix
A correlation matrix contains the correlation between features.
cm1 <- iris |>
select(-Species) |>
as.matrix() |>
cor()
cm1
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1176 0.8718
## Sepal.Width -0.1176 1.0000 -0.4284
## Petal.Length 0.8718 -0.4284 1.0000
## Petal.Width 0.8179 -0.3661 0.9629
## Petal.Width
## Sepal.Length 0.8179
## Sepal.Width -0.3661
## Petal.Length 0.9629
## Petal.Width 1.0000
Package ggcorrplot
provides a visualization for correlation matrices.
library(ggcorrplot)
ggcorrplot(cm1)
data:image/s3,"s3://crabby-images/6808a/6808a563f1cdfd40bf19683d8b0e61e64bb976c5" alt=""
Package seriation
provides a reordered version for this plot using a
heatmap.
gghmap(cm1, prop = TRUE)
data:image/s3,"s3://crabby-images/56b9f/56b9fb331f30eb98e41948048a1ba60534871eec" alt=""
Correlations can also be calculates between objects by transposing the data matrix.
cm2 <- iris |>
select(-Species) |>
as.matrix() |>
t() |>
cor()
ggcorrplot(cm2)
data:image/s3,"s3://crabby-images/78622/78622a864d0d1c3202e769699df793faad7be36b" alt=""
Object-to-object correlations can be used as a measure of similarity. The dark red blocks indicate different species.
A.2.7 Parallel Coordinates Plot
Parallel coordinate plots can visualize several features in a single plot. Lines connect the values for each object (flower).
library(GGally)
ggparcoord(iris, columns = 1:4, groupColumn = 5)
data:image/s3,"s3://crabby-images/0ba0d/0ba0db559a91b97b73604a4a7661e74fcb470d76" alt=""
The plot can be improved by reordering the variables to place correlated features next to each other.
o <- seriate(as.dist(1-cor(iris[,1:4])), method = "BBURCG")
get_order(o)
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 3 4 1 2
ggparcoord(iris,
columns = as.integer(get_order(o)),
groupColumn = 5)
data:image/s3,"s3://crabby-images/ff495/ff49507ee8f172c771bf4014dc482f5da847f391" alt=""
A.2.8 Star Plot
Star plots are a type of radar chart to visualize three or more quantitative variables represented on axes starting from the plot’s origin. R-base offers a simple star plot. We plot the first 5 flowers of each species.
flowers_5 <- iris[c(1:5, 51:55, 101:105), ]
flowers_5
## # A tibble: 15 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 7 3.2 4.7 1.4 versic…
## 7 6.4 3.2 4.5 1.5 versic…
## 8 6.9 3.1 4.9 1.5 versic…
## 9 5.5 2.3 4 1.3 versic…
## 10 6.5 2.8 4.6 1.5 versic…
## 11 6.3 3.3 6 2.5 virgin…
## 12 5.8 2.7 5.1 1.9 virgin…
## 13 7.1 3 5.9 2.1 virgin…
## 14 6.3 2.9 5.6 1.8 virgin…
## 15 6.5 3 5.8 2.2 virgin…
stars(flowers_5[, 1:4], ncol = 5)
data:image/s3,"s3://crabby-images/68ff0/68ff04f39c501ea3bf6dd2aca64b105235dd741e" alt=""
A.2.9 More Visualizations
A well organized collection of visualizations with code can be found at The R Graph Gallery.
A.3 Exercises*
The R package palmerpenguins contains measurements for penguin of different species from the Palmer Archipelago, Antarctica. Install the package. It provides a CSV file which can be read in the following way:
library("palmerpenguins")
penguins <- read_csv(path_to_file("penguins.csv"))
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_m...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm
## <chr> <chr> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7
## 2 Adelie Torgersen 39.5 17.4
## 3 Adelie Torgersen 40.3 18
## 4 Adelie Torgersen NA NA
## 5 Adelie Torgersen 36.7 19.3
## 6 Adelie Torgersen 39.3 20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, year <dbl>
Create in RStudio a new R Markdown document. Apply the code in the sections of this chapter to the data set to answer the following questions.
- Group the penguins by species, island or sex. What can you find out?
- Create histograms and boxplots for each continuous variable. Interpret the distributions.
- Create scatterplots and a scatterplot matrix. Can you identify correlations?
- Create a reordered correlation matrix visualization. What does the visualizations show?
Make sure your markdown document contains now a well formatted report.
Use the Knit
button to create a HTML document.