2 Data
Data for data mining is typically organized in tabular form, with rows containing the objects of interest and columns representing attributes describing the objects. We will discuss topics like data quality, sampling, feature selection, and how to measure similarities between objects and features. The second part of this chapter deals with data exploration and visualization.
Packages Used in this Chapter
pkgs <- c("arules", "caret", "factoextra", "GGally",
"ggcorrplot", "hexbin", "palmerpenguins", "plotly",
"proxy", "seriation", "tidyverse")
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)
The packages used for this chapter are:
- arules (Hahsler et al. 2024)
- caret (Kuhn 2023)
- factoextra (Kassambara and Mundt 2020)
- GGally (Schloerke et al. 2024)
- ggcorrplot (Kassambara 2023)
- hexbin (Carr, Lewin-Koh, and Maechler 2024)
- palmerpenguins (Horst, Hill, and Gorman 2022)
- plotly (Sievert et al. 2024)
- proxy (Meyer and Buchta 2022)
- seriation (Hahsler, Buchta, and Hornik 2024)
- tidyverse (Wickham 2023c)
2.1 Types of Data
2.1.1 Attributes and Measurement
The values of features can be measured on several scales ranging from simple labels all the way to numbers. The scales come in four levels.
Scale Name | Description | Operations | Statistics | R |
---|---|---|---|---|
Nominal | just a label (e.g., red, green) | \(==, !=\) | counts | factor |
Ordinal | label with order (e.g., small, med., large) | \(<, >\) | median | ordered factor |
Interval | difference between two values is meaningful (regular number) | \(+, -\) | mean, sd | numeric |
Ratio | has a natural zero (e.g., count, distance) | \(/, *\) | percent | numeric |
The scales build on each other meaning that an ordinal variable also has the characteristics of a nominal variable with the added order information. We often do not differentiate between interval and ratio scale because we rarely not need to calculate percentages or other statistics that require a meaningful zero value.
Nominal data is created using factor()
. If the factor levels
are not specified, then they are created in alphabetical order.
Ordinal data is created using ordered()
. The levels specify the order.
Ratio/interval data is created as a simple vector.
c(1, 2, 3, 4, 3, 3)
## [1] 1 2 3 4 3 3
2.1.2 The Iris Dataset
We will use a toy dataset that comes with R. Fisher’s iris
dataset gives the
measurements in centimeters of the variables sepal length, sepal width
petal length, and petal width representing the features for 150 flowers (the objects).
The dataset contains 50
flowers from each of 3 species of iris. The species are Iris Setosa,
Iris Versicolor, and Iris Virginica. For more details see ? iris
.
We load the iris data set. Datasets that come with R or R packages can
be loaded with data()
. The standard format for data in R is a
data.frame. We convert the data.frame into a tidyverse tibble.
library(tidyverse)
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
We see that the data contains 150 rows (flowers) and 5 features. tibbles
only show the first few rows and do not show all features, if they do
not fit the screen width. We can call print
and define how many rows
to show using parameter n
and force print to show all features by
changing the width
to infinity.
print(iris, n = 3, width = Inf)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## # ℹ 147 more rows
2.2 Data Quality
Assessing the quality of the available data is crucial before we start
using the data. Start with summary statistics for each column to
identify outliers and missing values. The easiest way is to use the base R
function summary()
.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
You can also summarize individual columns using tidyverse’s dplyr functions.
Using across()
, multiple columns can be summarized. Un the following,
we calculate all numeric columns using the mean
function.
iris |>
summarize(across(where(is.numeric), mean))
## # A tibble: 1 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.84 3.06 3.76 1.20
To find outliers or data problems, you need to look for very small values (often a suspicious large number of zeros) using min and for extremely large values using max. Comparing median and mean tells us if the distribution is symmetric.
A visual method to inspect the data is to use a scatterplot matrix (we
use here ggpairs()
from package GGally
). In this plot, we can
visually identify noise data points and outliers (points that are far
from the majority of other points).
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(iris, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
This useful visualization combines many visualizations used to understand the data and check for quality issues. Rows and columns are the features in the data. We have also specified the aesthetic that we want to group each species using a different color.
The visualizations in the diagonal panels show the smoothed histograms with the distribution for each feature. The plot tries to pick a good number of bins for the histogram (see messages above). The distribution can be checked if it is close to normal, unimodal or highly skewed. Also, we can see if the different groups are overlapping or separable for each feature. For example, the three distributions for
Sepal.Width
are almost identical meaning that it is hard to distinguish between the different species using this feature alone.Petal.Lenght
andPetal.Width
are much better.The lower-left triangle panels contain scatterplots for all pairs features. These are useful to see if there if features are correlated (the pearson correlation coefficient if printed in the upper-right triangle). For example,
Petal.Length
andPetal.Width
are highly correlated overall This makes sense since larger plants will have both longer and wider petals. Inside the Setosa group this correlation it is a lot weaker. We can also see if groups are well separated using projections on two variables. Almost all panels show that Setosa forms a point cloud well separated from the other two classes while Versicolor and Virginica overlap. We can also see outliers that are far from the other data points in its group. See if you can spot the one red dot that is far away from all others.The last row/column represents in this data set the class label Species. It is a nominal variable so the plots are different. The bottom row panels show (regular) histograms. The last column shows boxplots to represent the distribution of the different features by group. Dots represent outliers. Finally, the bottom-right panel contains the counts for the different groups as a barplot. In this data set, each group has the same number of observations.
Many data mining methods require complete data, that is the data cannot
contain missing values (NA
). To remove missing values and duplicates
(identical data points which might be a mistake in the data), we often
do this:
clean.data <- iris |>
drop_na() |>
unique()
summary(clean.data)
## Sepal.Length Sepal.Width Petal.Length
## Min. :4.30 Min. :2.00 Min. :1.00
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60
## Median :5.80 Median :3.00 Median :4.30
## Mean :5.84 Mean :3.06 Mean :3.75
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10
## Max. :7.90 Max. :4.40 Max. :6.90
## Petal.Width Species
## Min. :0.10 setosa :50
## 1st Qu.:0.30 versicolor:50
## Median :1.30 virginica :49
## Mean :1.19
## 3rd Qu.:1.80
## Max. :2.50
Note that one non-unique case is gone leaving only 149 flowers. The data did not contain missing values, but if it did, they would also have been dropped. Typically, you should spend a lot more time on data cleaning.
2.3 Data Preprocessing
2.3.1 Aggregation
Data often contains groups and we want to compare these groups. We group the iris dataset by species and then calculate a summary statistic for each group.
iris |>
group_by(Species) |>
summarize(across(everything(), mean))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versico… 5.94 2.77 4.26 1.33
## 3 virgini… 6.59 2.97 5.55 2.03
iris |>
group_by(Species) |>
summarize(across(everything(), median))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5 3.4 1.5 0.2
## 2 versico… 5.9 2.8 4.35 1.3
## 3 virgini… 6.5 3 5.55 2
Using this information, we can compare how features differ between groups.
2.3.2 Sampling
Sampling is often used in data mining to reduce the dataset size before modeling or visualization.
2.3.2.1 Random Sampling
The built-in sample function can sample from a vector. Here we sample with replacement.
We often want to sample rows from a dataset. This can be done by
sampling without replacement from a vector with row indices (using the
functions seq()
and nrow()
). The sample vector is then used to
subset the rows of the dataset.
take <- sample(seq(nrow(iris)), size = 15)
take
## [1] 43 79 26 52 18 20 7 73 149 88 140 139 67 27
## [15] 101
iris[take, ]
## # A tibble: 15 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.4 3.2 1.3 0.2 setosa
## 2 6 2.9 4.5 1.5 versic…
## 3 5 3 1.6 0.2 setosa
## 4 6.4 3.2 4.5 1.5 versic…
## 5 5.1 3.5 1.4 0.3 setosa
## 6 5.1 3.8 1.5 0.3 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 6.3 2.5 4.9 1.5 versic…
## 9 6.2 3.4 5.4 2.3 virgin…
## 10 6.3 2.3 4.4 1.3 versic…
## 11 6.9 3.1 5.4 2.1 virgin…
## 12 6 3 4.8 1.8 virgin…
## 13 5.6 3 4.5 1.5 versic…
## 14 5 3.4 1.6 0.4 setosa
## 15 6.3 3.3 6 2.5 virgin…
dplyr
from tidyverse lets us sample rows from tibbles directly using
slice_sample()
. I set the random number generator seed to make the
results reproducible.
set.seed(1000)
s <- iris |>
slice_sample(n = 15)
library(GGally)
ggpairs(s, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Instead of n
you can also specify the proportion of rows to select using
prob
.
2.3.2.2 Stratified Sampling
Stratified sampling is a method of sampling from a population which can be partitioned into subpopulations, while controlling the proportions of the subpopulation in the resulting sample.
In the following, the subpopulations are the different types of species and we want to make sure to sample the same number (5) flowers from each. This can be achieved by first grouping the data by species and then sampling a number of flowers from each group.
set.seed(1000)
s2 <- iris |>
group_by(Species) |>
slice_sample(n = 5) |>
ungroup()
library(GGally)
ggpairs(s2, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
More sophisticated sampling procedures are implemented in the package
sampling
.
2.3.3 Dimensionality Reduction
The number of features is often called the dimensional of the data following the idea that each feature (at least the numeric features) can be seen as an axis of the data. High-dimensional data is harder to analyze by the user (e.g., visualize). It also is problematic for many data mining algorithms since it requires more memory and computational resources.
Dimensionality reduction tries to represent high-dimensional data in a low-dimensional space so that the low-dimensional representation retains some meaningful properties (e.g., information about similarity or distances) of the original data. Dimensionality reduction is used for visualization and as a prepossessing technique before using other data mining methods like clustering and classification.
2.3.3.1 Principal Components Analysis (PCA)
PCA calculates principal components (a set of new orthonormal basis vectors in the data space) from data points such that the first principal component explains the most variability in the data, the second the next most and so on. In data analysis, PCA is used to project high-dimensional data points onto the first few (typically two) principal components for visualization as a scatter plot and as preprocessing for modeling (e.g., before k-means clustering). Points that are closer together in the high-dimensional original space, tend also be closer together when projected into the lower-dimensional space,
We can use an interactive 3-d plot (from package plotly
) to look at
three of the four dimensions of the iris dataset. Note that it is hard
to visualize more than 3 dimensions.
plotly::plot_ly(iris,
x = ~Sepal.Length,
y = ~Petal.Length,
z = ~Sepal.Width,
color = ~Species, size = 1) |>
plotly::add_markers()
The principal components can be calculated from a matrix using the
function prcomp()
. We select all numeric columns (by unselecting the
species column) and convert the tibble into a matrix before the
calculation.
pc <- iris |>
select(-Species) |>
as.matrix() |>
prcomp()
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.056 0.4926 0.2797 0.15439
## Proportion of Variance 0.925 0.0531 0.0171 0.00521
## Cumulative Proportion 0.925 0.9777 0.9948 1.00000
How important is each principal component can also be seen using a
scree plot. The plot
function for the result of the prcomp
function visualizes how much
variability in the data is explained by each additional principal
component.
plot(pc, type = "line")
Note that the first principal component (PC1) explains most of the variability in the iris dataset.
To find out what information is stored in the object pc
, we can
inspect the raw object (display structure).
str(pc)
## List of 5
## $ sdev : num [1:4] 2.056 0.493 0.28 0.154
## $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## $ center : Named num [1:4] 5.84 3.06 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ scale : logi FALSE
## $ x : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## - attr(*, "class")= chr "prcomp"
The object pc
(like most objects in R) is a list with a class
attribute. The list element x
contains the data points projected on
the principal components. We can convert the matrix into a tibble and
add the species column from the original dataset back (since the rows
are in the same order), and then display the data projected on the first
two principal components.
iris_projected <- as_tibble(pc$x) |>
add_column(Species = iris$Species)
ggplot(iris_projected, aes(x = PC1, y = PC2, color = Species)) +
geom_point()
Flowers that are displayed close together in this projection are also close together in the original 4-dimensional space. Since the first principal component represents most of the variability, we can also show the data projected only on PC1.
ggplot(iris_projected,
aes(x = PC1, y = 0, color = Species)) +
geom_point() +
scale_y_continuous(expand=c(0,0)) +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank()
)
We see that we can perfectly separate the species Setosa using just the first principal component. The other two species are harder to separate.
A plot of the projected data with the original axes added as arrows is called a biplot. If the arrows (original axes) align roughly with the axes of the projection, then they are correlated (linearly dependent).
library(factoextra)
fviz_pca(pc)
We can also display only the old and new axes.
fviz_pca_var(pc)
We see Petal.Width and Petal.Length point in the same direction which indicates that they are highly correlated. They are also roughly aligned with PC1 (called Dim1 in the plot) which means that PC1 represents most of the variability of these two variables. Sepal.Width is almost aligned with the y-axis and therefore it is represented by PC2 (Dim2). Petal.Width/Petal.Length and Sepal.Width are almost at 90 degrees, indicating that they are close to uncorrelated. Sepal.Length is correlated to all other variables and represented by both, PC1 and PC2 in the projection.
2.3.3.2 Multi-Dimensional Scaling (MDS)
MDS is similar to PCA. Instead of data points, it starts with pairwise distances (i.e., a distance matrix) and produces a space where points are placed to represent these distances as well as possible. The axes in this space are called components and are similar to the principal components in PCA.
First, we calculate a distance matrix (Euclidean distances) from the 4-d space of the iris dataset.
Metric (classic) MDS tries to construct a space where points with lower
distances are placed closer together. We project the data represented by
a distance matrix on k = 2
dimensions.
fit <- cmdscale(d, k = 2)
colnames(fit) <- c("comp1", "comp2")
fit <- as_tibble(fit) |>
add_column(Species = iris$Species)
ggplot(fit, aes(x = comp1, y = comp2, color = Species)) +
geom_point()
The resulting projection is similar (except for rotation and reflection) to the result of the projection using PCA.
2.3.3.3 Non-Parametric Multidimensional Scaling
Non-parametric multidimensional scaling performs MDS while relaxing the
need of linear relationships. Methods are available in package MASS
as
functions isoMDS()
(implements isoMAP) and sammon()
.
2.3.3.4 Embeddings: Nonlinear Dimensionality Reduction Methods
Nonlinear dimensionality reduction is also called manifold learning or creating a low-dimensional embedding. These methods have become very popular and many methods with varying properties exist. Some popular methods are:
-
t-distributed stochastic neighbor embedding
(
Rtsne()
in packageRtsne
) and uniform manifold approximation and projection (umap()
in packageumap
) are used for projecting data to 2 dimensions for visualization. -
Autoencoders is a type of
artificial neural network used to learn an efficient representation (encoding)
for a set of data by minimizing the reconstruction error.
In R, autoencoders are typically created using the
keras
package. -
Word2vec (
word2vec()
inword2vec
) is used in natural language processing to convert words to numeric vectors for similarity calculation.
2.3.4 Feature Subset Selection
Feature selection is the process of identifying the features that are used to create a model. We will talk about feature selection when we discuss classification models in Chapter 3 in Feature Selection*.
2.3.5 Discretization
Some data mining methods require discrete data. Discretization converts continuous features into discrete features. As an example, we will discretize the continuous feature Petal.Width. Before we perform discretization, we should look at the distribution and see if it gives us an idea how we should group the continuous values into a set of discrete values. A histogram visualizes the distribution of a single continuous feature.
ggplot(iris, aes(x = Petal.Width)) +
geom_histogram(binwidth = .2)
The bins in the histogram represent a discretization using a fixed bin
width. The R function cut()
performs equal interval width
discretization creating a vector of type factor
where each level
represents an interval.
iris |>
pull(Sepal.Width) |>
cut(breaks = 3)
## [1] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [6] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [11] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4]
## [16] (3.6,4.4] (3.6,4.4] (2.8,3.6] (3.6,4.4] (3.6,4.4]
## [21] (2.8,3.6] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [26] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [31] (2.8,3.6] (2.8,3.6] (3.6,4.4] (3.6,4.4] (2.8,3.6]
## [36] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [41] (2.8,3.6] (2,2.8] (2.8,3.6] (2.8,3.6] (3.6,4.4]
## [46] (2.8,3.6] (3.6,4.4] (2.8,3.6] (3.6,4.4] (2.8,3.6]
## [51] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8] (2,2.8]
## [56] (2,2.8] (2.8,3.6] (2,2.8] (2.8,3.6] (2,2.8]
## [61] (2,2.8] (2.8,3.6] (2,2.8] (2.8,3.6] (2.8,3.6]
## [66] (2.8,3.6] (2.8,3.6] (2,2.8] (2,2.8] (2,2.8]
## [71] (2.8,3.6] (2,2.8] (2,2.8] (2,2.8] (2.8,3.6]
## [76] (2.8,3.6] (2,2.8] (2.8,3.6] (2.8,3.6] (2,2.8]
## [81] (2,2.8] (2,2.8] (2,2.8] (2,2.8] (2.8,3.6]
## [86] (2.8,3.6] (2.8,3.6] (2,2.8] (2.8,3.6] (2,2.8]
## [91] (2,2.8] (2.8,3.6] (2,2.8] (2,2.8] (2,2.8]
## [96] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8] (2,2.8]
## [101] (2.8,3.6] (2,2.8] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [106] (2.8,3.6] (2,2.8] (2.8,3.6] (2,2.8] (2.8,3.6]
## [111] (2.8,3.6] (2,2.8] (2.8,3.6] (2,2.8] (2,2.8]
## [116] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2,2.8] (2,2.8]
## [121] (2.8,3.6] (2,2.8] (2,2.8] (2,2.8] (2.8,3.6]
## [126] (2.8,3.6] (2,2.8] (2.8,3.6] (2,2.8] (2.8,3.6]
## [131] (2,2.8] (3.6,4.4] (2,2.8] (2,2.8] (2,2.8]
## [136] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [141] (2.8,3.6] (2.8,3.6] (2,2.8] (2.8,3.6] (2.8,3.6]
## [146] (2.8,3.6] (2,2.8] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## Levels: (2,2.8] (2.8,3.6] (3.6,4.4]
Other discretization methods include equal frequency discretization or
using k-means clustering. These methods are implemented by several R
packages. We use here the implementation in package arules
and
visualize the results as histograms with blue lines to separate
intervals assigned to each discrete value.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
iris |> pull(Petal.Width) |>
discretize(method = "interval", breaks = 3)
## [1] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [6] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [11] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [16] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [21] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [26] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [31] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [36] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [41] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [46] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
## [51] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [56] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [61] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [66] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [71] [1.7,2.5] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [76] [0.9,1.7) [0.9,1.7) [1.7,2.5] [0.9,1.7) [0.9,1.7)
## [81] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [86] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [91] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [96] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [101] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [106] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [111] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [116] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7)
## [121] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [126] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7)
## [131] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7) [0.9,1.7)
## [136] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [141] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [146] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## attr(,"discretized:breaks")
## [1] 0.1 0.9 1.7 2.5
## attr(,"discretized:method")
## [1] interval
## Levels: [0.1,0.9) [0.9,1.7) [1.7,2.5]
To show the differences between the methods, we use the three discretization methods and draw blue lines in the histogram to show how they cut the data.
ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
geom_vline(
xintercept = iris |>
pull(Petal.Width) |>
discretize(method = "interval", breaks = 3, onlycuts = TRUE),
color = "blue"
) +
labs(title = "Discretization: interval",
subtitle = "Blue lines are discretization boundaries")
ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
geom_vline(
xintercept = iris |>
pull(Petal.Width) |>
discretize(method = "frequency", breaks = 3, onlycuts = TRUE),
color = "blue"
) +
labs(title = "Discretization: frequency",
subtitle = "Blue lines are discretization boundaries")
ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
geom_vline(
xintercept = iris |>
pull(Petal.Width) |>
discretize(method = "cluster", breaks = 3, onlycuts = TRUE),
color = "blue"
) +
labs(title = "Discretization: cluster",
subtitle = "Blue lines are discretization boundaries")
The user needs to decide on the number of intervals and the used method.
2.3.6 Variable Transformation: Standardization
Standardizing (scaling, normalizing) the range of features values is important to make them comparable. The most popular method is to convert the values of each feature to z-scores. by subtracting the mean (centering) and dividing by the standard deviation (scaling). The standardized feature will have a mean of zero and are measured in standard deviations from the mean. Positive values indicate how many standard deviation the original feature value was above the average. Negative standardized values indicate below-average values.
R-base provides the function scale()
to standardize the columns in a data.frame. Tidyverse currently does not have a simple scale function, so we
make one It mutates all numeric columns using an anonymous function
that calculates the z-score.
scale_numeric <- function(x)
x |>
mutate(across(where(is.numeric),
function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)))
iris.scaled <- iris |>
scale_numeric()
iris.scaled
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 -0.898 1.02 -1.34 -1.31 setosa
## 2 -1.14 -0.132 -1.34 -1.31 setosa
## 3 -1.38 0.327 -1.39 -1.31 setosa
## 4 -1.50 0.0979 -1.28 -1.31 setosa
## 5 -1.02 1.25 -1.34 -1.31 setosa
## 6 -0.535 1.93 -1.17 -1.05 setosa
## 7 -1.50 0.786 -1.34 -1.18 setosa
## 8 -1.02 0.786 -1.28 -1.31 setosa
## 9 -1.74 -0.361 -1.34 -1.31 setosa
## 10 -1.14 0.0979 -1.28 -1.44 setosa
## # ℹ 140 more rows
summary(iris.scaled)
## Sepal.Length Sepal.Width Petal.Length
## Min. :-1.8638 Min. :-2.426 Min. :-1.562
## 1st Qu.:-0.8977 1st Qu.:-0.590 1st Qu.:-1.222
## Median :-0.0523 Median :-0.132 Median : 0.335
## Mean : 0.0000 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 0.6722 3rd Qu.: 0.557 3rd Qu.: 0.760
## Max. : 2.4837 Max. : 3.080 Max. : 1.780
## Petal.Width Species
## Min. :-1.442 setosa :50
## 1st Qu.:-1.180 versicolor:50
## Median : 0.132 virginica :50
## Mean : 0.000
## 3rd Qu.: 0.788
## Max. : 1.706
The standardized feature has a mean of zero and most “normal” values will fall in the range \([-3,3]\) and is measured in standard deviations from the average. Negative values mean smaller than the average and positive values mean larger than the average.
2.4 Measures of Similarity and Dissimilarity
Proximities help with quantifying how similar two objects are. Similariy is a concept from geometry. The best-known way to define similarity is Euclidean distance, but proximities can be measured in different ways depending on the information we have about the objects.
R stores proximity as dissimilarities/distances matrices. Similarities are first converted to dissimilarities. Distances are symmetric, i.e., the distance from A to B is the same as the distance from B to A. R therefore stores only a triangle (typically the lower triangle) of the distance matrix.
2.4.1 Minkowsky Distances
The Minkowsky distance is a family of metric distances including Euclidean and Manhattan distance. To avoid one feature to dominate the distance calculation, scaled data is typically used. We select the first 5 flowers for this example.
iris_sample <- iris.scaled |>
select(-Species) |>
slice(1:5)
iris_sample
## # A tibble: 5 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.898 1.02 -1.34 -1.31
## 2 -1.14 -0.132 -1.34 -1.31
## 3 -1.38 0.327 -1.39 -1.31
## 4 -1.50 0.0979 -1.28 -1.31
## 5 -1.02 1.25 -1.34 -1.31
Different types of Minkowsky distance matrices between the first 5
flowers can be calculated using dist()
.
dist(iris_sample, method = "euclidean")
## 1 2 3 4
## 2 1.1723
## 3 0.8428 0.5216
## 4 1.1000 0.4326 0.2829
## 5 0.2593 1.3819 0.9883 1.2460
dist(iris_sample, method = "manhattan")
## 1 2 3 4
## 2 1.3887
## 3 1.2280 0.7570
## 4 1.5782 0.6484 0.4635
## 5 0.3502 1.4973 1.3367 1.6868
dist(iris_sample, method = "maximum")
## 1 2 3 4
## 2 1.1471
## 3 0.6883 0.4589
## 4 0.9177 0.3623 0.2294
## 5 0.2294 1.3766 0.9177 1.1471
We see that only the lower triangle of the distance matrices are stored (note that rows start with row 2).
2.4.2 Distances for Binary Data
Binary data can be encodes as 0
and 1
(numeric) or TRUE
and
FALSE
(logical).
b <- rbind(
c(0,0,0,1,1,1,1,0,0,1),
c(0,0,1,1,1,0,0,1,0,0)
)
b
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 1 1 1 1 0 0 1
## [2,] 0 0 1 1 1 0 0 1 0 0
b_logical <- apply(b, MARGIN = 2, as.logical)
b_logical
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
## [2,] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## [,10]
## [1,] TRUE
## [2,] FALSE
2.4.2.1 Hamming Distance
The Hamming distance is the number of mismatches between two binary vectors. For 0-1 data this is equivalent to the Manhattan distance and also the squared Euclidean distance.
2.4.2.2 Jaccard Index
The Jaccard index is a similarity measure that focuses on matching 1s. R converts the similarity into a dissimilarity using \(d_{J} = 1 - s_{J}\).
dist(b, method = "binary")
## 1
## 2 0.7143
2.4.3 Distances for Mixed Data
Most distance measures work only on numeric data. Often, we have a mixture of numbers and nominal or ordinal features like this data:
people <- tibble(
height = c( 160, 185, 170),
weight = c( 52, 90, 75),
sex = c( "female", "male", "male")
)
people
## # A tibble: 3 × 3
## height weight sex
## <dbl> <dbl> <chr>
## 1 160 52 female
## 2 185 90 male
## 3 170 75 male
It is important that nominal features are stored as factors and not
character (<chr>
).
people <- people |>
mutate(across(where(is.character), factor))
people
## # A tibble: 3 × 3
## height weight sex
## <dbl> <dbl> <fct>
## 1 160 52 female
## 2 185 90 male
## 3 170 75 male
2.4.3.1 Gower’s Coefficient
The Gower’s coefficient of similarity works with mixed data by
calculating the appropriate similarity for each feature and then
aggregating them into a single measure. The package proxy
implements
Gower’s coefficient converted into a distance.
library(proxy)
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
d_Gower <- dist(people, method = "Gower")
d_Gower
## 1 2
## 2 1.0000
## 3 0.6684 0.3316
Gower’s coefficient calculation implicitly scales the data because it calculates distances on each feature individually, so there is no need to scale the data first.
2.4.3.2 Using Euclidean Distance with Mixed Data
Sometimes methods (e.g., k-means) can only use Euclidean distance. In this case, nominal features can be converted into 0-1 dummy variables. After scaling, Euclidean distance will result in a usable distance measure.
We use package caret
to create dummy variables.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
data_dummy <- dummyVars(~., people) |>
predict(people)
data_dummy
## height weight sex.female sex.male
## 1 160 52 1 0
## 2 185 90 0 1
## 3 170 75 0 1
Note that feature sex has now two columns. If we want that height, weight and sex have the same influence on the distance measure, then we need to weight the sex columns by 1/2 after scaling.
weight_matrix <- matrix(c(1, 1, 1/2, 1/2),
ncol = 4,
nrow = nrow(data_dummy),
byrow = TRUE)
data_dummy_scaled <- scale(data_dummy) * weight_matrix
d_dummy <- dist(data_dummy_scaled)
d_dummy
## 1 2
## 2 3.064
## 3 1.891 1.427
The distance using dummy variables is consistent with Gower’s distance. However, note that Gower’s distance is scaled between 0 and 1 while the Euclidean distance is not.
ggplot(tibble(d_dummy, d_Gower), aes(x = d_dummy, y = d_Gower)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## Don't know how to automatically pick scale for object of
## type <dist>. Defaulting to continuous.
## Don't know how to automatically pick scale for object of
## type <dist>. Defaulting to continuous.
## `geom_smooth()` using formula = 'y ~ x'
2.4.4 More Proximity Measures
The package proxy
implements a wide array of proximity measures
(similarity measures are converted into distances).
library(proxy)
pr_DB$get_entry_names()
## [1] "Jaccard" "Kulczynski1" "Kulczynski2"
## [4] "Mountford" "Fager" "Russel"
## [7] "simple matching" "Hamman" "Faith"
## [10] "Tanimoto" "Dice" "Phi"
## [13] "Stiles" "Michael" "Mozley"
## [16] "Yule" "Yule2" "Ochiai"
## [19] "Simpson" "Braun-Blanquet" "cosine"
## [22] "angular" "eJaccard" "eDice"
## [25] "correlation" "Chi-squared" "Phi-squared"
## [28] "Tschuprow" "Cramer" "Pearson"
## [31] "Gower" "Euclidean" "Mahalanobis"
## [34] "Bhjattacharyya" "Manhattan" "supremum"
## [37] "Minkowski" "Canberra" "Wave"
## [40] "divergence" "Kullback" "Bray"
## [43] "Soergel" "Levenshtein" "Podani"
## [46] "Chord" "Geodesic" "Whittaker"
## [49] "Hellinger" "fJaccard"
Note that loading the package proxy
overwrites the default dist()
function
in R.
You can specify which dist function to use by specifying the package in
the call. For example stats::dist()
calls the default function in R
(the package stats
is part of R) while proxy::dist()
calls the
version in the package proxy
.
2.5 Data Exploration*
The following code covers the important part of data exploration. For space reasons, this chapter was moved from the printed textbook to this Data Exploration Web Chapter.
We will use again the iris dataset.
library(tidyverse)
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
2.5.1 Basic statistics
Get summary statistics (using base R)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Get mean and standard deviation for sepal length.
iris |>
summarize(avg_Sepal.Length = mean(Sepal.Length),
sd_Sepal.Length = sd(Sepal.Length))
## # A tibble: 1 × 2
## avg_Sepal.Length sd_Sepal.Length
## <dbl> <dbl>
## 1 5.84 0.828
Data with missing values will result in statistics of NA
. Adding the
parameter na.rm = TRUE
can be used in most statistics functions to
ignore missing values.
Outliers are typically the smallest or the largest values of a feature. To make the mean more robust against outliers, we can trim 10% of observations from each end of the distribution.
iris |>
summarize(
avg_Sepal.Length = mean(Sepal.Length),
trimmed_avg_Sepal.Length = mean(Sepal.Length, trim = .1)
)
## # A tibble: 1 × 2
## avg_Sepal.Length trimmed_avg_Sepal.Length
## <dbl> <dbl>
## 1 5.84 5.81
Sepal length does not have outliers, so the trimmed mean is almost identical.
To calculate a summary for a set of features (e.g., all numeric
features), tidyverse provides across(where(is.numeric), fun)
.
iris |> summarize(across(where(is.numeric), mean))
## # A tibble: 1 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.84 3.06 3.76 1.20
iris |> summarize(across(where(is.numeric), sd))
## # A tibble: 1 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 0.828 0.436 1.77 0.762
iris |> summarize(across(where(is.numeric),
list(min = min,
median = median,
max = max)))
## # A tibble: 1 × 12
## Sepal.Length_min Sepal.Length_median Sepal.Length_max
## <dbl> <dbl> <dbl>
## 1 4.3 5.8 7.9
## # ℹ 9 more variables: Sepal.Width_min <dbl>,
## # Sepal.Width_median <dbl>, Sepal.Width_max <dbl>,
## # Petal.Length_min <dbl>, Petal.Length_median <dbl>,
## # Petal.Length_max <dbl>, Petal.Width_min <dbl>,
## # Petal.Width_median <dbl>, Petal.Width_max <dbl>
The median absolute deviation (MAD) is another measure of dispersion.
2.5.2 Grouped Operations and Calculations
We can use the nominal feature to form groups and then calculate group-wise statistics for the continuous features. We often use group-wise averages to see if they differ between groups.
iris |>
group_by(Species) |>
summarize(across(Sepal.Length, mean))
## # A tibble: 3 × 2
## Species Sepal.Length
## <fct> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
iris |>
group_by(Species) |>
summarize(across(where(is.numeric), mean))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versico… 5.94 2.77 4.26 1.33
## 3 virgini… 6.59 2.97 5.55 2.03
We see that the species Virginica has the highest average for all, but Sepal.Width.
The statistical difference between the groups can be tested using ANOVA (analysis of variance).
res.aov <- aov(Sepal.Length ~ Species, data = iris)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.2 31.61 119 <2e-16 ***
## Residuals 147 39.0 0.27
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862 1.1738 0
## virginica-setosa 1.582 1.3382 1.8258 0
## virginica-versicolor 0.652 0.4082 0.8958 0
The summary shows that there is a significant difference for
Sepal.Length between the groups. TukeyHDS
evaluates differences
between pairs of groups. In this case, all are significantly different.
If the data only contains two groups, the t.test
can be used.
2.5.3 Tabulate data
We can count the number of flowers for each species.
iris |>
group_by(Species) |>
summarize(n())
## # A tibble: 3 × 2
## Species `n()`
## <fct> <int>
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
In base R, this can be also done using count(iris$Species)
.
For the following examples, we discretize the data using cut.
iris_ord <- iris |>
mutate(across(where(is.numeric),
function(x) cut(x, 3, labels = c("short", "medium", "long"),
ordered = TRUE)))
iris_ord
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <ord> <ord> <ord> <ord> <fct>
## 1 short medium short short setosa
## 2 short medium short short setosa
## 3 short medium short short setosa
## 4 short medium short short setosa
## 5 short medium short short setosa
## 6 short long short short setosa
## 7 short medium short short setosa
## 8 short medium short short setosa
## 9 short medium short short setosa
## 10 short medium short short setosa
## # ℹ 140 more rows
summary(iris_ord)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## short :59 short :47 short :50 short :50
## medium:71 medium:88 medium:54 medium:54
## long :20 long :15 long :46 long :46
## Species
## setosa :50
## versicolor:50
## virginica :50
Cross tabulation is used to find out if two discrete features are related.
tbl <- iris_ord |>
select(Sepal.Length, Species) |>
table()
tbl
## Species
## Sepal.Length setosa versicolor virginica
## short 47 11 1
## medium 3 36 32
## long 0 3 17
The table contains the number of rows that contain the combination of values (e.g., the number of flowers with a short Sepal.Length and species Setosa is 47). If a few cells have very large counts and most others have very low counts, then there might be a relationship. For the iris data, we see that species Setosa has mostly a short Sepal.Length, while Versicolor and Virginica have longer sepals.
Creating a cross table with tidyverse is a little more involved and uses pivot operations and grouping.
iris_ord |>
select(Species, Sepal.Length) |>
### Relationship Between Nominal and Ordinal Features
pivot_longer(cols = Sepal.Length) |>
group_by(Species, value) |>
count() |>
ungroup() |>
pivot_wider(names_from = Species, values_from = n)
## # A tibble: 3 × 4
## value setosa versicolor virginica
## <ord> <int> <int> <int>
## 1 short 47 11 1
## 2 medium 3 36 32
## 3 long NA 3 17
We can use a statistical test to determine if there is a significant relationship between the two features. Pearson’s chi-squared test for independence is performed with the null hypothesis that the joint distribution of the cell counts in a 2-dimensional contingency table is the product of the row and column marginals. The null hypothesis h0 is independence between rows and columns.
tbl |>
chisq.test()
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 112, df = 4, p-value <2e-16
The small p-value indicates that the null hypothesis of independence needs to be rejected. For small counts (cells with counts <5), Fisher’s exact test is better.
fisher.test(tbl)
##
## Fisher's Exact Test for Count Data
##
## data: tbl
## p-value <2e-16
## alternative hypothesis: two.sided
2.5.4 Percentiles (Quantiles)
Quantiles are cutting points dividing the range of a probability distribution into continuous intervals with equal probability. For example, the median is the empirical 50% quantile dividing the observations into 50% of the observations being smaller than the median and the other 50% being larger than the median.
By default quartiles are calculated. 25% is typically called Q1, 50% is called Q2 or the median and 75% is called Q3.
The interquartile range is a measure for variability that is robust against outliers. It is defined the length Q3 - Q2 which covers the 50% of the data in the middle.
2.5.5 Correlation
2.5.5.1 Pearson Correlation
Correlation can be used for ratio/interval scaled features. We typically think of the Pearson correlation coefficient between features (columns).
cc <- iris |>
select(-Species) |>
cor()
cc
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1176 0.8718
## Sepal.Width -0.1176 1.0000 -0.4284
## Petal.Length 0.8718 -0.4284 1.0000
## Petal.Width 0.8179 -0.3661 0.9629
## Petal.Width
## Sepal.Length 0.8179
## Sepal.Width -0.3661
## Petal.Length 0.9629
## Petal.Width 1.0000
cor
calculates a correlation matrix with pairwise correlations between
features. Correlation matrices are symmetric, but different to
distances, the whole matrix is stored.
The correlation between Petal.Length and Petal.Width can be visualized using a scatter plot.
ggplot(iris, aes(Petal.Length, Petal.Width)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
geom_smooth
adds a regression line by fitting a linear model (lm
).
Most points are close to this line indicating strong linear dependence
(i.e., high correlation).
We can calculate individual correlations by specifying two vectors.
Note: with
lets you use columns using just their names and
with(iris, cor(Petal.Length, Petal.Width))
is the same as
cor(iris$Petal.Length, iris$Petal.Width)
.
Finally, we can test if a correlation is significantly different from zero.
with(iris, cor.test(Petal.Length, Petal.Width))
##
## Pearson's product-moment correlation
##
## data: Petal.Length and Petal.Width
## t = 43, df = 148, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9491 0.9730
## sample estimates:
## cor
## 0.9629
A small p-value (less than 0.05) indicates that the observed correlation is significantly different from zero. This can also be seen by the fact that the 95% confidence interval does not span zero.
Sepal.Length and Sepal.Width show little correlation:
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
with(iris, cor(Sepal.Length, Sepal.Width))
## [1] -0.1176
with(iris, cor.test(Sepal.Length, Sepal.Width))
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Sepal.Width
## t = -1.4, df = 148, p-value = 0.2
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269 0.04351
## sample estimates:
## cor
## -0.1176
2.5.5.2 Rank Correlation
Rank correlation is used for ordinal features or if the correlation is
not linear. To show this, we first convert the continuous features in
the Iris dataset into ordered factors (ordinal) with three levels using
the function cut
.
iris_ord <- iris |>
mutate(across(where(is.numeric),
function(x) cut(x, 3,
labels = c("short", "medium", "long"),
ordered = TRUE)))
iris_ord
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <ord> <ord> <ord> <ord> <fct>
## 1 short medium short short setosa
## 2 short medium short short setosa
## 3 short medium short short setosa
## 4 short medium short short setosa
## 5 short medium short short setosa
## 6 short long short short setosa
## 7 short medium short short setosa
## 8 short medium short short setosa
## 9 short medium short short setosa
## 10 short medium short short setosa
## # ℹ 140 more rows
summary(iris_ord)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## short :59 short :47 short :50 short :50
## medium:71 medium:88 medium:54 medium:54
## long :20 long :15 long :46 long :46
## Species
## setosa :50
## versicolor:50
## virginica :50
iris_ord |>
pull(Sepal.Length)
## [1] short short short short short short short
## [8] short short short short short short short
## [15] medium medium short short medium short short
## [22] short short short short short short short
## [29] short short short short short short short
## [36] short short short short short short short
## [43] short short short short short short short
## [50] short long medium long short medium medium
## [57] medium short medium short short medium medium
## [64] medium medium medium medium medium medium medium
## [71] medium medium medium medium medium medium long
## [78] medium medium medium short short medium medium
## [85] short medium medium medium medium short short
## [92] medium medium short medium medium medium medium
## [99] short medium medium medium long medium medium
## [106] long short long medium long medium medium
## [113] long medium medium medium medium long long
## [120] medium long medium long medium medium long
## [127] medium medium medium long long long medium
## [134] medium medium long medium medium medium long
## [141] medium long medium long medium medium medium
## [148] medium medium medium
## Levels: short < medium < long
Two measures for rank correlation are Kendall’s Tau and Spearman’s Rho.
Kendall’s Tau Rank Correlation Coefficient measures the agreement between two rankings (i.e., ordinal features).
iris_ord |>
select(-Species) |>
sapply(xtfrm) |>
cor(method = "kendall")
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1438 0.7419
## Sepal.Width -0.1438 1.0000 -0.3299
## Petal.Length 0.7419 -0.3299 1.0000
## Petal.Width 0.7295 -0.3154 0.9198
## Petal.Width
## Sepal.Length 0.7295
## Sepal.Width -0.3154
## Petal.Length 0.9198
## Petal.Width 1.0000
Note: We have to use xtfrm
to transform the ordered factors into
ranks, i.e., numbers representing the order.
Spearman’s Rho is equal to the Pearson correlation between the rank values of those two features.
iris_ord |>
select(-Species) |>
sapply(xtfrm) |>
cor(method = "spearman")
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1570 0.7938
## Sepal.Width -0.1570 1.0000 -0.3663
## Petal.Length 0.7938 -0.3663 1.0000
## Petal.Width 0.7843 -0.3517 0.9399
## Petal.Width
## Sepal.Length 0.7843
## Sepal.Width -0.3517
## Petal.Length 0.9399
## Petal.Width 1.0000
Spearman’s Rho is much faster to compute on large datasets then Kendall’s Tau.
Comparing the rank correlation results with the Pearson correlation on the original data shows that they are very similar. This indicates that discretizing data does not result in the loss of too much information.
iris |>
select(-Species) |>
cor()
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1176 0.8718
## Sepal.Width -0.1176 1.0000 -0.4284
## Petal.Length 0.8718 -0.4284 1.0000
## Petal.Width 0.8179 -0.3661 0.9629
## Petal.Width
## Sepal.Length 0.8179
## Sepal.Width -0.3661
## Petal.Length 0.9629
## Petal.Width 1.0000
2.5.6 Density
Density estimation estimate the probability density function (distribution) of a continuous variable from observed data.
Just plotting the data using points is not very helpful for a single feature.
ggplot(iris, aes(x = Petal.Length, y = 0)) + geom_point()
2.5.6.1 Histograms
A histograms shows more about
the distribution by counting how many values fall within a bin and
visualizing the counts as a bar chart. We use geom_rug
to place marks
for the original data points at the bottom of the histogram.
ggplot(iris, aes(x = Petal.Length)) +
geom_histogram() +
geom_rug(alpha = 1/2)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Two-dimensional distributions can be visualized using 2-d binning or hexagonal bins.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_bin2d(bins = 10) +
geom_jitter(color = "red")
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_hex(bins = 10) +
geom_jitter(color = "red")
2.5.6.2 Kernel Density Estimate (KDE)
Kernel density
estimation is
used to estimate the probability density function (distribution) of a
feature. It works by replacing each value with a kernel function (often
a Gaussian) and then adding them up. The result is an estimated
probability density function that looks like a smoothed version of the
histogram. The bandwidth (bw
) of the kernel controls the amount of
smoothing.
ggplot(iris, aes(Petal.Length)) +
geom_density(bw = .2) +
geom_rug(alpha = 1/2)
Kernel density estimates can also be done in two dimensions.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_density_2d_filled() +
geom_jitter()
2.6 Visualization
Visualization uses several components to convey information:
- Symbols: Circles, dots, lines, bars, …
- Position: Axes (labels and units) and the origin (0/0) are important.
- Length, Size and Area: Should only be used if it faithfully represents information.
- Color: A color should not overpower another. Limit to 3 or 4 colors.
- Angle: Human eyes are not good at comparing angles!
All components of the plot need to convey information. E.g., do not use color just to make it more colorful. All good visualizations show the important patterns clearly.
For space reasons, the chapter on data exploration and visualization was moved from the printed textbook and can now be found in the Data Exploration Web Chapter.
For visualization in R, we will use mainly ggplot2
. The gg in ggplot2
stands for Grammar of Graphics introduced by Wilkinson (2005). The
main idea is that every graph is built from the same basic components:
- the data,
- a coordinate system, and
- visual marks representing the data (geoms).
In ggplot2
, the components are combined using the +
operator.
ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) +
geom_point()
Since we typically use a Cartesian coordinate system, ggplot
uses that
by default. Each geom_
function uses a stat_
function to calculate
what is visualizes. For example, geom_bar
uses stat_count
to create
a bar chart by counting how often each value appears in the data (see
? geom_bar
). geom_point
just uses the stat "identity"
to display
the points using the coordinates as they are.
Additional components like the main title, axis labels, and different scales can be also added. A great introduction can be found in the Section on Data Visualization (Wickham, Çetinkaya-Rundel, and Grolemund 2023), and very useful is RStudio’s Data Visualization Cheatsheet.
Next, we go through the basic visualizations used when working with data.
2.6.1 Histogram
Histograms show the distribution of a single continuous feature.
ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)
If the data contains groups, then the group information can be easily added as an aesthetic to the histogram.
ggplot(iris, aes(Petal.Width)) +
geom_histogram(bins = 20, aes(fill = Species))
The bars appear stacked with a green blocks on top pf blue blocks.
To display the three distributions behind each other, we change position
for placement and make the bars slightly translucent using alpha
.
ggplot(iris, aes(Petal.Width)) +
geom_histogram(bins = 20, aes(fill = Species), alpha = .5, position = 'identity')
2.6.2 Boxplot
Boxplots are used to compare the distribution of a feature between different groups. The horizontal line in the middle of the boxes are the group-wise medians, the boxes span the interquartile range. The whiskers (vertical lines) span typically 1.4 times the interquartile range. Points that fall outside that range are typically outliers shown as dots.
ggplot(iris, aes(Species, Sepal.Length)) +
geom_boxplot()
The group-wise medians can also be calculated directly.
iris |> group_by(Species) |>
summarize(across(where(is.numeric), median))
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5 3.4 1.5 0.2
## 2 versico… 5.9 2.8 4.35 1.3
## 3 virgini… 6.5 3 5.55 2
To compare the distribution of the four features using a ggplot boxplot, we first have to transform the data into long format (i.e., all feature values are combined into a single column).
library(tidyr)
iris_long <- iris |>
mutate(id = row_number()) |>
pivot_longer(1:4)
ggplot(iris_long, aes(name, value)) +
geom_boxplot() +
labs(y = "Original value")
This visualization is only useful if all features have roughly the same range. The data can be scaled first to compare the distributions.
library(tidyr)
iris_long_scaled <- iris |>
scale_numeric() |>
mutate(id = row_number()) |> pivot_longer(1:4)
ggplot(iris_long_scaled, aes(name, value)) +
geom_boxplot() +
labs(y = "Scaled value")
2.6.3 Scatter plot
Scatter plots show the relationship between two continuous features.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species))
We can add a regression using geom_smooth
with the
linear model method to show that there is a
linear relationship between the two variables. A confidence interval
for the regression is also shown.
This can be suppressed using se = FALSE
.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species)) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
We can also perform group-wise linear regression by adding the color
aesthetic also to geom_smooth
.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
geom_point(aes(color = Species)) +
geom_smooth(method = "lm", aes(color = Species))
## `geom_smooth()` using formula = 'y ~ x'
The same can be achieved by using the color aesthetic in the qqplot
call,
then it applies to all geoms.
2.6.4 Scatter Plot Matrix
A scatter plot matrix show the relationship between all pairs of features by arranging panels in a matrix. First, lets look at a regular R-base plot.
pairs(iris, col = iris$Species)
The package GGally
provides a way more sophisticated visualization.
library("GGally")
ggpairs(iris, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Additional plots (histograms, density estimates and box plots) and correlation coefficients are shown in different panels. See the Data Quality section for a description of how to interpret the different panels.
2.6.5 Matrix Visualization
Matrix visualization shows the values in the matrix using a color scale.
We need the long format for tidyverse.
iris_long <- as_tibble(iris_matrix) |>
mutate(id = row_number()) |>
pivot_longer(1:4)
head(iris_long)
## # A tibble: 6 × 3
## id name value
## <int> <chr> <dbl>
## 1 1 Sepal.Length 5.1
## 2 1 Sepal.Width 3.5
## 3 1 Petal.Length 1.4
## 4 1 Petal.Width 0.2
## 5 2 Sepal.Length 4.9
## 6 2 Sepal.Width 3
ggplot(iris_long, aes(x = name, y = id)) +
geom_tile(aes(fill = value))
Smaller values are darker. Package seriation
provides a simpler
plotting function.
library(seriation)
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
##
## Attaching package: 'seriation'
## The following object is masked from 'package:lattice':
##
## panel.lines
ggpimage(iris_matrix, prop = FALSE)
We can scale the features to z-scores to make them better comparable.
This reveals red and blue blocks. Each row is a flower and the flowers in the Iris dataset are sorted by species. The blue blocks for the top 50 flowers show that these flowers are smaller than average for all but Sepal.Width and the red blocks show that the bottom 50 flowers are larger for most features.
Often, reordering data matrices help with visualization. A reordering technique is called seriation. Ir reorders rows and columns to place more similar points closer together.
We see that the rows (flowers) are organized from very blue to very red and the features are reordered to move Sepal.Width all the way to the right because it is very different from the other features.
2.6.6 Correlation Matrix
A correlation matrix contains the correlation between features.
cm1 <- iris |>
select(-Species) |>
as.matrix() |>
cor()
cm1
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Length 1.0000 -0.1176 0.8718
## Sepal.Width -0.1176 1.0000 -0.4284
## Petal.Length 0.8718 -0.4284 1.0000
## Petal.Width 0.8179 -0.3661 0.9629
## Petal.Width
## Sepal.Length 0.8179
## Sepal.Width -0.3661
## Petal.Length 0.9629
## Petal.Width 1.0000
Package ggcorrplot
provides a visualization for correlation matrices.
library(ggcorrplot)
ggcorrplot(cm1)
Package seriation
provides a reordered version for this plot using a
heatmap.
gghmap(cm1, prop = TRUE)
Correlations can also be calculates between objects by transposing the data matrix.
cm2 <- iris |>
select(-Species) |>
as.matrix() |>
t() |>
cor()
ggcorrplot(cm2)
Object-to-object correlations can be used as a measure of similarity. The dark red blocks indicate different species.
2.6.7 Parallel Coordinates Plot
Parallel coordinate plots can visualize several features in a single plot. Lines connect the values for each object (flower).
library(GGally)
ggparcoord(iris, columns = 1:4, groupColumn = 5)
The plot can be improved by reordering the variables to place correlated features next to each other.
o <- seriate(as.dist(1-cor(iris[,1:4])), method = "BBURCG")
get_order(o)
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 3 4 1 2
ggparcoord(iris,
columns = as.integer(get_order(o)),
groupColumn = 5)
2.6.8 Star Plot
Star plots are a type of radar chart to visualize three or more quantitative variables represented on axes starting from the plot’s origin. R-base offers a simple star plot. We plot the first 5 flowers of each species.
flowers_5 <- iris[c(1:5, 51:55, 101:105), ]
flowers_5
## # A tibble: 15 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 7 3.2 4.7 1.4 versic…
## 7 6.4 3.2 4.5 1.5 versic…
## 8 6.9 3.1 4.9 1.5 versic…
## 9 5.5 2.3 4 1.3 versic…
## 10 6.5 2.8 4.6 1.5 versic…
## 11 6.3 3.3 6 2.5 virgin…
## 12 5.8 2.7 5.1 1.9 virgin…
## 13 7.1 3 5.9 2.1 virgin…
## 14 6.3 2.9 5.6 1.8 virgin…
## 15 6.5 3 5.8 2.2 virgin…
stars(flowers_5[, 1:4], ncol = 5)
2.6.9 More Visualizations
A well organized collection of visualizations with code can be found at The R Graph Gallery.
2.7 Exercises*
The R package palmerpenguins contains measurements for penguin of different species from the Palmer Archipelago, Antarctica. Install the package. It provides a CSV file which can be read in the following way:
library("palmerpenguins")
penguins <- read_csv(path_to_file("penguins.csv"))
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_m...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm
## <chr> <chr> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7
## 2 Adelie Torgersen 39.5 17.4
## 3 Adelie Torgersen 40.3 18
## 4 Adelie Torgersen NA NA
## 5 Adelie Torgersen 36.7 19.3
## 6 Adelie Torgersen 39.3 20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, year <dbl>
Create in RStudio a new R Markdown document. Apply the code in the sections of this chapter to the data set to answer the following questions.
- What is the scale of measurement for each column?
- Are there missing values in the data? How much data is missing?
- Compute and discuss basic statistics.
- Group the penguins by species, island or sex. What can you find out?
- Can you identify correlations?
- Apply histograms, boxplots, scatterplots and correlation matrix visualization. What do the visualizations show.
Make sure your markdown document contains now a well formatted report.
Use the Knit
button to create a HTML document.