2 Data

Data for data mining is typically organized in tabular form, with rows containing the objects of interest and columns representing attributes describing the objects. We will discuss topics like data quality, sampling, feature selection, and how to measure similarities between objects and features. The second part of this chapter deals with data exploration and visualization.

Packages Used in this Chapter

pkgs <- c("arules", "caret", "factoextra", "GGally", 
          "ggcorrplot", "hexbin", "palmerpenguins", "plotly", 
          "proxy", "seriation", "tidyverse")

pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The packages used for this chapter are:

2.1 Types of Data

2.1.1 Attributes and Measurement

The values of features can be measured on several scales ranging from simple labels all the way to numbers. The scales come in four levels.

Scale Name Description Operations Statistics R
Nominal just a label (e.g., red, green) \(==, !=\) counts factor
Ordinal label with order (e.g., small, med., large) \(<, >\) median ordered factor
Interval difference between two values is meaningful (regular number) \(+, -\) mean, sd numeric
Ratio has a natural zero (e.g., count, distance) \(/, *\) percent numeric

The scales build on each other meaning that an ordinal variable also has the characteristics of a nominal variable with the added order information. We often do not differentiate between interval and ratio scale because we rarely not need to calculate percentages or other statistics that require a meaningful zero value.

Nominal data is created using factor(). If the factor levels are not specified, then they are created in alphabetical order.

factor(c("red", "green", "green", "blue"))
## [1] red   green green blue 
## Levels: blue green red

Ordinal data is created using ordered(). The levels specify the order.

ordered(c("S", "L", "M", "S"), 
       levels = c("S", "M", "L"))
## [1] S L M S
## Levels: S < M < L

Ratio/interval data is created as a simple vector.

c(1, 2, 3, 4, 3, 3)
## [1] 1 2 3 4 3 3

2.1.2 The Iris Dataset

We will use a toy dataset that comes with R. Fisher’s iris dataset gives the measurements in centimeters of the variables sepal length, sepal width petal length, and petal width representing the features for 150 flowers (the objects). The dataset contains 50 flowers from each of 3 species of iris. The species are Iris Setosa, Iris Versicolor, and Iris Virginica. For more details see ? iris.

We load the iris data set. Datasets that come with R or R packages can be loaded with data(). The standard format for data in R is a data.frame. We convert the data.frame into a tidyverse tibble.

library(tidyverse)
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

We see that the data contains 150 rows (flowers) and 5 features. tibbles only show the first few rows and do not show all features, if they do not fit the screen width. We can call print and define how many rows to show using parameter n and force print to show all features by changing the width to infinity.

print(iris, n = 3, width = Inf)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## # ℹ 147 more rows

2.2 Data Quality

Assessing the quality of the available data is crucial before we start using the data. Start with summary statistics for each column to identify outliers and missing values. The easiest way is to use the base R function summary().

summary(iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

You can also summarize individual columns using tidyverse’s dplyr functions.

iris |> 
  summarize(mean = mean(Sepal.Length))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  5.84

Using across(), multiple columns can be summarized. Un the following, we calculate all numeric columns using the mean function.

iris |> 
  summarize(across(where(is.numeric), mean))
## # A tibble: 1 × 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1         5.84        3.06         3.76        1.20

To find outliers or data problems, you need to look for very small values (often a suspicious large number of zeros) using min and for extremely large values using max. Comparing median and mean tells us if the distribution is symmetric.

A visual method to inspect the data is to use a scatterplot matrix (we use here ggpairs() from package GGally). In this plot, we can visually identify noise data points and outliers (points that are far from the majority of other points).

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(iris, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

This useful visualization combines many visualizations used to understand the data and check for quality issues. Rows and columns are the features in the data. We have also specified the aesthetic that we want to group each species using a different color.

  • The visualizations in the diagonal panels show the smoothed histograms with the distribution for each feature. The plot tries to pick a good number of bins for the histogram (see messages above). The distribution can be checked if it is close to normal, unimodal or highly skewed. Also, we can see if the different groups are overlapping or separable for each feature. For example, the three distributions for Sepal.Width are almost identical meaning that it is hard to distinguish between the different species using this feature alone. Petal.Lenght and Petal.Width are much better.

  • The lower-left triangle panels contain scatterplots for all pairs features. These are useful to see if there if features are correlated (the pearson correlation coefficient if printed in the upper-right triangle). For example, Petal.Length and Petal.Width are highly correlated overall This makes sense since larger plants will have both longer and wider petals. Inside the Setosa group this correlation it is a lot weaker. We can also see if groups are well separated using projections on two variables. Almost all panels show that Setosa forms a point cloud well separated from the other two classes while Versicolor and Virginica overlap. We can also see outliers that are far from the other data points in its group. See if you can spot the one red dot that is far away from all others.

  • The last row/column represents in this data set the class label Species. It is a nominal variable so the plots are different. The bottom row panels show (regular) histograms. The last column shows boxplots to represent the distribution of the different features by group. Dots represent outliers. Finally, the bottom-right panel contains the counts for the different groups as a barplot. In this data set, each group has the same number of observations.

Many data mining methods require complete data, that is the data cannot contain missing values (NA). To remove missing values and duplicates (identical data points which might be a mistake in the data), we often do this:

clean.data <- iris |> 
  drop_na() |> 
  unique()

summary(clean.data)
##   Sepal.Length   Sepal.Width    Petal.Length 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60  
##  Median :5.80   Median :3.00   Median :4.30  
##  Mean   :5.84   Mean   :3.06   Mean   :3.75  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90  
##   Petal.Width         Species  
##  Min.   :0.10   setosa    :50  
##  1st Qu.:0.30   versicolor:50  
##  Median :1.30   virginica :49  
##  Mean   :1.19                  
##  3rd Qu.:1.80                  
##  Max.   :2.50

Note that one non-unique case is gone leaving only 149 flowers. The data did not contain missing values, but if it did, they would also have been dropped. Typically, you should spend a lot more time on data cleaning.

2.3 Data Preprocessing

2.3.1 Aggregation

Data often contains groups and we want to compare these groups. We group the iris dataset by species and then calculate a summary statistic for each group.

iris |> 
  group_by(Species) |> 
  summarize(across(everything(), mean))
## # A tibble: 3 × 5
##   Species  Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>           <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa           5.01        3.43         1.46       0.246
## 2 versico…         5.94        2.77         4.26       1.33 
## 3 virgini…         6.59        2.97         5.55       2.03
iris |> 
  group_by(Species) |> 
  summarize(across(everything(), median))
## # A tibble: 3 × 5
##   Species  Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>           <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa            5           3.4         1.5          0.2
## 2 versico…          5.9         2.8         4.35         1.3
## 3 virgini…          6.5         3           5.55         2

Using this information, we can compare how features differ between groups.

2.3.2 Sampling

Sampling is often used in data mining to reduce the dataset size before modeling or visualization.

2.3.2.1 Random Sampling

The built-in sample function can sample from a vector. Here we sample with replacement.

sample(c("A", "B", "C"), size = 10, replace = TRUE)
##  [1] "A" "A" "C" "C" "A" "B" "C" "B" "A" "C"

We often want to sample rows from a dataset. This can be done by sampling without replacement from a vector with row indices (using the functions seq() and nrow()). The sample vector is then used to subset the rows of the dataset.

take <- sample(seq(nrow(iris)), size = 15)
take
##  [1]  20  30  60   9  34  71  46  50  78   1  66 144  86  15
## [15] 143
iris[take, ]
## # A tibble: 15 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.8          1.5         0.3 setosa 
##  2          4.7         3.2          1.6         0.2 setosa 
##  3          5.2         2.7          3.9         1.4 versic…
##  4          4.4         2.9          1.4         0.2 setosa 
##  5          5.5         4.2          1.4         0.2 setosa 
##  6          5.9         3.2          4.8         1.8 versic…
##  7          4.8         3            1.4         0.3 setosa 
##  8          5           3.3          1.4         0.2 setosa 
##  9          6.7         3            5           1.7 versic…
## 10          5.1         3.5          1.4         0.2 setosa 
## 11          6.7         3.1          4.4         1.4 versic…
## 12          6.8         3.2          5.9         2.3 virgin…
## 13          6           3.4          4.5         1.6 versic…
## 14          5.8         4            1.2         0.2 setosa 
## 15          5.8         2.7          5.1         1.9 virgin…

dplyr from tidyverse lets us sample rows from tibbles directly using slice_sample(). I set the random number generator seed to make the results reproducible.

set.seed(1000)
s <- iris |> 
  slice_sample(n = 15)

library(GGally)
ggpairs(s, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Instead of n you can also specify the proportion of rows to select using prob.

2.3.2.2 Stratified Sampling

Stratified sampling is a method of sampling from a population which can be partitioned into subpopulations, while controlling the proportions of the subpopulation in the resulting sample.

In the following, the subpopulations are the different types of species and we want to make sure to sample the same number (5) flowers from each. This can be achieved by first grouping the data by species and then sampling a number of flowers from each group.

set.seed(1000)

s2 <- iris |> 
  group_by(Species) |>
  slice_sample(n = 5) |>
  ungroup()

library(GGally)
ggpairs(s2, aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

More sophisticated sampling procedures are implemented in the package sampling.

2.3.3 Dimensionality Reduction

The number of features is often called the dimensional of the data following the idea that each feature (at least the numeric features) can be seen as an axis of the data. High-dimensional data is harder to analyze by the user (e.g., visualize). It also is problematic for many data mining algorithms since it requires more memory and computational resources.

Dimensionality reduction tries to represent high-dimensional data in a low-dimensional space so that the low-dimensional representation retains some meaningful properties (e.g., information about similarity or distances) of the original data. Dimensionality reduction is used for visualization and as a prepossessing technique before using data mining models.

A special case of dimensionality reduction is feature selection (see Feature Selection* in Chapter 3).

2.3.3.1 Principal Components Analysis (PCA)

PCA calculates principal components (a set of new orthonormal basis vectors in the data space) from data points such that the first principal component explains the most variability in the data, the second the next most and so on. In data analysis, PCA is used to project high-dimensional data points onto the first few (typically two) principal components for visualization as a scatter plot and as preprocessing for modeling (e.g., before k-means clustering). Points that are closer together in the high-dimensional original space, tend also be closer together when projected into the lower-dimensional space,

We can use an interactive 3-d plot (from package plotly) to look at three of the four dimensions of the iris dataset. Note that it is hard to visualize more than 3 dimensions.

plotly::plot_ly(iris, 
                x = ~Sepal.Length, 
                y = ~Petal.Length, 
                z = ~Sepal.Width, 
      color = ~Species, size = 1) |> 
  plotly::add_markers()

The principal components can be calculated from a matrix using the function prcomp(). We select all numeric columns (by unselecting the species column) and convert the tibble into a matrix before the calculation.

pc <- iris |> 
  select(-Species) |> 
  as.matrix() |> 
  prcomp()
summary(pc)
## Importance of components:
##                          PC1    PC2    PC3     PC4
## Standard deviation     2.056 0.4926 0.2797 0.15439
## Proportion of Variance 0.925 0.0531 0.0171 0.00521
## Cumulative Proportion  0.925 0.9777 0.9948 1.00000

How important is each principal component can also be seen using a scree plot. The plot function for the result of the prcomp function visualizes how much variability in the data is explained by each additional principal component.

plot(pc, type = "line")

Note that the first principal component (PC1) explains most of the variability in the iris dataset.

To find out what information is stored in the object pc, we can inspect the raw object (display structure).

str(pc)
## List of 5
##  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
##  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ scale   : logi FALSE
##  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"

The object pc (like most objects in R) is a list with a class attribute. The list element x contains the data points projected on the principal components. We can convert the matrix into a tibble and add the species column from the original dataset back (since the rows are in the same order), and then display the data projected on the first two principal components.

iris_projected <- as_tibble(pc$x) |> 
  add_column(Species = iris$Species)

ggplot(iris_projected, aes(x = PC1, y = PC2, color = Species)) + 
  geom_point()

Flowers that are displayed close together in this projection are also close together in the original 4-dimensional space. Since the first principal component represents most of the variability, we can also show the data projected only on PC1.

ggplot(iris_projected, 
  aes(x = PC1, y = 0, color = Species)) + 
  geom_point() +
  scale_y_continuous(expand=c(0,0)) +
  theme(axis.text.y = element_blank(),
      axis.title.y = element_blank()
  )

We see that we can perfectly separate the species Setosa using just the first principal component. The other two species are harder to separate.

A plot of the projected data with the original axes added as arrows is called a biplot. If the arrows (original axes) align roughly with the axes of the projection, then they are correlated (linearly dependent).

We can also display only the old and new axes.

We see Petal.Width and Petal.Length point in the same direction which indicates that they are highly correlated. They are also roughly aligned with PC1 (called Dim1 in the plot) which means that PC1 represents most of the variability of these two variables. Sepal.Width is almost aligned with the y-axis and therefore it is represented by PC2 (Dim2). Petal.Width/Petal.Length and Sepal.Width are almost at 90 degrees, indicating that they are close to uncorrelated. Sepal.Length is correlated to all other variables and represented by both, PC1 and PC2 in the projection.

There exist other methods to embed data from higher dimensions into a lower-dimensional space. A popular method to project data into lower dimensions for visualization is t-distributed stochastic neighbor embedding (t-SNE) available in package Rtsne.

2.3.3.2 Multi-Dimensional Scaling (MDS)

MDS is similar to PCA. Instead of data points, it starts with pairwise distances (i.e., a distance matrix) and produces a space where points are placed to represent these distances as well as possible. The axes in this space are called components and are similar to the principal components in PCA.

First, we calculate a distance matrix (Euclidean distances) from the 4-d space of the iris dataset.

d <- iris |> 
  select(-Species) |> 
  dist()

Metric (classic) MDS tries to construct a space where points with lower distances are placed closer together. We project the data represented by a distance matrix on k = 2 dimensions.

fit <- cmdscale(d, k = 2)
colnames(fit) <- c("comp1", "comp2")
fit <- as_tibble(fit) |> 
  add_column(Species = iris$Species)

ggplot(fit, aes(x = comp1, y = comp2, color = Species)) + 
  geom_point()

The resulting projection is similar (except for rotation and reflection) to the result of the projection using PCA.

2.3.3.3 Non-Parametric Multidimensional Scaling

Non-parametric multidimensional scaling performs MDS while relaxing the need of linear relationships. Methods are available in package MASS as functions isoMDS() (implements isoMAP) and sammon().

2.3.3.4 Other Nonlinear Dimensionality Reduction Methods

Nonlinear dimensionality reduction is also called manifold learning or creating a low-Dimensional embedding. Many methods are available with varying properties exist. Some popular methods are:

2.3.4 Feature Subset Selection

Feature selection is the process of identifying the features that are used to create a model. We will talk about feature selection when we discuss classification models in Chapter 3 in Feature Selection*.

2.3.5 Discretization

Some data mining methods require discrete data. Discretization converts continuous features into discrete features. As an example, we will discretize the continuous feature Petal.Width. Before we perform discretization, we should look at the distribution and see if it gives us an idea how we should group the continuous values into a set of discrete values. A histogram visualizes the distribution of a single continuous feature.

ggplot(iris, aes(x = Petal.Width)) + 
  geom_histogram(binwidth = .2)

The bins in the histogram represent a discretization using a fixed bin width. The R function cut() performs equal interval width discretization creating a vector of type factor where each level represents an interval.

iris |> 
  pull(Sepal.Width) |> 
  cut(breaks = 3)
##   [1] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##   [6] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [11] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4]
##  [16] (3.6,4.4] (3.6,4.4] (2.8,3.6] (3.6,4.4] (3.6,4.4]
##  [21] (2.8,3.6] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [26] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [31] (2.8,3.6] (2.8,3.6] (3.6,4.4] (3.6,4.4] (2.8,3.6]
##  [36] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [41] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (3.6,4.4]
##  [46] (2.8,3.6] (3.6,4.4] (2.8,3.6] (3.6,4.4] (2.8,3.6]
##  [51] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]  
##  [56] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]  
##  [61] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6]
##  [66] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]  
##  [71] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6]
##  [76] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2,2.8]  
##  [81] (2,2.8]   (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6]
##  [86] (2.8,3.6] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]  
##  [91] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]  
##  [96] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]  
## [101] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [106] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6]
## [111] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]  
## [116] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2,2.8]   (2,2.8]  
## [121] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6]
## [126] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6]
## [131] (2,2.8]   (3.6,4.4] (2,2.8]   (2,2.8]   (2,2.8]  
## [136] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [141] (2.8,3.6] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6]
## [146] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6]
## Levels: (2,2.8] (2.8,3.6] (3.6,4.4]

Other discretization methods include equal frequency discretization or using k-means clustering. These methods are implemented by several R packages. We use here the implementation in package arules and visualize the results as histograms with blue lines to separate intervals assigned to each discrete value.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
iris |> pull(Petal.Width) |> 
  discretize(method = "interval", breaks = 3)
##   [1] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##   [6] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [11] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [16] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [21] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [26] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [31] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [36] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [41] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [46] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [51] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [56] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [61] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [66] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [71] [1.7,2.5] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [76] [0.9,1.7) [0.9,1.7) [1.7,2.5] [0.9,1.7) [0.9,1.7)
##  [81] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [86] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [91] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [96] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
## [101] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [106] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [111] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [116] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7)
## [121] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [126] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7)
## [131] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7) [0.9,1.7)
## [136] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [141] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [146] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## attr(,"discretized:breaks")
## [1] 0.1 0.9 1.7 2.5
## attr(,"discretized:method")
## [1] interval
## Levels: [0.1,0.9) [0.9,1.7) [1.7,2.5]

To show the differences between the methods, we use the three discretization methods and draw blue lines in the histogram to show how they cut the data.

ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
  geom_vline(
      xintercept = iris |> 
      pull(Petal.Width) |> 
      discretize(method = "interval", breaks = 3, onlycuts = TRUE),
    color = "blue"
  ) +
  labs(title = "Discretization: interval", 
       subtitle = "Blue lines are discretization boundaries")
ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
  geom_vline(
    xintercept = iris |> 
    pull(Petal.Width) |> 
    discretize(method = "frequency", breaks = 3, onlycuts = TRUE),
   color = "blue"
  ) +
  labs(title = "Discretization: frequency", 
       subtitle = "Blue lines are discretization boundaries")
ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
  geom_vline(
    xintercept = iris |> 
    pull(Petal.Width) |> 
    discretize(method = "cluster", breaks = 3, onlycuts = TRUE),
   color = "blue"
  ) +
  labs(title = "Discretization: cluster", 
       subtitle = "Blue lines are discretization boundaries")

The user needs to decide on the number of intervals and the used method.

2.3.6 Variable Transformation: Standardization

Standardizing (scaling, normalizing) the range of features values is important to make them comparable. The most popular method is to convert the values of each feature to z-scores. by subtracting the mean (centering) and dividing by the standard deviation (scaling). The standardized feature will have a mean of zero and are measured in standard deviations from the mean. Positive values indicate how many standard deviation the original feature value was above the average. Negative standardized values indicate below-average values.

R-base provides the function scale() to standardize the columns in a data.frame. Tidyverse currently does not have a simple scale function, so we make one It mutates all numeric columns using an anonymous function that calculates the z-score.

scale_numeric <- function(x) 
  x |> 
  mutate(across(where(is.numeric), 
                function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)))
iris.scaled <- iris |> 
  scale_numeric()
iris.scaled
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1       -0.898      1.02          -1.34       -1.31 setosa 
##  2       -1.14      -0.132         -1.34       -1.31 setosa 
##  3       -1.38       0.327         -1.39       -1.31 setosa 
##  4       -1.50       0.0979        -1.28       -1.31 setosa 
##  5       -1.02       1.25          -1.34       -1.31 setosa 
##  6       -0.535      1.93          -1.17       -1.05 setosa 
##  7       -1.50       0.786         -1.34       -1.18 setosa 
##  8       -1.02       0.786         -1.28       -1.31 setosa 
##  9       -1.74      -0.361         -1.34       -1.31 setosa 
## 10       -1.14       0.0979        -1.28       -1.44 setosa 
## # ℹ 140 more rows
summary(iris.scaled)
##   Sepal.Length      Sepal.Width      Petal.Length   
##  Min.   :-1.8638   Min.   :-2.426   Min.   :-1.562  
##  1st Qu.:-0.8977   1st Qu.:-0.590   1st Qu.:-1.222  
##  Median :-0.0523   Median :-0.132   Median : 0.335  
##  Mean   : 0.0000   Mean   : 0.000   Mean   : 0.000  
##  3rd Qu.: 0.6722   3rd Qu.: 0.557   3rd Qu.: 0.760  
##  Max.   : 2.4837   Max.   : 3.080   Max.   : 1.780  
##   Petal.Width           Species  
##  Min.   :-1.442   setosa    :50  
##  1st Qu.:-1.180   versicolor:50  
##  Median : 0.132   virginica :50  
##  Mean   : 0.000                  
##  3rd Qu.: 0.788                  
##  Max.   : 1.706

The standardized feature has a mean of zero and most “normal” values will fall in the range \([-3,3]\) and is measured in standard deviations from the average. Negative values mean smaller than the average and positive values mean larger than the average.

2.4 Measures of Similarity and Dissimilarity

Proximities help with quantifying how similar two objects are. Similariy is a concept from geometry. The best-known way to define similarity is Euclidean distance, but proximities can be measured in different ways depending on the information we have about the objects.

R stores proximity as dissimilarities/distances matrices. Similarities are first converted to dissimilarities. Distances are symmetric, i.e., the distance from A to B is the same as the distance from B to A. R therefore stores only a triangle (typically the lower triangle) of the distance matrix.

2.4.1 Minkowsky Distances

The Minkowsky distance is a family of metric distances including Euclidean and Manhattan distance. To avoid one feature to dominate the distance calculation, scaled data is typically used. We select the first 5 flowers for this example.

iris_sample <- iris.scaled |> 
  select(-Species) |> 
  slice(1:5)
iris_sample
## # A tibble: 5 × 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1       -0.898      1.02          -1.34       -1.31
## 2       -1.14      -0.132         -1.34       -1.31
## 3       -1.38       0.327         -1.39       -1.31
## 4       -1.50       0.0979        -1.28       -1.31
## 5       -1.02       1.25          -1.34       -1.31

Different types of Minkowsky distance matrices between the first 5 flowers can be calculated using dist().

dist(iris_sample, method = "euclidean")
##        1      2      3      4
## 2 1.1723                     
## 3 0.8428 0.5216              
## 4 1.1000 0.4326 0.2829       
## 5 0.2593 1.3819 0.9883 1.2460
dist(iris_sample, method = "manhattan")
##        1      2      3      4
## 2 1.3887                     
## 3 1.2280 0.7570              
## 4 1.5782 0.6484 0.4635       
## 5 0.3502 1.4973 1.3367 1.6868
dist(iris_sample, method = "maximum")
##        1      2      3      4
## 2 1.1471                     
## 3 0.6883 0.4589              
## 4 0.9177 0.3623 0.2294       
## 5 0.2294 1.3766 0.9177 1.1471

We see that only the lower triangle of the distance matrices are stored (note that rows start with row 2).

2.4.2 Distances for Binary Data

Binary data can be encodes as 0 and 1 (numeric) or TRUE and FALSE (logical).

b <- rbind(
  c(0,0,0,1,1,1,1,0,0,1),
  c(0,0,1,1,1,0,0,1,0,0)
  )
b
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    1    1    1    1    0    0     1
## [2,]    0    0    1    1    1    0    0    1    0     0
b_logical <- apply(b, MARGIN = 2, as.logical)
b_logical
##       [,1]  [,2]  [,3] [,4] [,5]  [,6]  [,7]  [,8]  [,9]
## [1,] FALSE FALSE FALSE TRUE TRUE  TRUE  TRUE FALSE FALSE
## [2,] FALSE FALSE  TRUE TRUE TRUE FALSE FALSE  TRUE FALSE
##      [,10]
## [1,]  TRUE
## [2,] FALSE

2.4.2.1 Hamming Distance

The Hamming distance is the number of mismatches between two binary vectors. For 0-1 data this is equivalent to the Manhattan distance and also the squared Euclidean distance.

dist(b, method = "manhattan")
##   1
## 2 5
dist(b, method = "euclidean")^2
##   1
## 2 5

2.4.2.2 Jaccard Index

The Jaccard index is a similarity measure that focuses on matching 1s. R converts the similarity into a dissimilarity using \(d_{J} = 1 - s_{J}\).

dist(b, method = "binary")
##        1
## 2 0.7143

2.4.3 Distances for Mixed Data

Most distance measures work only on numeric data. Often, we have a mixture of numbers and nominal or ordinal features like this data:

people <- tibble(
  height = c(      160,    185,    170),
  weight = c(       52,     90,     75),
  sex    = c( "female", "male", "male")
)
people
## # A tibble: 3 × 3
##   height weight sex   
##    <dbl>  <dbl> <chr> 
## 1    160     52 female
## 2    185     90 male  
## 3    170     75 male

It is important that nominal features are stored as factors and not character (<chr>).

people <- people |> 
  mutate(across(where(is.character), factor))
people
## # A tibble: 3 × 3
##   height weight sex   
##    <dbl>  <dbl> <fct> 
## 1    160     52 female
## 2    185     90 male  
## 3    170     75 male

2.4.3.1 Gower’s Coefficient

The Gower’s coefficient of similarity works with mixed data by calculating the appropriate similarity for each feature and then aggregating them into a single measure. The package proxy implements Gower’s coefficient converted into a distance.

library(proxy)
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
d_Gower <- dist(people, method = "Gower")
d_Gower
##        1      2
## 2 1.0000       
## 3 0.6684 0.3316

Gower’s coefficient calculation implicitly scales the data because it calculates distances on each feature individually, so there is no need to scale the data first.

2.4.3.2 Using Euclidean Distance with Mixed Data

Sometimes methods (e.g., k-means) can only use Euclidean distance. In this case, nominal features can be converted into 0-1 dummy variables. After scaling, Euclidean distance will result in a usable distance measure.

We use package caret to create dummy variables.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
data_dummy <- dummyVars(~., people) |> 
  predict(people)
data_dummy
##   height weight sex.female sex.male
## 1    160     52          1        0
## 2    185     90          0        1
## 3    170     75          0        1

Note that feature sex has now two columns. If we want that height, weight and sex have the same influence on the distance measure, then we need to weight the sex columns by 1/2 after scaling.

weight_matrix <- matrix(c(1, 1, 1/2, 1/2), 
                        ncol = 4, 
                        nrow = nrow(data_dummy), 
                        byrow = TRUE)
data_dummy_scaled <- scale(data_dummy) * weight_matrix

d_dummy <- dist(data_dummy_scaled)
d_dummy
##       1     2
## 2 3.064      
## 3 1.891 1.427

The distance using dummy variables is consistent with Gower’s distance. However, note that Gower’s distance is scaled between 0 and 1 while the Euclidean distance is not.

ggplot(tibble(d_dummy, d_Gower), aes(x = d_dummy, y = d_Gower)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## Don't know how to automatically pick scale for object of
## type <dist>. Defaulting to continuous.
## Don't know how to automatically pick scale for object of
## type <dist>. Defaulting to continuous.
## `geom_smooth()` using formula = 'y ~ x'

2.4.4 More Proximity Measures

The package proxy implements a wide array of proximity measures (similarity measures are converted into distances).

library(proxy)
pr_DB$get_entry_names()
##  [1] "Jaccard"         "Kulczynski1"     "Kulczynski2"    
##  [4] "Mountford"       "Fager"           "Russel"         
##  [7] "simple matching" "Hamman"          "Faith"          
## [10] "Tanimoto"        "Dice"            "Phi"            
## [13] "Stiles"          "Michael"         "Mozley"         
## [16] "Yule"            "Yule2"           "Ochiai"         
## [19] "Simpson"         "Braun-Blanquet"  "cosine"         
## [22] "angular"         "eJaccard"        "eDice"          
## [25] "correlation"     "Chi-squared"     "Phi-squared"    
## [28] "Tschuprow"       "Cramer"          "Pearson"        
## [31] "Gower"           "Euclidean"       "Mahalanobis"    
## [34] "Bhjattacharyya"  "Manhattan"       "supremum"       
## [37] "Minkowski"       "Canberra"        "Wave"           
## [40] "divergence"      "Kullback"        "Bray"           
## [43] "Soergel"         "Levenshtein"     "Podani"         
## [46] "Chord"           "Geodesic"        "Whittaker"      
## [49] "Hellinger"       "fJaccard"

Note that loading the package proxy overwrites the default dist() function in R. You can specify which dist function to use by specifying the package in the call. For example stats::dist() calls the default function in R (the package stats is part of R) while proxy::dist() calls the version in the package proxy.

2.5 Data Exploration*

The following code covers the important part of data exploration. For space reasons, this chapter was moved from the printed textbook to this Data Exploration Web Chapter.

We will use again the iris dataset.

library(tidyverse)
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

2.5.1 Basic statistics

Get summary statistics (using base R)

summary(iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Get mean and standard deviation for sepal length.

iris |> 
  summarize(avg_Sepal.Length = mean(Sepal.Length), 
            sd_Sepal.Length = sd(Sepal.Length))
## # A tibble: 1 × 2
##   avg_Sepal.Length sd_Sepal.Length
##              <dbl>           <dbl>
## 1             5.84           0.828

Data with missing values will result in statistics of NA. Adding the parameter na.rm = TRUE can be used in most statistics functions to ignore missing values.

mean(c(1, 2, NA, 3, 4, 5))
## [1] NA
mean(c(1, 2, NA, 3, 4, 5),  na.rm = TRUE)
## [1] 3

Outliers are typically the smallest or the largest values of a feature. To make the mean more robust against outliers, we can trim 10% of observations from each end of the distribution.

iris |>
  summarize(
    avg_Sepal.Length = mean(Sepal.Length),
    trimmed_avg_Sepal.Length = mean(Sepal.Length, trim = .1)
  )
## # A tibble: 1 × 2
##   avg_Sepal.Length trimmed_avg_Sepal.Length
##              <dbl>                    <dbl>
## 1             5.84                     5.81

Sepal length does not have outliers, so the trimmed mean is almost identical.

To calculate a summary for a set of features (e.g., all numeric features), tidyverse provides across(where(is.numeric), fun).

iris |> summarize(across(where(is.numeric), mean))
## # A tibble: 1 × 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1         5.84        3.06         3.76        1.20
iris |> summarize(across(where(is.numeric), sd))
## # A tibble: 1 × 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1        0.828       0.436         1.77       0.762
iris |> summarize(across(where(is.numeric), 
            list(min = min, 
                 median = median, 
                 max = max)))
## # A tibble: 1 × 12
##   Sepal.Length_min Sepal.Length_median Sepal.Length_max
##              <dbl>               <dbl>            <dbl>
## 1              4.3                 5.8              7.9
## # ℹ 9 more variables: Sepal.Width_min <dbl>,
## #   Sepal.Width_median <dbl>, Sepal.Width_max <dbl>,
## #   Petal.Length_min <dbl>, Petal.Length_median <dbl>,
## #   Petal.Length_max <dbl>, Petal.Width_min <dbl>,
## #   Petal.Width_median <dbl>, Petal.Width_max <dbl>

The median absolute deviation (MAD) is another measure of dispersion.

iris |> summarize(across(where(is.numeric), mad))
## # A tibble: 1 × 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1         1.04       0.445         1.85        1.04

2.5.2 Grouped Operations and Calculations

We can use the nominal feature to form groups and then calculate group-wise statistics for the continuous features. We often use group-wise averages to see if they differ between groups.

iris |> 
  group_by(Species) |> 
  summarize(across(Sepal.Length, mean))
## # A tibble: 3 × 2
##   Species    Sepal.Length
##   <fct>             <dbl>
## 1 setosa             5.01
## 2 versicolor         5.94
## 3 virginica          6.59
iris |> 
  group_by(Species) |> 
  summarize(across(where(is.numeric), mean))
## # A tibble: 3 × 5
##   Species  Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>           <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa           5.01        3.43         1.46       0.246
## 2 versico…         5.94        2.77         4.26       1.33 
## 3 virgini…         6.59        2.97         5.55       2.03

We see that the species Virginica has the highest average for all, but Sepal.Width.

The statistical difference between the groups can be tested using ANOVA (analysis of variance).

res.aov <- aov(Sepal.Length ~ Species, data = iris)
summary(res.aov)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2   63.2   31.61     119 <2e-16 ***
## Residuals   147   39.0    0.27                   
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(res.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
## 
## $Species
##                       diff    lwr    upr p adj
## versicolor-setosa    0.930 0.6862 1.1738     0
## virginica-setosa     1.582 1.3382 1.8258     0
## virginica-versicolor 0.652 0.4082 0.8958     0

The summary shows that there is a significant difference for Sepal.Length between the groups. TukeyHDS evaluates differences between pairs of groups. In this case, all are significantly different. If the data only contains two groups, the t.test can be used.

2.5.3 Tabulate data

We can count the number of flowers for each species.

iris |> 
  group_by(Species) |> 
  summarize(n())
## # A tibble: 3 × 2
##   Species    `n()`
##   <fct>      <int>
## 1 setosa        50
## 2 versicolor    50
## 3 virginica     50

In base R, this can be also done using count(iris$Species).

For the following examples, we discretize the data using cut.

iris_ord <- iris |> 
  mutate(across(where(is.numeric),  
    function(x) cut(x, 3, labels = c("short", "medium", "long"), 
                    ordered = TRUE)))

iris_ord
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##    <ord>        <ord>       <ord>        <ord>       <fct>  
##  1 short        medium      short        short       setosa 
##  2 short        medium      short        short       setosa 
##  3 short        medium      short        short       setosa 
##  4 short        medium      short        short       setosa 
##  5 short        medium      short        short       setosa 
##  6 short        long        short        short       setosa 
##  7 short        medium      short        short       setosa 
##  8 short        medium      short        short       setosa 
##  9 short        medium      short        short       setosa 
## 10 short        medium      short        short       setosa 
## # ℹ 140 more rows
summary(iris_ord)
##  Sepal.Length Sepal.Width Petal.Length Petal.Width
##  short :59    short :47   short :50    short :50  
##  medium:71    medium:88   medium:54    medium:54  
##  long  :20    long  :15   long  :46    long  :46  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50

Cross tabulation is used to find out if two discrete features are related.

tbl <- iris_ord |> 
  select(Sepal.Length, Species) |> 
  table()
tbl
##             Species
## Sepal.Length setosa versicolor virginica
##       short      47         11         1
##       medium      3         36        32
##       long        0          3        17

The table contains the number of rows that contain the combination of values (e.g., the number of flowers with a short Sepal.Length and species Setosa is 47). If a few cells have very large counts and most others have very low counts, then there might be a relationship. For the iris data, we see that species Setosa has mostly a short Sepal.Length, while Versicolor and Virginica have longer sepals.

Creating a cross table with tidyverse is a little more involved and uses pivot operations and grouping.

iris_ord |>
  select(Species, Sepal.Length) |>
### Relationship Between Nominal and Ordinal Features
  pivot_longer(cols = Sepal.Length) |>
  group_by(Species, value) |> 
  count() |> 
  ungroup() |>
  pivot_wider(names_from = Species, values_from = n)
## # A tibble: 3 × 4
##   value  setosa versicolor virginica
##   <ord>   <int>      <int>     <int>
## 1 short      47         11         1
## 2 medium      3         36        32
## 3 long       NA          3        17

We can use a statistical test to determine if there is a significant relationship between the two features. Pearson’s chi-squared test for independence is performed with the null hypothesis that the joint distribution of the cell counts in a 2-dimensional contingency table is the product of the row and column marginals. The null hypothesis h0 is independence between rows and columns.

tbl |> 
  chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 112, df = 4, p-value <2e-16

The small p-value indicates that the null hypothesis of independence needs to be rejected. For small counts (cells with counts <5), Fisher’s exact test is better.

fisher.test(tbl)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tbl
## p-value <2e-16
## alternative hypothesis: two.sided

2.5.4 Percentiles (Quantiles)

Quantiles are cutting points dividing the range of a probability distribution into continuous intervals with equal probability. For example, the median is the empirical 50% quantile dividing the observations into 50% of the observations being smaller than the median and the other 50% being larger than the median.

By default quartiles are calculated. 25% is typically called Q1, 50% is called Q2 or the median and 75% is called Q3.

iris |> 
  pull(Petal.Length) |> 
  quantile()
##   0%  25%  50%  75% 100% 
## 1.00 1.60 4.35 5.10 6.90

The interquartile range is a measure for variability that is robust against outliers. It is defined the length Q3 - Q2 which covers the 50% of the data in the middle.

iris |> 
  summarize(IQR = 
  quantile(Petal.Length, probs = 0.75) - 
    quantile(Petal.Length, probs = 0.25))
## # A tibble: 1 × 1
##     IQR
##   <dbl>
## 1   3.5

2.5.5 Correlation

2.5.5.1 Pearson Correlation

Correlation can be used for ratio/interval scaled features. We typically think of the Pearson correlation coefficient between features (columns).

cc <- iris |> 
  select(-Species) |> 
  cor()
cc
##              Sepal.Length Sepal.Width Petal.Length
## Sepal.Length       1.0000     -0.1176       0.8718
## Sepal.Width       -0.1176      1.0000      -0.4284
## Petal.Length       0.8718     -0.4284       1.0000
## Petal.Width        0.8179     -0.3661       0.9629
##              Petal.Width
## Sepal.Length      0.8179
## Sepal.Width      -0.3661
## Petal.Length      0.9629
## Petal.Width       1.0000

cor calculates a correlation matrix with pairwise correlations between features. Correlation matrices are symmetric, but different to distances, the whole matrix is stored.

The correlation between Petal.Length and Petal.Width can be visualized using a scatter plot.

ggplot(iris, aes(Petal.Length, Petal.Width)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

geom_smooth adds a regression line by fitting a linear model (lm). Most points are close to this line indicating strong linear dependence (i.e., high correlation).

We can calculate individual correlations by specifying two vectors.

with(iris, cor(Petal.Length, Petal.Width))
## [1] 0.9629

Note: with lets you use columns using just their names and with(iris, cor(Petal.Length, Petal.Width)) is the same as cor(iris$Petal.Length, iris$Petal.Width).

Finally, we can test if a correlation is significantly different from zero.

with(iris, cor.test(Petal.Length, Petal.Width))
## 
##  Pearson's product-moment correlation
## 
## data:  Petal.Length and Petal.Width
## t = 43, df = 148, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9491 0.9730
## sample estimates:
##    cor 
## 0.9629

A small p-value (less than 0.05) indicates that the observed correlation is significantly different from zero. This can also be seen by the fact that the 95% confidence interval does not span zero.

Sepal.Length and Sepal.Width show little correlation:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_point() +   
  geom_smooth(method = "lm") 
## `geom_smooth()` using formula = 'y ~ x'
with(iris, cor(Sepal.Length, Sepal.Width)) 
## [1] -0.1176
with(iris, cor.test(Sepal.Length, Sepal.Width))
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Sepal.Width
## t = -1.4, df = 148, p-value = 0.2
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269  0.04351
## sample estimates:
##     cor 
## -0.1176

2.5.5.2 Rank Correlation

Rank correlation is used for ordinal features or if the correlation is not linear. To show this, we first convert the continuous features in the Iris dataset into ordered factors (ordinal) with three levels using the function cut.

iris_ord <- iris |> 
  mutate(across(where(is.numeric), 
    function(x) cut(x, 3, 
                    labels = c("short", "medium", "long"), 
                    ordered = TRUE)))

iris_ord
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##    <ord>        <ord>       <ord>        <ord>       <fct>  
##  1 short        medium      short        short       setosa 
##  2 short        medium      short        short       setosa 
##  3 short        medium      short        short       setosa 
##  4 short        medium      short        short       setosa 
##  5 short        medium      short        short       setosa 
##  6 short        long        short        short       setosa 
##  7 short        medium      short        short       setosa 
##  8 short        medium      short        short       setosa 
##  9 short        medium      short        short       setosa 
## 10 short        medium      short        short       setosa 
## # ℹ 140 more rows
summary(iris_ord)
##  Sepal.Length Sepal.Width Petal.Length Petal.Width
##  short :59    short :47   short :50    short :50  
##  medium:71    medium:88   medium:54    medium:54  
##  long  :20    long  :15   long  :46    long  :46  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50
iris_ord |> 
  pull(Sepal.Length)
##   [1] short  short  short  short  short  short  short 
##   [8] short  short  short  short  short  short  short 
##  [15] medium medium short  short  medium short  short 
##  [22] short  short  short  short  short  short  short 
##  [29] short  short  short  short  short  short  short 
##  [36] short  short  short  short  short  short  short 
##  [43] short  short  short  short  short  short  short 
##  [50] short  long   medium long   short  medium medium
##  [57] medium short  medium short  short  medium medium
##  [64] medium medium medium medium medium medium medium
##  [71] medium medium medium medium medium medium long  
##  [78] medium medium medium short  short  medium medium
##  [85] short  medium medium medium medium short  short 
##  [92] medium medium short  medium medium medium medium
##  [99] short  medium medium medium long   medium medium
## [106] long   short  long   medium long   medium medium
## [113] long   medium medium medium medium long   long  
## [120] medium long   medium long   medium medium long  
## [127] medium medium medium long   long   long   medium
## [134] medium medium long   medium medium medium long  
## [141] medium long   medium long   medium medium medium
## [148] medium medium medium
## Levels: short < medium < long

Two measures for rank correlation are Kendall’s Tau and Spearman’s Rho.

Kendall’s Tau Rank Correlation Coefficient measures the agreement between two rankings (i.e., ordinal features).

iris_ord |> 
  select(-Species) |> 
  sapply(xtfrm) |> 
  cor(method = "kendall")
##              Sepal.Length Sepal.Width Petal.Length
## Sepal.Length       1.0000     -0.1438       0.7419
## Sepal.Width       -0.1438      1.0000      -0.3299
## Petal.Length       0.7419     -0.3299       1.0000
## Petal.Width        0.7295     -0.3154       0.9198
##              Petal.Width
## Sepal.Length      0.7295
## Sepal.Width      -0.3154
## Petal.Length      0.9198
## Petal.Width       1.0000

Note: We have to use xtfrm to transform the ordered factors into ranks, i.e., numbers representing the order.

Spearman’s Rho is equal to the Pearson correlation between the rank values of those two features.

iris_ord |> 
  select(-Species) |> 
  sapply(xtfrm) |> 
  cor(method = "spearman")
##              Sepal.Length Sepal.Width Petal.Length
## Sepal.Length       1.0000     -0.1570       0.7938
## Sepal.Width       -0.1570      1.0000      -0.3663
## Petal.Length       0.7938     -0.3663       1.0000
## Petal.Width        0.7843     -0.3517       0.9399
##              Petal.Width
## Sepal.Length      0.7843
## Sepal.Width      -0.3517
## Petal.Length      0.9399
## Petal.Width       1.0000

Spearman’s Rho is much faster to compute on large datasets then Kendall’s Tau.

Comparing the rank correlation results with the Pearson correlation on the original data shows that they are very similar. This indicates that discretizing data does not result in the loss of too much information.

iris |> 
  select(-Species) |> 
  cor()
##              Sepal.Length Sepal.Width Petal.Length
## Sepal.Length       1.0000     -0.1176       0.8718
## Sepal.Width       -0.1176      1.0000      -0.4284
## Petal.Length       0.8718     -0.4284       1.0000
## Petal.Width        0.8179     -0.3661       0.9629
##              Petal.Width
## Sepal.Length      0.8179
## Sepal.Width      -0.3661
## Petal.Length      0.9629
## Petal.Width       1.0000

2.5.6 Density

Density estimation estimate the probability density function (distribution) of a continuous variable from observed data.

Just plotting the data using points is not very helpful for a single feature.

ggplot(iris, aes(x = Petal.Length, y = 0)) + geom_point()

2.5.6.1 Histograms

A histograms shows more about the distribution by counting how many values fall within a bin and visualizing the counts as a bar chart. We use geom_rug to place marks for the original data points at the bottom of the histogram.

ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram() +
  geom_rug(alpha = 1/2)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Two-dimensional distributions can be visualized using 2-d binning or hexagonal bins.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_bin2d(bins = 10) +
  geom_jitter(color = "red")
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_hex(bins = 10) +
  geom_jitter(color = "red")

2.5.6.2 Kernel Density Estimate (KDE)

Kernel density estimation is used to estimate the probability density function (distribution) of a feature. It works by replacing each value with a kernel function (often a Gaussian) and then adding them up. The result is an estimated probability density function that looks like a smoothed version of the histogram. The bandwidth (bw) of the kernel controls the amount of smoothing.

ggplot(iris, aes(Petal.Length)) +
  geom_density(bw = .2) +
  geom_rug(alpha = 1/2)

Kernel density estimates can also be done in two dimensions.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_density_2d_filled() +
  geom_jitter()

2.6 Visualization

Visualization uses several components to convey information:

  • Symbols: Circles, dots, lines, bars, …
  • Position: Axes (labels and units) and the origin (0/0) are important.
  • Length, Size and Area: Should only be used if it faithfully represents information.
  • Color: A color should not overpower another. Limit to 3 or 4 colors.
  • Angle: Human eyes are not good at comparing angles!

All components of the plot need to convey information. E.g., do not use color just to make it more colorful. All good visualizations show the important patterns clearly.

For space reasons, the chapter on data exploration and visualization was moved from the printed textbook and can now be found in the Data Exploration Web Chapter.

For visualization in R, we will use mainly ggplot2. The gg in ggplot2 stands for Grammar of Graphics introduced by Wilkinson (2005). The main idea is that every graph is built from the same basic components:

  • the data,
  • a coordinate system, and
  • visual marks representing the data (geoms).

In ggplot2, the components are combined using the + operator.

ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) +
  geom_point()

Since we typically use a Cartesian coordinate system, ggplot uses that by default. Each geom_ function uses a stat_ function to calculate what is visualizes. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ? geom_bar). geom_point just uses the stat "identity" to display the points using the coordinates as they are.

Additional components like the main title, axis labels, and different scales can be also added. A great introduction can be found in the Section on Data Visualization (Wickham, Çetinkaya-Rundel, and Grolemund 2023), and very useful is RStudio’s Data Visualization Cheatsheet.

Next, we go through the basic visualizations used when working with data.

2.6.1 Histogram

Histograms show the distribution of a single continuous feature.

ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)

If the data contains groups, then the group information can be easily added as an aesthetic to the histogram.

ggplot(iris, aes(Petal.Width)) + 
         geom_histogram(bins = 20, aes(fill = Species))

The bars appear stacked with a green blocks on top pf blue blocks. To display the three distributions behind each other, we change position for placement and make the bars slightly translucent using alpha.

ggplot(iris, aes(Petal.Width)) + 
         geom_histogram(bins = 20, aes(fill = Species), alpha = .5, position = 'identity')

2.6.2 Boxplot

Boxplots are used to compare the distribution of a feature between different groups. The horizontal line in the middle of the boxes are the group-wise medians, the boxes span the interquartile range. The whiskers (vertical lines) span typically 1.4 times the interquartile range. Points that fall outside that range are typically outliers shown as dots.

ggplot(iris, aes(Species, Sepal.Length)) + 
  geom_boxplot()

The group-wise medians can also be calculated directly.

iris |> group_by(Species) |> 
  summarize(across(where(is.numeric), median))
## # A tibble: 3 × 5
##   Species  Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>           <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa            5           3.4         1.5          0.2
## 2 versico…          5.9         2.8         4.35         1.3
## 3 virgini…          6.5         3           5.55         2

To compare the distribution of the four features using a ggplot boxplot, we first have to transform the data into long format (i.e., all feature values are combined into a single column).

library(tidyr)
iris_long <- iris |> 
  mutate(id = row_number()) |> 
  pivot_longer(1:4)

ggplot(iris_long, aes(name, value)) + 
  geom_boxplot() +
  labs(y = "Original value")

This visualization is only useful if all features have roughly the same range. The data can be scaled first to compare the distributions.

library(tidyr)
iris_long_scaled <- iris |> 
  scale_numeric() |> 
  mutate(id = row_number()) |> pivot_longer(1:4)

ggplot(iris_long_scaled, aes(name, value)) + 
  geom_boxplot() +
  labs(y = "Scaled value")

2.6.3 Scatter plot

Scatter plots show the relationship between two continuous features.

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) + 
  geom_point(aes(color = Species))

We can add a regression using geom_smooth with the linear model method to show that there is a linear relationship between the two variables. A confidence interval for the regression is also shown. This can be suppressed using se = FALSE.

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) + 
  geom_point(aes(color = Species)) +  
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

We can also perform group-wise linear regression by adding the color aesthetic also to geom_smooth.

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) + 
  geom_point(aes(color = Species)) +  
  geom_smooth(method = "lm", aes(color = Species))
## `geom_smooth()` using formula = 'y ~ x'

The same can be achieved by using the color aesthetic in the qqplot call, then it applies to all geoms.

2.6.4 Scatter Plot Matrix

A scatter plot matrix show the relationship between all pairs of features by arranging panels in a matrix. First, lets look at a regular R-base plot.

pairs(iris, col = iris$Species)

The package GGally provides a way more sophisticated visualization.

library("GGally")
ggpairs(iris,  aes(color = Species), progress = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Additional plots (histograms, density estimates and box plots) and correlation coefficients are shown in different panels. See the Data Quality section for a description of how to interpret the different panels.

2.6.5 Matrix Visualization

Matrix visualization shows the values in the matrix using a color scale.

iris_matrix <- iris |> select(-Species) |> as.matrix()

We need the long format for tidyverse.

iris_long <- as_tibble(iris_matrix) |> 
  mutate(id = row_number()) |> 
  pivot_longer(1:4)

head(iris_long)
## # A tibble: 6 × 3
##      id name         value
##   <int> <chr>        <dbl>
## 1     1 Sepal.Length   5.1
## 2     1 Sepal.Width    3.5
## 3     1 Petal.Length   1.4
## 4     1 Petal.Width    0.2
## 5     2 Sepal.Length   4.9
## 6     2 Sepal.Width    3
ggplot(iris_long, aes(x = name, y = id)) + 
  geom_tile(aes(fill = value))

Smaller values are darker. Package seriation provides a simpler plotting function.

library(seriation)
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
## 
## Attaching package: 'seriation'
## The following object is masked from 'package:lattice':
## 
##     panel.lines
ggpimage(iris_matrix, prop = FALSE)

We can scale the features to z-scores to make them better comparable.

iris_scaled <- scale(iris_matrix)
ggpimage(iris_scaled, prop = FALSE)

This reveals red and blue blocks. Each row is a flower and the flowers in the Iris dataset are sorted by species. The blue blocks for the top 50 flowers show that these flowers are smaller than average for all but Sepal.Width and the red blocks show that the bottom 50 flowers are larger for most features.

Often, reordering data matrices help with visualization. A reordering technique is called seriation. Ir reorders rows and columns to place more similar points closer together.

ggpimage(iris_scaled, order = seriate(iris_scaled), prop = FALSE)

We see that the rows (flowers) are organized from very blue to very red and the features are reordered to move Sepal.Width all the way to the right because it is very different from the other features.

2.6.6 Correlation Matrix

A correlation matrix contains the correlation between features.

cm1 <- iris |> 
  select(-Species) |> 
  as.matrix() |> 
  cor()
cm1
##              Sepal.Length Sepal.Width Petal.Length
## Sepal.Length       1.0000     -0.1176       0.8718
## Sepal.Width       -0.1176      1.0000      -0.4284
## Petal.Length       0.8718     -0.4284       1.0000
## Petal.Width        0.8179     -0.3661       0.9629
##              Petal.Width
## Sepal.Length      0.8179
## Sepal.Width      -0.3661
## Petal.Length      0.9629
## Petal.Width       1.0000

Package ggcorrplot provides a visualization for correlation matrices.

Package seriation provides a reordered version for this plot using a heatmap.

gghmap(cm1, prop = TRUE)

Correlations can also be calculates between objects by transposing the data matrix.

cm2 <- iris |> 
  select(-Species) |> 
  as.matrix() |> 
  t() |> 
  cor()

ggcorrplot(cm2)

Object-to-object correlations can be used as a measure of similarity. The dark red blocks indicate different species.

2.6.7 Parallel Coordinates Plot

Parallel coordinate plots can visualize several features in a single plot. Lines connect the values for each object (flower).

library(GGally)
ggparcoord(iris, columns = 1:4, groupColumn = 5)

The plot can be improved by reordering the variables to place correlated features next to each other.

o <- seriate(as.dist(1-cor(iris[,1:4])), method = "BBURCG")
get_order(o)
## Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
##            3            4            1            2
ggparcoord(iris, 
           columns = as.integer(get_order(o)), 
           groupColumn = 5)

2.6.8 Star Plot

Star plots are a type of radar chart to visualize three or more quantitative variables represented on axes starting from the plot’s origin. R-base offers a simple star plot. We plot the first 5 flowers of each species.

flowers_5 <- iris[c(1:5, 51:55, 101:105), ]
flowers_5
## # A tibble: 15 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          7           3.2          4.7         1.4 versic…
##  7          6.4         3.2          4.5         1.5 versic…
##  8          6.9         3.1          4.9         1.5 versic…
##  9          5.5         2.3          4           1.3 versic…
## 10          6.5         2.8          4.6         1.5 versic…
## 11          6.3         3.3          6           2.5 virgin…
## 12          5.8         2.7          5.1         1.9 virgin…
## 13          7.1         3            5.9         2.1 virgin…
## 14          6.3         2.9          5.6         1.8 virgin…
## 15          6.5         3            5.8         2.2 virgin…
stars(flowers_5[, 1:4], ncol = 5)

2.6.9 More Visualizations

A well organized collection of visualizations with code can be found at The R Graph Gallery.

2.7 Exercises*

The R package palmerpenguins contains measurements for penguin of different species from the Palmer Archipelago, Antarctica. Install the package. It provides a CSV file which can be read in the following way:

library("palmerpenguins")
penguins <- read_csv(path_to_file("penguins.csv"))
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <chr>   <chr>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## 6 Adelie  Torgersen           39.3          20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>

Create in RStudio a new R Markdown document. Apply the code in the sections of this chapter to the data set to answer the following questions.

  1. What is the scale of measurement for each column?
  2. Are there missing values in the data? How much data is missing?
  3. Compute and discuss basic statistics.
  4. Group the penguins by species, island or sex. What can you find out?
  5. Can you identify correlations?
  6. Apply histograms, boxplots, scatterplots and correlation matrix visualization. What do the visualizations show.

Make sure your markdown document contains now a well formatted report. Use the Knit button to create a HTML document.