7 Cluster Analysis
This chapter introduces cluster analysis using K-means, hierarchical clustering, and DBSCAN. We will discuss how to choose the number of clusters and how to evaluate the quality of clusterings. In addition, we will introduce more clustering algorithms and discuss how clustering is influenced by outliers.
The corresponding chapter of the data mining textbook is available online: Chapter 7: Cluster Analysis: Basic Concepts and Algorithms.
Packages Used in this Chapter
pkgs <- c("cluster", "dbscan", "e1071", "factoextra", "fpc",
"GGally", "kernlab", "mclust", "mlbench", "scatterpie",
"seriation", "tidyverse")
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)
The packages used for this chapter are:
- cluster (Maechler et al. 2023)
- dbscan (Hahsler and Piekenbrock 2024)
- e1071 (Meyer et al. 2024)
- factoextra (Kassambara and Mundt 2020)
- fpc (Hennig 2024)
- GGally (Schloerke et al. 2024)
- kernlab (Karatzoglou, Smola, and Hornik 2024)
- mclust (Fraley, Raftery, and Scrucca 2024)
- mlbench (Leisch and Dimitriadou 2024)
- scatterpie (Yu 2024)
- seriation (Hahsler, Buchta, and Hornik 2024)
- tidyverse (Wickham 2023c)
7.1 Overview
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
Clustering is also called unsupervised learning because it tries to directly learn the structure of the data and does not rely on the availability of a correct answer or class label as supervised learning does. Clustering is often used for exploratory analysis or to preprocess data by grouping.
You can read the free sample chapter from the textbook (Tan, Steinbach, and Kumar 2005): Chapter 7. Cluster Analysis: Basic Concepts and Algorithms
7.1.1 Data Preparation
We will use here a small and very clean toy dataset called Ruspini which is included in the R package cluster.
data(ruspini, package = "cluster")
The Ruspini data set, consisting of 75 points in four groups, is popular for illustrating clustering techniques. It is a very simple data set with well-separated clusters. The original dataset has the points ordered by group. We can shuffle the data (rows) using sample_frac, which by default samples 100%.
ruspini <- as_tibble(ruspini) |>
sample_frac()
ruspini
## # A tibble: 75 × 2
## x y
## <int> <int>
## 1 115 117
## 2 111 126
## 3 85 96
## 4 12 88
## 5 36 72
## 6 28 147
## 7 24 58
## 8 99 128
## 9 46 142
## 10 38 151
## # ℹ 65 more rows
7.1.2 Data cleaning
ggplot(ruspini, aes(x = x, y = y)) + geom_point()
summary(ruspini)
## x y
## Min. : 4.0 Min. : 4.0
## 1st Qu.: 31.5 1st Qu.: 56.5
## Median : 52.0 Median : 96.0
## Mean : 54.9 Mean : 92.0
## 3rd Qu.: 76.5 3rd Qu.:141.5
## Max. :117.0 Max. :156.0
For most clustering algorithms it is necessary to handle missing values and outliers (e.g., remove the observations). For details see the Section “Outlier removal” below. This data set has no missing values or strong outliers and looks like it has some very clear groups.
7.1.3 Scale data
Clustering algorithms use distances, and the variables with the largest value range will dominate the distance calculation. The summary above shows that this is not an issue for the Ruspini dataset, with both x and y roughly ranging between 0 and 150. Most data analysts will still scale each column in the data to zero mean and unit standard deviation (z-scores).
Note: The standard scale() function scales a whole data matrix, so we implement a scaling function for a single vector and apply it to all numeric columns.
## I use this till tidyverse implements a scale function
scale_numeric <- function(x) {
x |> mutate(across(where(is.numeric),
function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)))
}
ruspini_scaled <- ruspini |>
scale_numeric()
summary(ruspini_scaled)
## x y
## Min. :-1.6681 Min. :-1.8074
## 1st Qu.:-0.7665 1st Qu.:-0.7295
## Median :-0.0944 Median : 0.0816
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7088 3rd Qu.: 1.0158
## Max. : 2.0366 Max. : 1.3136
After scaling, most z-scores will fall in the range \([-3,3]\) (z-scores are measured in standard deviations from the mean), where \(0\) means average.
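Since all columns of the Ruspini data are numeric, the same z-scores could also be obtained with base R's scale(), which returns a matrix that we convert back into a tibble. A small sketch for comparison:

## base R alternative: scale() returns a matrix, so convert back to a tibble
ruspini_scaled_base <- ruspini |>
  scale() |>
  as_tibble()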
7.2 K-means
k-means implicitly assumes Euclidean distances. We use \(k = 4\) clusters and run the algorithm 10 times with randomly initialized centroids. The best result is returned.
km <- kmeans(ruspini_scaled, centers = 4, nstart = 10)
km
## K-means clustering with 4 clusters of sizes 23, 17, 15, 20
##
## Cluster means:
## x y
## 1 -0.3595 1.1091
## 2 1.4194 0.4693
## 3 0.4607 -1.4912
## 4 -1.1386 -0.5560
##
## Clustering vector:
## [1] 2 2 2 4 4 1 4 2 1 1 1 1 4 3 3 2 4 3 2 1 1 1 2 2 3 2 1 3
## [29] 3 2 1 3 3 4 2 1 3 4 4 3 1 4 1 3 2 4 3 1 2 1 1 2 2 4 4 2
## [57] 1 4 3 4 1 1 2 4 4 4 1 1 3 3 4 1 1 4 4
##
## Within cluster sum of squares by cluster:
## [1] 2.659 3.641 1.082 2.705
## (between_SS / total_SS = 93.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss"
## [4] "withinss" "tot.withinss" "betweenss"
## [7] "size" "iter" "ifault"
The result km is an R object implemented as a list.
str(km)
## List of 9
## $ cluster : int [1:75] 2 2 2 4 4 1 4 2 1 1 ...
## $ centers : num [1:4, 1:2] -0.36 1.419 0.461 -1.139 1.109 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "1" "2" "3" "4"
## .. ..$ : chr [1:2] "x" "y"
## $ totss : num 148
## $ withinss : num [1:4] 2.66 3.64 1.08 2.71
## $ tot.withinss: num 10.1
## $ betweenss : num 138
## $ size : int [1:4] 23 17 15 20
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
The clustering vector is just a list element containing the cluster assignment for each data row and can be accessed using km$cluster. I add the cluster assignment as a column to the original dataset (I make it a factor since it represents a nominal label).
ruspini_clustered <- ruspini |>
add_column(cluster = factor(km$cluster))
ruspini_clustered
## # A tibble: 75 × 3
## x y cluster
## <int> <int> <fct>
## 1 115 117 2
## 2 111 126 2
## 3 85 96 2
## 4 12 88 4
## 5 36 72 4
## 6 28 147 1
## 7 24 58 4
## 8 99 128 2
## 9 46 142 1
## 10 38 151 1
## # ℹ 65 more rows
ggplot(ruspini_clustered, aes(x = x, y = y)) +
geom_point(aes(color = cluster))
Add the centroids to the plot. The centroids are scaled, so we need to unscale them to plot them over the original data. The second geom_point call does not use the original data but uses the centroids as its dataset.
unscale <- function(x, original_data) {
if (ncol(x) != ncol(original_data))
stop("Function needs matching columns!")
x * matrix(apply(original_data, MARGIN = 2, sd, na.rm = TRUE),
byrow = TRUE, nrow = nrow(x), ncol = ncol(x)) +
matrix(apply(original_data, MARGIN = 2, mean, na.rm = TRUE),
byrow = TRUE, nrow = nrow(x), ncol = ncol(x))
}
centroids <- km$centers %>% unscale(original_data = ruspini) %>% as_tibble(rownames = "cluster")
centroids
## # A tibble: 4 × 3
## cluster x y
## <chr> <dbl> <dbl>
## 1 1 43.9 146.
## 2 2 98.2 115.
## 3 3 68.9 19.4
## 4 4 20.2 65.0
ggplot(ruspini_clustered, aes(x = x, y = y)) +
geom_point(aes(color = cluster)) +
geom_point(data = centroids, aes(x = x, y = y, color = cluster),
shape = 3, size = 10)
The factoextra package also provides a good visualization with object labels and ellipses for clusters.
library(factoextra)
fviz_cluster(km, data = ruspini_scaled, centroids = TRUE,
repel = TRUE, ellipse.type = "norm")
7.2.1 Inspect Clusters
We inspect the clusters created by the 4-cluster k-means solution. The following code can be adapted to be used for other clustering methods.
Inspect the centroids with horizontal bar charts organized by cluster. To group the plots by cluster, we have to change the data format to the “long”-format using a pivot operation. I use colors to match the clusters in the scatter plots.
ggplot(pivot_longer(centroids,
cols = c(x, y),
names_to = "feature"),
#aes(x = feature, y = value, fill = cluster)) +
aes(x = value, y = feature, fill = cluster)) +
geom_bar(stat = "identity") +
facet_grid(cols = vars(cluster))
7.2.2 Extract a Single Cluster
All you need to do is filter the rows corresponding to the cluster index. The next example calculates summary statistics and then plots all data points of cluster 1.
cluster1 <- ruspini_clustered |>
filter(cluster == 1)
cluster1
## # A tibble: 23 × 3
## x y cluster
## <int> <int> <fct>
## 1 28 147 1
## 2 46 142 1
## 3 38 151 1
## 4 38 145 1
## 5 50 142 1
## 6 33 154 1
## 7 55 155 1
## 8 52 152 1
## 9 32 149 1
## 10 54 124 1
## # ℹ 13 more rows
summary(cluster1)
## x y cluster
## Min. :28.0 Min. :124 1:23
## 1st Qu.:36.5 1st Qu.:142 2: 0
## Median :44.0 Median :147 3: 0
## Mean :43.9 Mean :146 4: 0
## 3rd Qu.:51.0 3rd Qu.:152
## Max. :63.0 Max. :156
ggplot(cluster1, aes(x = x, y = y)) + geom_point() +
coord_cartesian(xlim = c(0, 120), ylim = c(0, 160))
What happens if we try to cluster with 8 centers?
fviz_cluster(kmeans(ruspini_scaled, centers = 8), data = ruspini_scaled,
centroids = TRUE, geom = "point", ellipse.type = "norm")
7.3 Agglomerative Hierarchical Clustering
Hierarchical clustering starts with a distance matrix. dist() defaults to method = "euclidean".
Notes:
- Distance matrices become very large quickly (the space complexity is \(O(n^2)\) where \(n\) is the number of data points). It is only possible to calculate and store the matrix for small to medium data sets (maybe a few hundred thousand data points) in main memory. If your data is too large, then you can use sampling to reduce the number of points to cluster (see the short sketch after these notes).
- The data needs to be scaled since we compute distances.
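As mentioned in the first note, a large dataset can be subsampled before computing the distance matrix. A small sketch (the sample size of 50 is arbitrary and only for illustration):

## cluster a random subset when the full distance matrix would be too large
ruspini_sample <- ruspini_scaled |> slice_sample(n = 50)
d_sample <- dist(ruspini_sample)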
7.3.1 Creating a Dendrogram
d <- dist(ruspini_scaled)
hclust() implements agglomerative hierarchical clustering.
The linkage function defines how the distance between two clusters (sets of points)
is calculated. We specify the complete-link method where the distance between two groups
is measured by the distance between the two farthest apart points in the two groups.
hc <- hclust(d, method = "complete")
Hierarchical clustering does not return cluster assignments but a dendrogram object which shows how the objects on the x-axis are successively joined together when going up along the y-axis. The y-axis represents the distance at which groups of points are joined together. The standard plot function displays the dendrogram.
plot(hc)
The package factoextra
provides a ggplot function to plot dendrograms.
fviz_dend(hc)
More plotting options for dendrograms, including plotting parts of large dendrograms can be found here.
7.3.2 Extracting a Partitional Clustering
Partitional clusterings with cluster assignments can be extracted from a dendrogram
by cutting the dendrogram horizontally using cutree()
. Here we cut it into four parts
and add the cluster id to the data.
clusters <- cutree(hc, k = 4)
cluster_complete <- ruspini_scaled |>
add_column(cluster = factor(clusters))
cluster_complete
## # A tibble: 75 × 3
## x y cluster
## <dbl> <dbl> <fct>
## 1 1.97 0.513 1
## 2 1.84 0.698 1
## 3 0.987 0.0816 1
## 4 -1.41 -0.0827 2
## 5 -0.619 -0.411 2
## 6 -0.881 1.13 3
## 7 -1.01 -0.699 2
## 8 1.45 0.739 1
## 9 -0.291 1.03 3
## 10 -0.553 1.21 3
## # ℹ 65 more rows
The 4 different clusters can be shown in the dendrogram using different colors.
fviz_dend(hc, k = 4)
We can also color in the scatterplot.
ggplot(cluster_complete, aes(x, y, color = cluster)) +
geom_point()
The package factoextra
provides an alternative visualization. Since it needs
a partition object as the first argument, we create one as a simple list.
fviz_cluster(list(data = ruspini_scaled,
cluster = cutree(hc, k = 4)),
geom = "point")
Here are the results if we use the single-link method to join clusters. Single-link measures the distance between two groups by the distance between the two closest points from the two groups.
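The single-link clustering object hc_single used below is not created in the code shown above; the assumed step is:

hc_single <- hclust(d, method = "single")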
fviz_cluster(list(data = ruspini_scaled,
cluster = cutree(hc_single, k = 4)),
geom = "point")
The most important difference between complete-link and single-link is that complete-link prefers globular clusters and single-link shows chaining (see the staircase pattern in the dendrogram).
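The staircase pattern mentioned above can be inspected by plotting the single-link dendrogram (a small sketch using the hc_single object created earlier):

fviz_dend(hc_single, k = 4)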
7.4 DBSCAN
DBSCAN stands for “Density-Based
Spatial Clustering of Applications with Noise.” It groups together
points that are closely packed together and treats points in low-density
regions as outliers. DBSCAN is implemented in package dbscan
.
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
7.4.1 DBSCAN Parameters
DBSCAN has two parameters that interact with each other. Changing one typically means that the other one also has to be adjusted.
minPts defines how many points are needed in the epsilon neighborhood to make a point a core point for a cluster. It is often chosen as a smoothing parameter, where larger values smooth the density estimation more but also ignore smaller structures with fewer than minPts points.
eps
is the radius of the epsilon neighborhood around a point in which the
number of points is counted.
Users typically first select minPts. We use minPts = 4 here.
To decide on epsilon, the knee in the kNN distance plot is often used.
All points are sorted by their kNN distance. The idea is that points with
a low kNN distance are located in a dense area which should become a cluster.
Points with a large kNN distance are in a low density area and most
likely represent outliers and noise.
kNNdistplot(ruspini_scaled, minPts = 4)
abline(h = .32, col = "red")
Note that minPts includes the point itself, while the k-nearest neighbor distance calculation does not. The plot therefore uses k = minPts - 1.
The knee is visible around eps = .32
(shown by the manually added red line).
The points to the left of the intersection of the
k-nearest neighbor distance line and the red line are the points that will be
core points during clustering at the specified parameters.
7.4.2 Cluster using DBSCAN
Clustering with dbscan()
returns a dbscan object.
db <- dbscan(ruspini_scaled, eps = .32, minPts = 4)
db
## DBSCAN clustering for 75 objects.
## Parameters: eps = 0.32, minPts = 4
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 4 cluster(s) and 5 noise points.
##
## 0 1 2 3 4
## 5 12 20 23 15
##
## Available fields: cluster, eps, minPts, metric,
## borderPoints
str(db)
## List of 5
## $ cluster : int [1:75] 1 1 0 2 2 3 2 1 3 3 ...
## $ eps : num 0.32
## $ minPts : num 4
## $ metric : chr "euclidean"
## $ borderPoints: logi TRUE
## - attr(*, "class")= chr [1:2] "dbscan_fast" "dbscan"
The cluster
element is the cluster assignment with cluster labels. The
special cluster label 0
means that the point is noise and not assigned to a
cluster.
ggplot(ruspini_scaled |> add_column(cluster = factor(db$cluster)),
aes(x, y, color = cluster)) + geom_point()
DBSCAN found 5 noise points. fviz_cluster()
can be
used to create a ggplot visualization.
fviz_cluster(db, ruspini_scaled, geom = "point")
Different values for minPts
and eps
can lead to vastly different clusterings.
Often, a misspecification leads to all points being noise points or a single cluster.
Also, if the clusters are not well separated, then DBSCAN has a hard
time splitting them.
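A short illustration of this sensitivity (the parameter values are arbitrary choices for demonstration):

## eps chosen too small: most points likely end up as noise
dbscan(ruspini_scaled, eps = .1, minPts = 4)
## eps chosen too large: clusters likely merge into a single cluster
dbscan(ruspini_scaled, eps = 1.5, minPts = 4)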
7.5 Cluster Evaluation
7.5.1 Unsupervised Cluster Evaluation
This is also often called internal cluster evaluation since it does not use external labels.
We will evaluate the k-means clustering from above.
ggplot(ruspini_scaled,
aes(x, y, color = factor(km$cluster))) +
geom_point()
d <- dist(ruspini_scaled)
The two most popular quality metrics are the within-cluster sum of
squares (WCSS) used as the optimization objective by
\(k\)-means and the
average silhouette
width. Look at
within.cluster.ss
and avg.silwidth
below.
Some notes about the code:
- I do not load the package fpc using library() since it would mask the dbscan() function in package dbscan. Instead, I use the namespace operator ::.
- The clustering (second argument below) has to be supplied as a vector of integers (cluster IDs) and cannot be a factor (to make sure, we can use as.integer()).
# library(fpc)
fpc::cluster.stats(d, as.integer(km$cluster))
## $n
## [1] 75
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 23 17 15 20
##
## $min.cluster.size
## [1] 15
##
## $noisen
## [1] 0
##
## $diameter
## [1] 1.1591 1.4627 0.8359 1.1193
##
## $average.distance
## [1] 0.4286 0.5806 0.3564 0.4824
##
## $median.distance
## [1] 0.3934 0.5024 0.3380 0.4492
##
## $separation
## [1] 0.7676 0.7676 1.1577 1.1577
##
## $average.toother
## [1] 2.149 2.293 2.308 2.157
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.0000 0.7676 1.958 1.220
## [2,] 0.7676 0.0000 1.308 1.340
## [3,] 1.9577 1.3084 0.000 1.158
## [4,] 1.2199 1.3397 1.158 0.000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000 1.925 2.750 1.887
## [2,] 1.925 0.000 2.220 2.772
## [3,] 2.750 2.220 0.000 1.874
## [4,] 1.887 2.772 1.874 0.000
##
## $average.between
## [1] 2.219
##
## $average.within
## [1] 0.463
##
## $n.between
## [1] 2091
##
## $n.within
## [1] 684
##
## $max.diameter
## [1] 1.463
##
## $min.separation
## [1] 0.7676
##
## $within.cluster.ss
## [1] 10.09
##
## $clus.avg.silwidths
## 1 2 3 4
## 0.7455 0.6813 0.8074 0.7211
##
## $avg.silwidth
## [1] 0.7368
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.8416
##
## $dunn
## [1] 0.5248
##
## $dunn2
## [1] 3.228
##
## $entropy
## [1] 1.373
##
## $wb.ratio
## [1] 0.2086
##
## $ch
## [1] 323.6
##
## $cwidegap
## [1] 0.3153 0.4150 0.2352 0.2612
##
## $widestgap
## [1] 0.415
##
## $sindex
## [1] 0.8583
##
## $corrected.rand
## NULL
##
## $vi
## NULL
The correlation between the distances and a 0-1-vector cluster incidence matrix is
called pearsongamma
.
Some numbers are NULL
. These are measures only available for supervised evaluation.
Read the man page for cluster.stats()
for an explanation of all the available indices.
We can compare different clusterings.
sapply(
list(
km_4 = km$cluster,
hc_compl_4 = cutree(hc, k = 4),
hc_compl_5 = cutree(hc, k = 5),
hc_single_5 = cutree(hc_single, k = 5)
),
FUN = function(x)
fpc::cluster.stats(d, x))[c("within.cluster.ss", "avg.silwidth", "pearsongamma", "dunn"), ]
## km_4 hc_compl_4 hc_compl_5 hc_single_5
## within.cluster.ss 10.09 10.09 8.314 7.791
## avg.silwidth 0.7368 0.7368 0.6642 0.6886
## pearsongamma 0.8416 0.8416 0.8042 0.816
## dunn 0.5248 0.5248 0.1988 0.358
With 4 clusters, k-means and hierarchical clustering produce exactly the same clustering. While the two different hierarchical methods with 5 clusters produce a smaller WCSS, they are actually worse given all three other measures.
Next, we look at the silhouette using a silhouette plot.
library(cluster)
##
## Attaching package: 'cluster'
## The following object is masked _by_ '.GlobalEnv':
##
## ruspini
plot(silhouette(km$cluster, d))
Note: The silhouette plot does not show correctly in RStudio if you have too many objects (bars are missing). It will work when you open a new plotting device with windows(), x11(), or quartz().
A ggplot visualization using factoextra:
fviz_silhouette(silhouette(km$cluster, d))
## cluster size ave.sil.width
## 1 1 23 0.75
## 2 2 17 0.68
## 3 3 15 0.81
## 4 4 20 0.72
Unsupervised cluster evaluation can be done by inspecting and visualizing the proximity matrix. We can inspect the distance matrix between the first 5 objects.
as.matrix(d)[1:5, 1:5]
## 1 2 3 4 5
## 1 0.0000 0.2266 1.074 3.4289 2.7498
## 2 0.2266 0.0000 1.052 3.3381 2.6972
## 3 1.0739 1.0517 0.000 2.3989 1.6803
## 4 3.4289 3.3381 2.399 0.0000 0.8527
## 5 2.7498 2.6972 1.680 0.8527 0.0000
A false-color image visualizes each value in the matrix as a pixel with the color representing the value.
Rows and columns are the objects as they are ordered in the data set. The diagonal represents the distance between an object and itself and has by definition a distance of 0 (dark line). Visualizing the unordered distance matrix does not show much structure, but we can reorder the matrix (rows and columns) using the k-means cluster labels from cluster 1 to 4. A clear block structure representing the clusters becomes visible.
The plot function dissplot in package seriation (we use its ggplot2 version ggdissplot below) rearranges the matrix and adds lines and cluster labels. In the lower half of the plot, it shows average dissimilarities between clusters. The function organizes the objects by cluster and then reorders clusters and objects within clusters so that more similar objects are closer together.
ggdissplot(d, labels = km$cluster,
options = list(main = "k-means with k=4"))
The reordering by dissplot
makes the misspecification of k visible as
blocks.
ggdissplot(d,
labels = kmeans(ruspini_scaled, centers = 3)$cluster)
ggdissplot(d,
labels = kmeans(ruspini_scaled, centers = 9)$cluster)
Using factoextra
fviz_dist(d)
7.5.2 Determining the Correct Number of Clusters
The user needs to specify the number of clusters for most clustering algorithms. Determining the number of clusters in a data set is therefore an important task before we can cluster data. We will apply different methods to the scaled Ruspini data set.
ggplot(ruspini_scaled, aes(x, y)) + geom_point()
## We will use different methods and try 2 to 10 clusters.
set.seed(1234)
ks <- 2:10
7.5.2.1 Elbow Method: Within-Cluster Sum of Squares
A method often used for k-means is to calculate the within-cluster sum of squares for different numbers of
clusters and look for the knee or
elbow in the
plot. We use nstart = 5
to restart k-means 5 times and return the best
solution.
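The computation itself is not included in this excerpt; a minimal sketch following the description above (nstart = 5, with k ranging over ks):

WCSS <- sapply(ks, FUN = function(k) {
  kmeans(ruspini_scaled, centers = k, nstart = 5)$tot.withinss
})
ggplot(tibble(ks, WCSS), aes(ks, WCSS)) +
  geom_line()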
7.5.2.2 Average Silhouette Width
Another popular method (often preferred for clustering methods starting with a distance matrix) is to plot the average silhouette width for different numbers of clusters and look for the maximum in the plot.
ASW <- sapply(ks, FUN=function(k) {
fpc::cluster.stats(d,
kmeans(ruspini_scaled,
centers = k,
nstart = 5)$cluster)$avg.silwidth
})
best_k <- ks[which.max(ASW)]
best_k
## [1] 4
ggplot(tibble(ks, ASW), aes(ks, ASW)) +
geom_line() +
geom_vline(xintercept = best_k, color = "red", linetype = 2)
7.5.2.3 Dunn Index
The Dunn index is another internal measure given by the smallest separation between clusters scaled by the largest cluster diameter.
DI <- sapply(ks, FUN = function(k) {
fpc::cluster.stats(d,
kmeans(ruspini_scaled, centers = k,
nstart = 5)$cluster)$dunn
})
best_k <- ks[which.max(DI)]
ggplot(tibble(ks, DI), aes(ks, DI)) +
geom_line() +
geom_vline(xintercept = best_k, color = "red", linetype = 2)
7.5.2.4 Gap Statistic
The Gap statistic compares the change in within-cluster dispersion with that expected from a null model (see clusGap()). The default method is to choose the smallest k such that its value Gap(k) is not more than 1 standard error away from the first local maximum.
library(cluster)
k <- clusGap(ruspini_scaled,
FUN = kmeans,
nstart = 10,
K.max = 10)
k
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = ruspini_scaled, FUNcluster = kmeans, K.max = 10, nstart = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
## logW E.logW gap SE.sim
## [1,] 3.498 3.467 -0.03080 0.03570
## [2,] 3.074 3.150 0.07625 0.03744
## [3,] 2.678 2.903 0.22466 0.03796
## [4,] 2.106 2.704 0.59710 0.03633
## [5,] 1.987 2.570 0.58271 0.03474
## [6,] 1.864 2.451 0.58713 0.03651
## [7,] 1.732 2.348 0.61558 0.03954
## [8,] 1.640 2.256 0.61570 0.04132
## [9,] 1.561 2.172 0.61129 0.04090
## [10,] 1.513 2.093 0.57987 0.03934
plot(k)
There have been many other methods and indices proposed to determine the number of clusters. See, e.g., package NbClust.
7.5.3 Clustering Tendency
Most clustering algorithms will always produce a clustering, even if the data does not contain a cluster structure. It is typically good to check cluster tendency before attempting to cluster the data.
We use again the smiley data.
library(mlbench)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)$x
colnames(shapes) <- c("x", "y")
shapes <- as_tibble(shapes)
7.5.3.1 Scatter plots
The first step is visual inspection using scatter plots.
ggplot(shapes, aes(x = x, y = y)) + geom_point()
Cluster tendency is typically indicated by several separated point clouds. Often an appropriate number of clusters can also be obtained visually by counting the number of point clouds. We see four clusters, but the mouth is not convex/spherical and thus will pose a problem for algorithms like k-means.
If the data has more than two features then you can use a pairs plot (scatterplot matrix) or look at a scatterplot of the first two principal components using PCA.
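As an illustrative sketch only (the smiley data has just two features, so PCA is trivial here), the principal-component scatter plot could be produced like this:

pc <- prcomp(shapes, scale. = TRUE)
ggplot(as_tibble(pc$x), aes(PC1, PC2)) + geom_point()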
7.5.3.2 Visual Analysis for Cluster Tendency Assessment (VAT)
VAT reorders the objects to show potential clustering tendency as a block structure (dark blocks along the main diagonal). We scale the data before using Euclidean distance.
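The distance matrix on the scaled data is computed first (this step is assumed here since it is not shown); ggVAT() from seriation produces the corresponding VAT image, analogous to the ggiVAT() call below:

d_shapes <- dist(scale(shapes))
ggVAT(d_shapes)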
iVAT uses the largest distances for all possible paths between two objects instead of the direct distances to make the block structure better visible.
ggiVAT(d_shapes)
7.5.3.3 Hopkins statistic
factoextra can also create a VAT plot and calculate the Hopkins statistic to assess clustering tendency. For the Hopkins statistic, a sample of size \(n\) is drawn from the data and its nearest-neighbor distribution is compared with that of a simulated dataset drawn from a random uniform distribution (see the detailed explanation). A value greater than .5 usually indicates a clustering tendency.
get_clust_tendency(shapes, n = 10)
## $hopkins_stat
## [1] 0.9074
##
## $plot
Both plots show a strong cluster structure with 4 clusters.
7.5.3.4 Data Without Clustering Tendency
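The code generating the random data is not shown in this excerpt. A minimal sketch that creates the data_random and d_random objects used below, assuming uniformly distributed points (the sample size of 500 is an assumption, and the numbers printed further down come from the original random sample, so they will differ):

## uniform random data without any cluster structure
data_random <- tibble(x = runif(500), y = runif(500))
d_random <- dist(data_random)
ggplot(data_random, aes(x, y)) + geom_point()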
No point clouds are visible, just noise.
ggiVAT(d_random)
get_clust_tendency(data_random, n = 10, graph = FALSE)
## $hopkins_stat
## [1] 0.4642
##
## $plot
## NULL
There is very little clustering structure visible, indicating low clustering tendency, and clustering should not be performed on this data. However, k-means can be used to partition the data into \(k\) regions of roughly equivalent size. This can be used as a data-driven discretization of the space.
7.5.3.5 k-means on Data Without Clustering Tendency
What happens if we perform k-means on data that has no inherent clustering structure?
km <- kmeans(data_random, centers = 4)
random_clustered<- data_random |>
add_column(cluster = factor(km$cluster))
ggplot(random_clustered, aes(x = x, y = y, color = cluster)) +
geom_point()
k-means discretizes the space into similarly sized regions.
7.5.4 Supervised Cluster Evaluation
This is also called external cluster validation since it uses ground truth information. That is, the user has an idea of how the data should be grouped. This could be a known class label not provided to the clustering algorithm.
We use an artificial data set with known groups.
library(mlbench)
set.seed(1234)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)
plot(shapes)
First, we prepare the data and hide the known class label, which we will use for evaluation after clustering.
truth <- as.integer(shapes$class)
shapes <- shapes$x
colnames(shapes) <- c("x", "y")
shapes <- shapes |> scale() |> as_tibble()
ggplot(shapes, aes(x, y)) + geom_point()
Find the optimal number of clusters for k-means.
ks <- 2:20
Use the within-cluster sum of squares (look for the knee).
WCSS <- sapply(ks, FUN = function(k) {
kmeans(shapes, centers = k, nstart = 10)$tot.withinss
})
ggplot(tibble(ks, WCSS), aes(ks, WCSS)) + geom_line()
Looks like it could be 7 clusters?
km <- kmeans(shapes, centers = 7, nstart = 10)
ggplot(shapes |> add_column(cluster = factor(km$cluster)),
aes(x, y, color = cluster)) +
geom_point()
The mouth is an issue for k-means. We could use hierarchical clustering with single-linkage because the mouth is non-convex and chaining may help.
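The distance matrix and single-link hierarchical clustering for the shapes data are not shown above; the following sketch creates the d and hc objects assumed by the code below:

d <- dist(shapes)
hc <- hclust(d, method = "single")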
Find optimal number of clusters
ASW <- sapply(ks, FUN = function(k) {
fpc::cluster.stats(d, cutree(hc, k))$avg.silwidth
})
ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line()
The maximum is clearly at 4 clusters.
hc_4 <- cutree(hc, 4)
ggplot(shapes |> add_column(cluster = factor(hc_4)),
aes(x, y, color = cluster)) +
geom_point()
Compare with the ground truth using the adjusted Rand index (ARI, also called corrected Rand index), the variation of information (VI) index, mutual information (MI), entropy and purity.
fpc::cluster.stats() computes ARI and VI as comparative measures. I define functions for the purity and the entropy of a clustering given the ground truth here:
purity <- function(cluster, truth, show_table = FALSE) {
if (length(cluster) != length(truth))
stop("Cluster vector and ground truth vectors are not of the same length!")
# tabulate
tbl <- table(cluster, truth)
if(show_table)
print(tbl)
# find majority class
majority <- apply(tbl, 1, max)
sum(majority) / length(cluster)
}
entropy <- function(cluster, truth, show_table = FALSE) {
if (length(cluster) != length(truth))
stop("Cluster vector and ground truth vectors are not of the same length!")
# calculate membership probability of cluster to class
tbl <- table(cluster, truth)
p <- sweep(tbl, 2, colSums(tbl), "/")
if(show_table)
print(p)
# calculate cluster entropy
e <- -p * log(p, 2)
e <- rowSums(e, na.rm = TRUE)
# weighted sum over clusters
w <- table(cluster) / length(cluster)
sum(w * e)
}
We calculate the measures and add for comparison two random “clusterings” with 4 and 6 clusters.
random_4 <- sample(1:4, nrow(shapes), replace = TRUE)
random_6 <- sample(1:6, nrow(shapes), replace = TRUE)
r <- rbind(
truth = c(
unlist(fpc::cluster.stats(d, truth,
truth, compareonly = TRUE)),
purity = purity(truth, truth),
entropy = entropy(truth, truth)
),
kmeans_7 = c(
unlist(fpc::cluster.stats(d, km$cluster,
truth, compareonly = TRUE)),
purity = purity(km$cluster, truth),
entropy = entropy(km$cluster, truth)
),
hc_4 = c(
unlist(fpc::cluster.stats(d, hc_4,
truth, compareonly = TRUE)),
purity = purity(hc_4, truth),
entropy = entropy(hc_4, truth)
),
random_4 = c(
unlist(fpc::cluster.stats(d, random_4,
truth, compareonly = TRUE)),
purity = purity(random_4, truth),
entropy = entropy(random_4, truth)
),
random_6 = c(
unlist(fpc::cluster.stats(d, random_6,
truth, compareonly = TRUE)),
purity = purity(random_6, truth),
entropy = entropy(random_6, truth)
)
)
r
## corrected.rand vi purity entropy
## truth 1.000000 0.0000 1.000 0.0000
## kmeans_7 0.638229 0.5709 1.000 0.2088
## hc_4 1.000000 0.0000 1.000 0.0000
## random_4 -0.003235 2.6832 0.418 1.9895
## random_6 -0.002125 3.0763 0.418 1.7129
Notes:
- Comparing the ground truth to itself produces perfect scores.
- Hierarchical clustering found the perfect clustering.
- Entropy and purity are heavily impacted by the number of clusters (more clusters improve the metric) and by the fact that the mouth contains 41.8% of all data points, which automatically becomes the majority class for purity.
- The adjusted Rand index clearly shows that the random clusterings have no relationship with the ground truth (values very close to 0). This is a very helpful property and explains why the ARI is the most popular measure.
Read the manual page for fpc::cluster.stats()
for an explanation of all the available indices.
7.6 More Clustering Algorithms*
Note: Some of these methods are covered in Chapter 8 of the textbook.
7.6.1 Partitioning Around Medoids (PAM)
PAM tries to solve the \(k\)-medoids problem. The problem is similar to \(k\)-means, but it uses medoids instead of centroids to represent clusters. Like hierarchical clustering, it typically works with a precomputed distance matrix, and an advantage is that you can use any distance metric, not just Euclidean distances. Note: The medoid is the most central data point in the middle of the cluster. PAM is a lot more computationally expensive than k-means.
library(cluster)
d <- dist(ruspini_scaled)
str(d)
## 'dist' num [1:2775] 0.227 1.074 3.429 2.75 2.918 ...
## - attr(*, "Size")= int 75
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "Euclidean"
## - attr(*, "call")= language dist(x = ruspini_scaled)
p <- pam(d, k = 4)
p
## Medoids:
## ID
## [1,] 49 49
## [2,] 46 46
## [3,] 41 41
## [4,] 44 44
## Clustering vector:
## [1] 1 1 1 2 2 3 2 1 3 3 3 3 2 4 4 1 2 4 1 3 3 3 1 1 4 1 3 4
## [29] 4 1 3 4 4 2 1 3 4 2 2 4 3 2 3 4 1 2 4 3 1 3 3 1 1 2 2 1
## [57] 3 2 4 2 3 3 1 2 2 2 3 3 4 4 2 3 3 2 2
## Objective function:
## build swap
## 0.4423 0.3187
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective"
## [5] "isolation" "clusinfo" "silinfo" "diss"
## [9] "call"
Extract the clustering and medoids for visualization.
ruspini_clustered <- ruspini_scaled |>
add_column(cluster = factor(p$clustering))
medoids <- as_tibble(ruspini_scaled[p$medoids, ],
rownames = "cluster")
medoids
## # A tibble: 4 × 3
## cluster x y
## <chr> <dbl> <dbl>
## 1 1 1.45 0.554
## 2 2 -1.18 -0.555
## 3 3 -0.357 1.17
## 4 4 0.463 -1.46
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) +
geom_point() +
geom_point(data = medoids, aes(x = x, y = y, color = cluster),
shape = 3, size = 10)
Alternative visualization using fviz_cluster()
.
## __Note:__ `fviz_cluster` needs the original data.
fviz_cluster(c(p, list(data = ruspini_scaled)),
geom = "point",
ellipse.type = "norm")
7.6.2 Gaussian Mixture Models
library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
##
## map
Gaussian mixture
models
assume that the data set is the result of drawing data from a set of
Gaussian distributions where each distribution represents a cluster.
Estimation algorithms try to identify the location parameters of the
distributions and thus can be used to find clusters. Mclust()
uses
Bayesian Information Criterion (BIC) to find the number of clusters
(model selection). BIC uses the likelihood and a penalty term to guard
against overfitting.
m <- Mclust(ruspini_scaled)
summary(m)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 5
## components:
##
## log-likelihood n df BIC ICL
## -91.26 75 16 -251.6 -251.7
##
## Clustering table:
## 1 2 3 4 5
## 14 3 20 23 15
plot(m, what = "classification")
Rerun with a fixed number of 4 clusters
m <- Mclust(ruspini_scaled, G=4)
summary(m)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 4
## components:
##
## log-likelihood n df BIC ICL
## -101.6 75 13 -259.3 -259.3
##
## Clustering table:
## 1 2 3 4
## 17 20 23 15
plot(m, what = "classification")
7.6.3 Spectral clustering
Spectral clustering works by embedding the data points into the subspace spanned by the \(k\) largest eigenvectors of a normalized affinity/kernel matrix and then applies a simple clustering method like k-means in that subspace.
library("kernlab")
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:arulesSequences':
##
## size
## The following object is masked from 'package:scales':
##
## alpha
## The following object is masked from 'package:arules':
##
## size
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
cluster_spec <- specc(as.matrix(ruspini_scaled), centers = 4)
cluster_spec
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 3 3 3 4 4 1 4 3 1 1 1 1 4 2 2 3 4 2 3 1 1 1 3 3 2 3 1 2 2 3 1 2 2 4 3 1 2 4 4 2 1 4 1 2 3 4 2 1 3 1 1 3 3 4 4 3 1 4 2 4 1 1 3 4 4 4 1 1 2 2 4 1 1 4 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 22.6352409453921
##
## Centers:
## [,1] [,2]
## [1,] -0.3595 1.1091
## [2,] 0.4607 -1.4912
## [3,] 1.4194 0.4693
## [4,] -1.1386 -0.5560
##
## Cluster size:
## [1] 23 15 17 20
##
## Within-cluster sum of squares:
## [1] 52.001 49.505 15.880 9.855
ggplot(ruspini_scaled |>
add_column(cluster = factor(cluster_spec)),
aes(x, y, color = cluster)) +
geom_point()
7.6.4 Fuzzy C-Means Clustering
The fuzzy clustering version of the k-means clustering problem: each data point has a degree of membership for each cluster.
library("e1071")
cluster_cmeans <- cmeans(as.matrix(ruspini_scaled), centers = 4)
cluster_cmeans
## Fuzzy c-means clustering with 4 clusters
##
## Cluster centers:
## x y
## 1 -0.3763 1.1143
## 2 -1.1371 -0.5550
## 3 0.4552 -1.4760
## 4 1.5048 0.5161
##
## Memberships:
## 1 2 3 4
## [1,] 3.390e-02 0.0184315 0.0318371 9.158e-01
## [2,] 2.683e-02 0.0130798 0.0205431 9.395e-01
## [3,] 1.100e-01 0.0654667 0.1188610 7.056e-01
## [4,] 9.817e-02 0.8288408 0.0452756 2.771e-02
## [5,] 9.312e-02 0.7683796 0.0971244 4.138e-02
## [6,] 8.622e-01 0.0758548 0.0256694 3.626e-02
## [7,] 9.546e-03 0.9731667 0.0127762 4.511e-03
## [8,] 1.483e-02 0.0061528 0.0087250 9.703e-01
## [9,] 9.889e-01 0.0046259 0.0021818 4.268e-03
## [10,] 9.753e-01 0.0114717 0.0048175 8.409e-03
## [11,] 9.787e-01 0.0103280 0.0041355 6.879e-03
## [12,] 9.597e-01 0.0151615 0.0078892 1.728e-02
## [13,] 2.687e-02 0.9343464 0.0272600 1.153e-02
## [14,] 2.343e-02 0.0384835 0.8921799 4.591e-02
## [15,] 1.016e-02 0.0183426 0.9537436 1.775e-02
## [16,] 2.457e-03 0.0011603 0.0018407 9.945e-01
## [17,] 3.663e-02 0.8900877 0.0549646 1.831e-02
## [18,] 1.136e-02 0.0249347 0.9472522 1.646e-02
## [19,] 1.739e-01 0.1075081 0.1773901 5.412e-01
## [20,] 9.245e-01 0.0371563 0.0146304 2.371e-02
## [21,] 8.916e-01 0.0333605 0.0199925 5.510e-02
## [22,] 9.396e-01 0.0204596 0.0114691 2.852e-02
## [23,] 4.091e-02 0.0229225 0.0405208 8.957e-01
## [24,] 7.444e-03 0.0032213 0.0047480 9.846e-01
## [25,] 9.526e-03 0.0254625 0.9531437 1.187e-02
## [26,] 2.177e-01 0.1283064 0.1837577 4.703e-01
## [27,] 9.187e-01 0.0419859 0.0155199 2.380e-02
## [28,] 1.087e-02 0.0205930 0.9503614 1.818e-02
## [29,] 1.347e-03 0.0031668 0.9936328 1.853e-03
## [30,] 5.072e-04 0.0002500 0.0004109 9.988e-01
## [31,] 7.514e-01 0.0920641 0.0519039 1.047e-01
## [32,] 9.017e-03 0.0173387 0.9591007 1.454e-02
## [33,] 1.664e-03 0.0038610 0.9921769 2.298e-03
## [34,] 3.001e-02 0.9148709 0.0405049 1.461e-02
## [35,] 1.130e-02 0.0058690 0.0099612 9.729e-01
## [36,] 9.103e-01 0.0483879 0.0168066 2.448e-02
## [37,] 1.727e-02 0.0498112 0.9125435 2.038e-02
## [38,] 5.521e-02 0.8604295 0.0593827 2.497e-02
## [39,] 5.307e-02 0.8953172 0.0336329 1.798e-02
## [40,] 1.083e-02 0.0287381 0.9470194 1.341e-02
## [41,] 9.977e-01 0.0009642 0.0004512 8.878e-04
## [42,] 2.538e-02 0.9426475 0.0220730 9.897e-03
## [43,] 7.551e-01 0.0672204 0.0448198 1.329e-01
## [44,] 5.053e-05 0.0001096 0.9997656 7.424e-05
## [45,] 9.619e-02 0.0392619 0.0536219 8.109e-01
## [46,] 4.470e-04 0.9989329 0.0004363 1.838e-04
## [47,] 3.346e-03 0.0082407 0.9839904 4.422e-03
## [48,] 9.923e-01 0.0033856 0.0014976 2.768e-03
## [49,] 1.324e-03 0.0006092 0.0009437 9.971e-01
## [50,] 7.016e-01 0.0713856 0.0509694 1.760e-01
## [51,] 9.467e-01 0.0256411 0.0103562 1.729e-02
## [52,] 1.387e-02 0.0075739 0.0135384 9.650e-01
## [53,] 8.074e-03 0.0034561 0.0050418 9.834e-01
## [54,] 2.248e-02 0.9302629 0.0358208 1.143e-02
## [55,] 4.917e-02 0.8877345 0.0431078 1.999e-02
## [56,] 1.267e-01 0.0394178 0.0461292 7.877e-01
## [57,] 9.754e-01 0.0120663 0.0047505 7.762e-03
## [58,] 2.256e-02 0.9540764 0.0155193 7.844e-03
## [59,] 1.396e-03 0.0030791 0.9934965 2.029e-03
## [60,] 1.357e-02 0.9712826 0.0102384 4.907e-03
## [61,] 9.293e-01 0.0376080 0.0133145 1.973e-02
## [62,] 9.272e-01 0.0247817 0.0139384 3.405e-02
## [63,] 1.929e-02 0.0106791 0.0192351 9.508e-01
## [64,] 3.837e-02 0.8665080 0.0740350 2.108e-02
## [65,] 3.143e-03 0.9920954 0.0034025 1.359e-03
## [66,] 1.710e-02 0.9626089 0.0138085 6.485e-03
## [67,] 9.763e-01 0.0095410 0.0046349 9.540e-03
## [68,] 9.885e-01 0.0044831 0.0022370 4.751e-03
## [69,] 1.011e-02 0.0241498 0.9523630 1.337e-02
## [70,] 1.111e-02 0.0203874 0.9492340 1.926e-02
## [71,] 4.507e-02 0.9045022 0.0339790 1.645e-02
## [72,] 9.964e-01 0.0015597 0.0007050 1.321e-03
## [73,] 9.693e-01 0.0112508 0.0059273 1.354e-02
## [74,] 4.260e-02 0.8727568 0.0633457 2.130e-02
## [75,] 2.104e-02 0.9397664 0.0290866 1.010e-02
##
## Closest hard clustering:
## [1] 4 4 4 2 2 1 2 4 1 1 1 1 2 3 3 4 2 3 4 1 1 1 4 4 3 4 1 3
## [29] 3 4 1 3 3 2 4 1 3 2 2 3 1 2 1 3 4 2 3 1 4 1 1 4 4 2 2 4
## [57] 1 2 3 2 1 1 4 2 2 2 1 1 3 3 2 1 1 2 2
##
## Available components:
## [1] "centers" "size" "cluster" "membership"
## [5] "iter" "withinerror" "call"
Plot membership (shown as small pie charts)
library("scatterpie")
## scatterpie v0.2.4 Learn more at https://yulab-smu.top/
##
## Please cite:
##
## D Wang, G Chen, L Li, S Wen, Z Xie, X Luo, L
## Zhan, S Xu, J Li, R Wang, Q Wang, G Yu. Reducing language
## barriers, promoting information absorption, and
## communication using fanyi. Chinese Medical Journal. 2024,
## 137(16):1950-1956. doi: 10.1097/CM9.0000000000003242
ggplot() +
geom_scatterpie(
data = cbind(ruspini_scaled, cluster_cmeans$membership),
aes(x = x, y = y),
cols = colnames(cluster_cmeans$membership),
legend_name = "Membership") +
coord_equal()
7.7 Outliers in Clustering*
Most clustering algorithms perform complete assignment (i.e., all data points need to be assigned to a cluster). Outliers will affect the clustering. It is useful to identify outliers and remove strong outliers prior to clustering. A density-based method to identify outliers is LOF (Local Outlier Factor). It is related to DBSCAN and compares the density around a point with the densities around its neighbors (you have to specify the neighborhood size \(k\)). The LOF value for a regular data point is 1. The larger the LOF value gets, the more likely the point is an outlier.
Add a clear outlier to the scaled Ruspini dataset that is 10 standard deviations above the average for the x axis.
ruspini_scaled_outlier <- ruspini_scaled |> add_case(x=10,y=0)
7.7.1 Visual inspection of the data
Outliers can be identified using summary statistics, histograms, scatterplots (pairs plots), boxplots, etc. We use here a pairs plot (the diagonal contains smoothed histograms). The outlier is visible as the single separate point in the scatter plot and as the long tail of the smoothed histogram for x. Remember, in scaled data we expect most observations to fall in the range \([-3,3]\).
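The pairs plot itself is not included in this excerpt; a small sketch using ggpairs() from GGally (one of the packages loaded for this chapter):

library(GGally)
ggpairs(ruspini_scaled_outlier, progress = FALSE)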
The outlier lies far outside the expected range of the scaled data and presents a problem for k-means.
km <- kmeans(ruspini_scaled_outlier, centers = 4, nstart = 10)
ruspini_scaled_outlier_km <- ruspini_scaled_outlier|>
add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")
ggplot(ruspini_scaled_outlier_km, aes(x = x, y = y, color = cluster)) +
geom_point() +
geom_point(data = centroids,
aes(x = x, y = y, color = cluster),
shape = 3, size = 10)
This problem can be fixed by increasing the number of clusters and removing small clusters in a post-processing step or by identifying and removing outliers before clustering.
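A minimal sketch of the post-processing idea (the number of clusters and the size threshold below are illustrative choices):

km_more <- kmeans(ruspini_scaled_outlier, centers = 6, nstart = 10)
## treat very small clusters (here fewer than 3 points) as outliers and remove them
small_clusters <- which(km_more$size < 3)
ruspini_post <- ruspini_scaled_outlier |>
  filter(!(km_more$cluster %in% small_clusters))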
7.7.2 Local Outlier Factor (LOF)
The Local Outlier Factor is related to concepts from DBSCAN and can help to identify potential outliers. To calculate the LOF, a local neighborhood size (minPts in DBSCAN) for density estimation needs to be chosen. I use 10 here since I expect my clusters to each have more than 10 points.
lof <- lof(ruspini_scaled_outlier, minPts= 10)
lof
## [1] 1.0842 1.0062 1.3682 1.2124 1.0830 1.1750 0.9778
## [8] 0.9982 0.9716 0.9653 0.9903 1.0409 1.0509 1.2518
## [15] 1.0262 0.9377 1.0198 1.0238 1.5716 1.0680 1.2088
## [22] 1.0702 1.1375 1.0022 0.9957 1.5976 1.0340 1.0708
## [29] 1.0145 0.9385 1.6258 0.9758 0.9972 1.0828 1.0206
## [36] 1.0396 1.0214 1.2090 1.0881 0.9809 0.9319 1.0127
## [43] 1.3973 0.9939 1.1603 0.9411 0.9928 0.9427 0.9329
## [50] 1.4943 1.0213 1.0055 0.9920 0.9536 1.0852 1.1429
## [57] 0.9815 1.0200 0.9864 0.9175 1.0340 1.1045 1.0041
## [64] 1.0413 0.9825 1.0393 0.9850 1.0187 0.9625 1.0593
## [71] 1.0008 0.9336 1.0515 1.0234 0.9767 17.8995
ggplot(ruspini_scaled_outlier |> add_column(lof = lof),
aes(x, y, color = lof)) +
geom_point() +
scale_color_gradient(low = "gray", high = "red")
Plot the points sorted by increasing LOF and look for a knee.
ggplot(tibble(index = seq_len(length(lof)), lof = sort(lof)),
aes(index, lof)) +
geom_line() +
geom_hline(yintercept = 1, color = "red", linetype = 2)
We choose here a threshold of 2: points with an LOF of 2 or more are flagged as outliers.
ggplot(ruspini_scaled_outlier |> add_column(outlier = lof >= 2),
aes(x, y, color = outlier)) +
geom_point()
You should analyze the found outliers. They are often interesting and important data points. Perform clustering on the regular data without outliers.
ruspini_scaled_clean <- ruspini_scaled_outlier |> filter(lof < 2)
km <- kmeans(ruspini_scaled_clean, centers = 4, nstart = 10)
ruspini_scaled_clean_km <- ruspini_scaled_clean|>
add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")
ggplot(ruspini_scaled_clean_km, aes(x = x, y = y, color = cluster)) +
geom_point() +
geom_point(data = centroids,
aes(x = x, y = y, color = cluster),
shape = 3, size = 10)
There are many other outlier removal strategies available. See, e.g., package outliers.
7.8 Exercises*
We will again use the Palmer penguin data for the exercises.
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm
## <chr> <chr> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7
## 2 Adelie Torgersen 39.5 17.4
## 3 Adelie Torgersen 40.3 18
## 4 Adelie Torgersen NA NA
## 5 Adelie Torgersen 36.7 19.3
## 6 Adelie Torgersen 39.3 20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, year <dbl>
Create an R markdown file with the code and discussion for the tasks below.
- What features do you use for clustering? What about missing values? Discuss your answers. Do you need to scale the data before clustering? Why?
- What distance measure do you use to reflect similarities between penguins? See Measures of Similarity and Dissimilarity in Chapter 2.
- Apply k-means clustering. Use an appropriate method to determine the number of clusters. Compare the clustering using unscaled data and scaled data. What is the difference? Visualize and describe the results.
- Apply hierarchical clustering. Create a dendrogram and discuss what it means.
- Apply DBSCAN. How do you choose the parameters? How well does it work?