7 Cluster Analysis
This chapter introduces cluster analysis using K-means, hierarchical clustering, and DBSCAN. We will discuss how to choose the number of clusters and how to evaluate the quality of clusterings. In addition, we will introduce more clustering algorithms and discuss how clustering is influenced by outliers.
The corresponding chapter of the data mining textbook is available online: Chapter 7: Cluster Analysis: Basic Concepts and Algorithms.
Packages Used in this Chapter
pkgs <- c("cluster", "dbscan", "e1071", "factoextra", "fpc", 
          "GGally", "kernlab", "mclust", "mlbench", "scatterpie", 
          "seriation", "tidyverse")
  
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)
The packages used for this chapter are:
- cluster (Maechler et al. 2025)
- dbscan (Hahsler and Piekenbrock 2025)
- e1071 (Meyer et al. 2024)
- factoextra (Kassambara and Mundt 2020)
- fpc (Hennig 2024)
- GGally (Schloerke et al. 2025)
- kernlab (Karatzoglou, Smola, and Hornik 2024)
- mclust (Fraley, Raftery, and Scrucca 2024)
- mlbench (Leisch and Dimitriadou 2024)
- scatterpie (Yu 2025)
- seriation (Hahsler, Buchta, and Hornik 2025)
- tidyverse (Wickham 2023b)
7.1 Overview
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
Clustering is also called unsupervised learning, because it tries to directly learn the structure of the data and does not rely on the availability of a correct answer or class label as supervised learning does. Clustering is often used for exploratory analysis or to preprocess data by grouping.
You can read the free sample chapter from the textbook (Tan, Steinbach, and Kumar 2005): Chapter 7. Cluster Analysis: Basic Concepts and Algorithms
7.1.1 Data Preparation
We will use here a small and very clean toy dataset called Ruspini which is included in the R package cluster.
data(ruspini, package = "cluster")
The Ruspini data set consists of 75 points in four groups and is popular for illustrating clustering techniques. It is a very simple data set with well-separated clusters. The original dataset has the points ordered by group. We can shuffle the data (rows) using sample_frac, which by default samples 100%.
ruspini <- as_tibble(ruspini) |> 
  sample_frac()
ruspini
## # A tibble: 75 × 2
##        x     y
##    <int> <int>
##  1    44   149
##  2    13    49
##  3     5    63
##  4    36    72
##  5    31    60
##  6    66    23
##  7    34   141
##  8    97   122
##  9    86   132
## 10   101   115
## # ℹ 65 more rows
7.1.2 Data cleaning
ggplot(ruspini, aes(x = x, y = y)) + geom_point()
summary(ruspini)
##        x               y        
##  Min.   :  4.0   Min.   :  4.0  
##  1st Qu.: 31.5   1st Qu.: 56.5  
##  Median : 52.0   Median : 96.0  
##  Mean   : 54.9   Mean   : 92.0  
##  3rd Qu.: 76.5   3rd Qu.:141.5  
##  Max.   :117.0   Max.   :156.0
For most clustering algorithms it is necessary to handle missing values and outliers (e.g., remove the observations). For details see the section on outliers in clustering below. This data set has no missing values or strong outliers and looks like it has some very clear groups.
7.1.3 Scale data
Clustering algorithms use distances, and the variables with the largest value range will dominate the distance calculation. The summary above shows that this is not an issue for the Ruspini dataset, with both x and y being roughly between 0 and 150. Most data analysts will still scale each column in the data to zero mean and unit standard deviation (z-scores).
Note: The standard scale() function scales a whole data
matrix so we implement a function for a single vector and apply it to
all numeric columns.
## I use this till tidyverse implements a scale function
scale_numeric <- function(x) {
  x |> mutate(across(where(is.numeric), 
                function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)))
}
ruspini_scaled <- ruspini |> 
  scale_numeric()
summary(ruspini_scaled)
##        x                 y          
##  Min.   :-1.6681   Min.   :-1.8074  
##  1st Qu.:-0.7665   1st Qu.:-0.7295  
##  Median :-0.0944   Median : 0.0816  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7088   3rd Qu.: 1.0158  
##  Max.   : 2.0366   Max.   : 1.3136
After scaling, most z-scores will fall in the range \([-3,3]\) (z-scores are measured in standard deviations from the mean), where \(0\) means average.
7.2 K-means
k-means implicitly assumes Euclidean distances. We use \(k = 4\) clusters and run the algorithm 10 times with randomly initialized centroids. The best result is returned.
km <- kmeans(ruspini_scaled, centers = 4, nstart = 10)
km
## K-means clustering with 4 clusters of sizes 20, 23, 17, 15
## 
## Cluster means:
##         x       y
## 1 -1.1386 -0.5560
## 2 -0.3595  1.1091
## 3  1.4194  0.4693
## 4  0.4607 -1.4912
## 
## Clustering vector:
##  [1] 2 1 1 1 1 4 2 3 3 3 3 4 2 3 2 2 1 1 3 3 1 4 1 4 2 4 3 3
## [29] 4 4 3 2 2 1 1 2 1 2 2 1 3 2 4 4 2 3 2 2 4 2 1 3 1 3 1 2
## [57] 3 3 1 1 4 4 2 1 4 2 2 1 2 2 1 4 3 4 2
## 
## Within cluster sum of squares by cluster:
## [1] 2.705 2.659 3.641 1.082
##  (between_SS / total_SS =  93.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"       
## [4] "withinss"     "tot.withinss" "betweenss"   
## [7] "size"         "iter"         "ifault"
km is an R object implemented as a list.
str(km)
## List of 9
##  $ cluster     : int [1:75] 2 1 1 1 1 4 2 3 3 3 ...
##  $ centers     : num [1:4, 1:2] -1.139 -0.36 1.419 0.461 -0.556 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "1" "2" "3" "4"
##   .. ..$ : chr [1:2] "x" "y"
##  $ totss       : num 148
##  $ withinss    : num [1:4] 2.71 2.66 3.64 1.08
##  $ tot.withinss: num 10.1
##  $ betweenss   : num 138
##  $ size        : int [1:4] 20 23 17 15
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
The clustering vector is just a list element containing the cluster assignment for each data row and can be accessed using km$cluster. I add the cluster assignment as a column to the original dataset (I make it a factor since it represents a nominal label).
ruspini_clustered <- ruspini |> 
  add_column(cluster = factor(km$cluster))
ruspini_clustered
## # A tibble: 75 × 3
##        x     y cluster
##    <int> <int> <fct>  
##  1    44   149 2      
##  2    13    49 1      
##  3     5    63 1      
##  4    36    72 1      
##  5    31    60 1      
##  6    66    23 4      
##  7    34   141 2      
##  8    97   122 3      
##  9    86   132 3      
## 10   101   115 3      
## # ℹ 65 more rows
ggplot(ruspini_clustered, aes(x = x, y = y)) + 
  geom_point(aes(color = cluster))
Add the centroids to the plot. The centroids are scaled, so we need to unscale them to plot them over the original data. The second geom_point layer does not use the original data but uses the centroids as its dataset.
unscale <- function(x, original_data) {
  if (ncol(x) != ncol(original_data))
    stop("Function needs matching columns!")
  x * matrix(apply(original_data, MARGIN = 2, sd, na.rm = TRUE), 
             byrow = TRUE, nrow = nrow(x), ncol = ncol(x)) + 
    matrix(apply(original_data, MARGIN = 2, mean, na.rm = TRUE), 
           byrow = TRUE, nrow = nrow(x), ncol = ncol(x))
}
centroids <- km$centers %>% 
                unscale(original_data = ruspini) %>% 
                as_tibble(rownames = "cluster")
centroids
## # A tibble: 4 × 3
##   cluster     x     y
##   <chr>   <dbl> <dbl>
## 1 1        20.2  65.0
## 2 2        43.9 146. 
## 3 3        98.2 115. 
## 4 4        68.9  19.4
ggplot(ruspini_clustered, aes(x = x, y = y)) + 
  geom_point(aes(color = cluster)) +
  geom_point(data = centroids, aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)
The factoextra package also provides a good visualization with object labels and ellipses for clusters.
library(factoextra)
fviz_cluster(km, data = ruspini_scaled, centroids = TRUE, 
             repel = TRUE, ellipse.type = "norm")
7.2.1 Inspect Clusters
We inspect the clusters created by the 4-cluster k-means solution. The following code can be adapted to be used for other clustering methods.
Inspect the centroids with horizontal bar charts organized by cluster. To group the plots by cluster, we have to change the data format to the “long”-format using a pivot operation. I use colors to match the clusters in the scatter plots.
ggplot(pivot_longer(centroids, 
                    cols = c(x, y), 
                    names_to = "feature"),
  #aes(x = feature, y = value, fill = cluster)) +
  aes(x = value, y = feature, fill = cluster)) +
  geom_bar(stat = "identity") +
  facet_grid(cols = vars(cluster))
7.2.2 Extract a Single Cluster
You just need to filter the rows corresponding to the cluster index. The next example calculates summary statistics and then plots all data points of cluster 1.
cluster1 <- ruspini_clustered |> 
  filter(cluster == 1)
cluster1
## # A tibble: 20 × 3
##        x     y cluster
##    <int> <int> <fct>  
##  1    13    49 1      
##  2     5    63 1      
##  3    36    72 1      
##  4    31    60 1      
##  5    27    72 1      
##  6     9    77 1      
##  7    28    76 1      
##  8    24    58 1      
##  9    27    55 1      
## 10    18    61 1      
## 11    15    75 1      
## 12    13    69 1      
## 13    12    88 1      
## 14    10    59 1      
## 15    19    65 1      
## 16    32    61 1      
## 17    22    74 1      
## 18     4    53 1      
## 19    28    60 1      
## 20    30    52 1
summary(cluster1)
##        x              y        cluster
##  Min.   : 4.0   Min.   :49.0   1:20   
##  1st Qu.:12.8   1st Qu.:58.8   2: 0   
##  Median :20.5   Median :62.0   3: 0   
##  Mean   :20.1   Mean   :65.0   4: 0   
##  3rd Qu.:28.0   3rd Qu.:72.5          
##  Max.   :36.0   Max.   :88.0
ggplot(cluster1, aes(x = x, y = y)) + geom_point() +
  coord_cartesian(xlim = c(0, 120), ylim = c(0, 160))
What happens if we try to cluster with 8 centers?
fviz_cluster(kmeans(ruspini_scaled, centers = 8), data = ruspini_scaled,
  centroids = TRUE,  geom = "point", ellipse.type = "norm")
## Too few points to calculate an ellipse
7.3 Agglomerative Hierarchical Clustering
Hierarchical clustering starts with a distance matrix. dist() defaults
to method = "euclidean".
Notes:
- Distance matrices become very large quickly (the space complexity is \(O(n^2)\) where \(n\) is the number of data points). It is only possible to calculate and store the matrix for small to medium data sets (maybe a few hundred thousand data points) in main memory. If your data is too large, then you can use sampling to reduce the number of points to cluster (see the sketch after these notes).
- The data needs to be scaled since we compute distances.
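As an illustration of the sampling idea mentioned above (this sketch is an addition and not part of the original text), we could cluster a random subset of the scaled Ruspini data:
# Hedged sketch: cluster only a random sample if the full distance matrix
# would be too large. slice_sample() is from dplyr (part of tidyverse).
ruspini_sample <- ruspini_scaled |> slice_sample(n = 50)
hc_sample <- hclust(dist(ruspini_sample), method = "complete")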
7.3.1 Creating a Dendrogram
d <- dist(ruspini_scaled)
hclust() implements agglomerative hierarchical clustering.
The linkage function defines how the distance between two clusters (sets of points)
is calculated. We specify the complete-link method where the distance between two groups
is measured by the distance between the two farthest apart points in the two groups.
hc <- hclust(d, method = "complete")
Hierarchical clustering does not return cluster assignments but a dendrogram object which shows how the objects on the x-axis are successively joined together when going up along the y-axis. The y-axis represents the distance at which groups of points are joined together. The standard plot function displays the dendrogram.
plot(hc)
The package factoextra provides a ggplot function to plot dendrograms.
fviz_dend(hc)
More plotting options for dendrograms, including plotting parts of large dendrograms can be found here.
7.3.2 Extracting a Partitional Clustering
Partitional clusterings with cluster assignments can be extracted from a dendrogram
by cutting the dendrogram horizontally using cutree(). Here we cut it into four parts
and add the cluster id to the data.
clusters <- cutree(hc, k = 4)
cluster_complete <- ruspini_scaled |>
  add_column(cluster = factor(clusters))
cluster_complete
## # A tibble: 75 × 3
##         x      y cluster
##     <dbl>  <dbl> <fct>  
##  1 -0.357  1.17  1      
##  2 -1.37  -0.883 2      
##  3 -1.64  -0.596 2      
##  4 -0.619 -0.411 2      
##  5 -0.783 -0.658 2      
##  6  0.365 -1.42  3      
##  7 -0.685  1.01  1      
##  8  1.38   0.615 4      
##  9  1.02   0.821 4      
## 10  1.51   0.472 4      
## # ℹ 65 more rows
The 4 different clusters can be shown in the dendrogram using different colors.
fviz_dend(hc, k = 4)
We can also color in the scatterplot.
ggplot(cluster_complete, aes(x, y, color = cluster)) +
  geom_point()
The package factoextra provides an alternative visualization. Since it needs
a partition object as the first argument, we create one as a simple list.
fviz_cluster(list(data = ruspini_scaled, 
                  cluster = cutree(hc, k = 4)), 
             geom = "point")
Here are the results if we use the single-link method to join clusters. Single-link measures the distance between two groups by the distance between the two closest points from the two groups.
hc_single <- hclust(d, method = "single")
fviz_dend(hc_single, k = 4)
## Warning in seq.default(15, 375, length = k + 1): partial
## argument match of 'length' to 'length.out'
fviz_cluster(list(data = ruspini_scaled, 
                  cluster = cutree(hc_single, k = 4)), 
             geom = "point")
The most important difference between complete-link and single-link is that complete-link prefers globular clusters and single-link shows chaining (see the staircase pattern in the dendrogram).
7.4 DBSCAN
DBSCAN stands for “Density-Based
Spatial Clustering of Applications with Noise.” It groups together
points that are closely packed together and treats points in low-density
regions as outliers. DBSCAN is implemented in package dbscan.
library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
7.4.1 DBSCAN Parameters
DBSCAN has two parameters that interact with each other. Changing one typically means that the other one also has to be adjusted.
minPts defines how many points are needed in the epsilon neighborhood to make a point a core point for a cluster. It is often chosen as a smoothing parameter, where larger values smooth the density estimation more but also ignore smaller structures with fewer than minPts points.
eps is the radius of the epsilon neighborhood around a point in which the
number of points is counted.
Users typically first select minPts. We use here minPts = 4.
To decide on epsilon, the knee in the kNN distance plot is often used.
All points are sorted by their kNN distance. The idea is that points with
a low kNN distance are located in a dense area which should become a cluster.
Points with a large kNN distance are in a low density area and most
likely represent outliers and noise.
kNNdistplot(ruspini_scaled, minPts = 4)
abline(h = .32, col = "red")
Note that minPts counts the point itself, while the k-nearest neighbor distance calculation does not. The plot therefore uses k = minPts - 1.
The knee is visible around eps = .32 (shown by the manually added red line).
The points to the left of the intersection of the
k-nearest neighbor distance line and the red line are the points that will be
core points during clustering at the specified parameters.
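As a quick check (this code is an addition, not from the original text), we can count how many points would be core points at these settings, using the fact that a core point's (minPts - 1)-nearest neighbor distance is at most eps:
# A point is a core point if at least minPts points (including itself)
# lie within eps, i.e., its (minPts - 1)-NN distance is <= eps.
core <- kNNdist(ruspini_scaled, k = 4 - 1) <= .32
sum(core)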
7.4.2 Cluster using DBSCAN
Clustering with dbscan() returns a dbscan object.
db <- dbscan(ruspini_scaled, eps = .32, minPts = 4)
db
## DBSCAN clustering for 75 objects.
## Parameters: eps = 0.32, minPts = 4
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 4 cluster(s) and 5 noise points.
## 
##  0  1  2  3  4 
##  5 23 20 15 12 
## 
## Available fields: cluster, eps, minPts, metric,
##                   borderPoints
str(db)
## List of 5
##  $ cluster     : int [1:75] 1 2 2 2 2 3 1 4 0 4 ...
##  $ eps         : num 0.32
##  $ minPts      : num 4
##  $ metric      : chr "euclidean"
##  $ borderPoints: logi TRUE
##  - attr(*, "class")= chr [1:2] "dbscan_fast" "dbscan"
The cluster element is the cluster assignment with cluster labels. The special cluster label 0 means that the point is noise and not assigned to a cluster.
ggplot(ruspini_scaled |> add_column(cluster = factor(db$cluster)),
  aes(x, y, color = cluster)) + geom_point()
DBSCAN found 5 noise points. fviz_cluster() can be
used to create a ggplot visualization.
fviz_cluster(db, ruspini_scaled, geom = "point")
Different values for minPts and eps can lead to vastly different clusterings.
Often, a misspecification leads to all points being noise points or a single cluster.
Also, if the clusters are not well separated, then DBSCAN has a hard
time splitting them.
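For illustration (this example is an addition), an eps that is chosen much too small leaves most points without enough neighbors, so they are reported as noise:
# With a very small radius, few points reach the minPts threshold and
# most observations end up with the noise label 0.
dbscan(ruspini_scaled, eps = .1, minPts = 4)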
7.5 Cluster Evaluation
7.5.1 Unsupervised Cluster Evaluation
This is also often called internal cluster evaluation since it does not use external labels.
We will evaluate the k-means clustering from above.
7.5.1.1 Visual Methods
For data in 2 dimensions, a scatter plot will show if the clusters look right.
ggplot(ruspini_scaled, 
       aes(x, y, color = factor(km$cluster))) + 
  geom_point()
For data with more dimensions or with columns containing nominal/ordinal data, we cannot look at a scatter plot. However, we can visualize a reordered similarity/dissimilarity matrix.
First, we calculate a distance matrix and inspect the distances between the first 5 objects.
d <- dist(ruspini_scaled)
as.matrix(d)[1:5, 1:5]
##       1      2      3      4      5
## 1 0.000 2.2910 2.1801 1.6026 1.8765
## 2 2.291 0.0000 0.3891 0.8897 0.6319
## 3 2.180 0.3891 0.0000 1.0330 0.8546
## 4 1.603 0.8897 1.0330 0.0000 0.2959
## 5 1.876 0.6319 0.8546 0.2959 0.0000
Matrix visualizations with reordering are provided in package seriation. A matrix visualization creates a false-color image where each value in the matrix is shown as a pixel with the color representing the value.
library(seriation)
## 
## Attaching package: 'seriation'
## The following object is masked from 'package:lattice':
## 
##     panel.lines
pimage(d, main = "Unordered")
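To reveal the cluster structure, we can reorder the rows and columns by the k-means cluster labels (a sketch; it assumes that pimage() accepts a permutation vector via its order argument):
# Reorder objects so that members of the same cluster are adjacent.
pimage(d, order = order(km$cluster), main = "Reordered by cluster")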
 Remember, the rows and columns in the matrix represent the objects (rows of the original
data).
The reordered distance matrix has the objects ordered by cluster (all points in cluster 1 followed
by all points in cluster 2, and so on).
The matrix visualization shows a clear block structure indicating that the points within
each cluster are close together (low distance), while the other clusters are farther away.
An advanced version of this plot is called a dissimilarity plot. It reorders rows and columns based on cluster labels and adds lines for cluster borders. It also presents average distance values above the diagonal to make the structure between clusters easier to evaluate.
dissplot(d, km$cluster)
7.5.1.2 Evaluation Metrics
The two most popular quality metrics are the within-cluster sum of
squares (WCSS) used as the optimization objective by
\(k\)-means and the
average silhouette
width. Look at
within.cluster.ss and avg.silwidth below.
A comprehensive set of evaluation metrics is calculated by cluster.stats() in package fpc.
# library(fpc)
fpc::cluster.stats(d, as.integer(km$cluster))
## $n
## [1] 75
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 20 23 17 15
## 
## $min.cluster.size
## [1] 15
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.1193 1.1591 1.4627 0.8359
## 
## $average.distance
## [1] 0.4824 0.4286 0.5806 0.3564
## 
## $median.distance
## [1] 0.4492 0.3934 0.5024 0.3380
## 
## $separation
## [1] 1.1577 0.7676 0.7676 1.1577
## 
## $average.toother
## [1] 2.157 2.149 2.293 2.308
## 
## $separation.matrix
##       [,1]   [,2]   [,3]  [,4]
## [1,] 0.000 1.2199 1.3397 1.158
## [2,] 1.220 0.0000 0.7676 1.958
## [3,] 1.340 0.7676 0.0000 1.308
## [4,] 1.158 1.9577 1.3084 0.000
## 
## $ave.between.matrix
##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.000 1.887 2.772 1.874
## [2,] 1.887 0.000 1.925 2.750
## [3,] 2.772 1.925 0.000 2.220
## [4,] 1.874 2.750 2.220 0.000
## 
## $average.between
## [1] 2.219
## 
## $average.within
## [1] 0.463
## 
## $n.between
## [1] 2091
## 
## $n.within
## [1] 684
## 
## $max.diameter
## [1] 1.463
## 
## $min.separation
## [1] 0.7676
## 
## $within.cluster.ss
## [1] 10.09
## 
## $clus.avg.silwidths
##      1      2      3      4 
## 0.7211 0.7455 0.6813 0.8074 
## 
## $avg.silwidth
## [1] 0.7368
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.8416
## 
## $dunn
## [1] 0.5248
## 
## $dunn2
## [1] 3.228
## 
## $entropy
## [1] 1.373
## 
## $wb.ratio
## [1] 0.2086
## 
## $ch
## [1] 323.6
## 
## $cwidegap
## [1] 0.2612 0.3153 0.4150 0.2352
## 
## $widestgap
## [1] 0.415
## 
## $sindex
## [1] 0.8583
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL
Some notes about the code:
- I do not load the package fpc using library() since it would mask the dbscan() function in package dbscan. Instead I use the namespace operator ::.
- The clustering (the second argument) has to be supplied as a vector of integers (cluster IDs) and cannot be a factor (to make sure, we can use as.integer()).
Most metrics are easy to identify by their name. Some important metrics are:
- cluster.size: vector of cluster sizes
- within.cluster.ss: within-cluster sum of squares (the k-means objective function).
- avg.silwidth: average silhouette width
- pearsongamma: correlation between the distances and a 0-1-vector cluster incidence matrix.
Some numbers are NULL. These are measures only available for supervised evaluation.
Read the man page for cluster.stats() for an explanation of all the available indices.
We can compare different clusterings.
sapply(
  list(
    km_4 = km$cluster,
    hc_compl_4 = cutree(hc, k = 4),
    hc_compl_5 = cutree(hc, k = 5),
    hc_single_5 = cutree(hc_single, k = 5)
  ),
  FUN = function(x)
    fpc::cluster.stats(d, x))[c("within.cluster.ss", 
                                "avg.silwidth", 
                                "pearsongamma", 
                                "dunn"), ]
##                   km_4   hc_compl_4 hc_compl_5 hc_single_5
## within.cluster.ss 10.09  10.09      8.314      7.791      
## avg.silwidth      0.7368 0.7368     0.6642     0.6886     
## pearsongamma      0.8416 0.8416     0.8042     0.816      
## dunn              0.5248 0.5248 0.1988 0.358
With 4 clusters, k-means and hierarchical clustering produce exactly the same clustering. While the two hierarchical clusterings with 5 clusters produce a smaller WCSS, they are actually worse according to the three other measures.
Next, we look at the silhouette using a silhouette plot.
library(cluster)
## 
## Attaching package: 'cluster'
## The following object is masked _by_ '.GlobalEnv':
## 
##     ruspini
plot(silhouette(km$cluster, d))
Note: The silhouette plot does not show correctly in RStudio if you have too many objects (bars are missing). It will work when you open a new plotting device with windows(), x11() or quartz().
A ggplot2 visualization is provided by factoextra.
fviz_silhouette(silhouette(km$cluster, d))
##   cluster size ave.sil.width
## 1       1   20          0.72
## 2       2   23          0.75
## 3       3   17          0.68
## 4       4   15          0.81
7.5.2 Determining the Correct Number of Clusters
The user needs to specify the number of clusters for most clustering algorithms. Determining the number of clusters in a data set is therefore an important task before we can cluster data. We will apply different methods to the scaled Ruspini data set.
ggplot(ruspini_scaled, aes(x, y)) + geom_point()
## We will use different methods and try 2-10 clusters.
set.seed(1234)
ks <- 2:10
7.5.2.1 Elbow Method: Within-Cluster Sum of Squares
A method often used for k-means is to calculate the within-cluster sum of squares for different numbers of
clusters and look for the knee or
elbow in the
plot. We use nstart = 5 to restart k-means 5 times and return the best
solution.
WCSS <- sapply(ks, FUN = function(k) {
  kmeans(ruspini_scaled, centers = k, nstart = 5)$tot.withinss
  })
ggplot(tibble(ks, WCSS), aes(ks, WCSS)) + 
  geom_line() +
  geom_vline(xintercept = 4, color = "red", linetype = 2)
7.5.2.2 Average Silhouette Width
Another popular method (often preferred for clustering methods starting with a distance matrix) is to plot the average silhouette width for different numbers of clusters and look for the maximum in the plot.
ASW <- sapply(ks, FUN=function(k) {
  fpc::cluster.stats(d, 
                     kmeans(ruspini_scaled, 
                            centers = k, 
                            nstart = 5)$cluster)$avg.silwidth
  })
best_k <- ks[which.max(ASW)]
best_k
## [1] 4
ggplot(tibble(ks, ASW), aes(ks, ASW)) + 
  geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)
7.5.2.3 Similarity Matrix Visualization
We can visualize a similarity matrix reordered by cluster labels to visually determine if the number of clusters is
a good fit for the data.
Package seriation provides a dissimilarity plot to make this easy.
ggdissplot(d, labels = km$cluster, 
         options = list(main = "k-means with k=4"))
The plot reorders rows and columns based on cluster labels and adds lines for cluster borders. It also presents average distance values above the diagonal to make the structure between clusters easier to evaluate. Above, we see a clear structure with the four clusters.
Next, we look at two examples where the number of clusters was misspecified.
ggdissplot(d, 
         labels = kmeans(ruspini_scaled, centers = 3, nstart = 5)$cluster)
ggdissplot(d, 
         labels = kmeans(ruspini_scaled, centers = 9, nstart = 5)$cluster)
In the example with 3 clusters, we see that the plot shows two separated triangles for the largest cluster, indicating that it contains two separate structures which should be their own clusters. For the example with 9 clusters, the plot tries to rearrange the clusters into four larger blocks, indicating that the number of clusters should be 4.
7.5.2.4 Dunn Index
The Dunn index is another internal measure given by the smallest separation between clusters scaled by the largest cluster diameter.
DI <- sapply(ks, FUN = function(k) {
  fpc::cluster.stats(d, 
                     kmeans(ruspini_scaled, centers = k, 
                            nstart = 5)$cluster)$dunn
})
best_k <- ks[which.max(DI)]
ggplot(tibble(ks, DI), aes(ks, DI)) + 
  geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)
7.5.2.5 Gap Statistic
The Gap statistic compares the change in within-cluster dispersion with that expected from a null model (see clusGap()). The default method is to choose the smallest k such that its value Gap(k) is not more than 1 standard error away from the first local maximum.
library(cluster)
k <- clusGap(ruspini_scaled, 
             FUN = kmeans,  
             nstart = 10, 
             K.max = 10)
## Warning in clusGap(ruspini_scaled, FUN = kmeans, nstart =
## 10, K.max = 10): partial argument match of 'FUN' to
## 'FUNcluster'
## Warning in match.call(definition, call, expand.dots,
## envir): partial argument match of 'FUN' to 'FUNcluster'
k
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = ruspini_scaled, FUNcluster = kmeans, K.max = 10, nstart = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##        logW E.logW      gap  SE.sim
##  [1,] 3.498  3.464 -0.03387 0.03824
##  [2,] 3.074  3.151  0.07724 0.03605
##  [3,] 2.678  2.903  0.22549 0.03861
##  [4,] 2.106  2.712  0.60533 0.03793
##  [5,] 1.987  2.574  0.58708 0.03599
##  [6,] 1.864  2.456  0.59181 0.03442
##  [7,] 1.732  2.352  0.61966 0.03577
##  [8,] 1.660  2.257  0.59679 0.03758
##  [9,] 1.612  2.168  0.55592 0.04024
## [10,] 1.563  2.090  0.52667 0.04291
plot(k)
There have been many other methods and indices proposed to determine the number of clusters. See, e.g., package NbClust.
7.5.3 Clustering Tendency
Most clustering algorithms will always produce a clustering, even if the data does not contain a cluster structure. It is typically good to check cluster tendency before attempting to cluster the data.
We use again the smiley data.
library(mlbench)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)$x
colnames(shapes) <- c("x", "y")
shapes <- as_tibble(shapes)
7.5.3.1 Scatter plots
The first step is visual inspection using scatter plots.
ggplot(shapes, aes(x = x, y = y)) + geom_point()
Cluster tendency is typically indicated by several separated point clouds. Often an appropriate number of clusters can also be visually obtained by counting the number of point clouds. We see four clusters, but the mouth is not convex/spherical and thus will pose a problem for algorithms like k-means.
If the data has more than two features then you can use a pairs plot (scatterplot matrix) or look at a scatterplot of the first two principal components using PCA.
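As a hedged illustration (the shapes data only has two features, so this is purely for demonstration), a PCA-based scatter plot could be created like this:
# Project the data onto the first two principal components and plot them.
pc <- prcomp(shapes)
ggplot(as_tibble(pc$x), aes(PC1, PC2)) + geom_point()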
7.5.3.2 Visual Analysis for Cluster Tendency Assessment (VAT)
VAT reorders the objects to show potential clustering tendency as a block structure (dark blocks along the main diagonal). We scale the data before using Euclidean distance.
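The distance matrix d_shapes used below is not shown in the text. A minimal sketch of how it can be computed, together with a VAT plot (assuming ggVAT() is available in seriation alongside ggiVAT()):
# Euclidean distances on the scaled data; ggVAT() shows the reordered matrix.
d_shapes <- dist(scale(shapes))
ggVAT(d_shapes)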

iVAT uses the largest distances for all possible paths between two objects instead of the direct distances to make the block structure better visible.
ggiVAT(d_shapes)
7.5.3.3 Hopkins statistic
factoextra can also create a VAT plot and calculate the Hopkins statistic to assess clustering tendency. For the Hopkins statistic, a sample of size \(n\) is drawn from the data and its nearest neighbor distribution is compared with that of a simulated dataset drawn from a random uniform distribution (see the detailed explanation). A value > .5 usually indicates a clustering tendency.
get_clust_tendency(shapes, n = 10)
## $hopkins_stat
## [1] 0.9498
## 
## $plot
Both plots show a strong cluster structure with 4 clusters.
7.5.3.4 Data Without Clustering Tendency
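The code that generates the random data is not shown here. A minimal sketch (assuming uniformly distributed random points, as typically used for this kind of demonstration):
# Uniform random data without any cluster structure and its distance matrix
# (data_random and d_random are used below).
set.seed(1234)
data_random <- tibble(x = runif(500), y = runif(500))
ggplot(data_random, aes(x, y)) + geom_point()
d_random <- dist(data_random)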

No point clouds are visible, just noise.

ggiVAT(d_random)
get_clust_tendency(data_random, n = 10, graph = FALSE)
## $hopkins_stat
## [1] 0.4642
## 
## $plot
## NULL
There is very little clustering structure visible, indicating low clustering tendency, and clustering should not be performed on this data. However, k-means can be used to partition the data into \(k\) regions of roughly equivalent size. This can be used as a data-driven discretization of the space.
7.5.3.5 k-means on Data Without Clustering Tendency
What happens if we perform k-means on data that has no inherent clustering structure?
km <- kmeans(data_random, centers = 4)
random_clustered <- data_random |> 
  add_column(cluster = factor(km$cluster))
ggplot(random_clustered, aes(x = x, y = y, color = cluster)) + 
  geom_point()
k-means discretizes the space into similarly sized regions.
7.5.4 Supervised Cluster Evaluation
Also called external cluster validation since it uses ground truth information. That is, the user has an idea how the data should be grouped. This could be a known class label not provided to the clustering algorithm.
We use an artificial data set with known groups.
library(mlbench)
set.seed(1234)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)
plot(shapes)
First, we prepare the data and hide the known class label which we will use for evaluation after clustering.
truth <- as.integer(shapes$class)
shapes <- shapes$x
colnames(shapes) <- c("x", "y")
shapes <- shapes |> scale() |> as_tibble()
ggplot(shapes, aes(x, y)) + geom_point()
Find the optimal number of clusters for k-means.
ks <- 2:20
Use the within-cluster sum of squares (look for the knee).
WCSS <- sapply(ks, FUN = function(k) {
  kmeans(shapes, centers = k, nstart = 10)$tot.withinss
})
ggplot(tibble(ks, WCSS), aes(ks, WCSS)) + geom_line()
Looks like it could be 7 clusters?
km <- kmeans(shapes, centers = 7, nstart = 10)
ggplot(shapes |> add_column(cluster = factor(km$cluster)), 
       aes(x, y, color = cluster)) +
  geom_point()
The mouth is an issue for k-means. We could use hierarchical clustering with single-linkage because the mouth is non-convex and chaining may help.
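The distance matrix and dendrogram used below are not shown above. A minimal sketch (an assumption about how d and hc are obtained for the shapes data with single-linkage):
# Distance matrix and single-link hierarchical clustering for the shapes data.
d <- dist(shapes)
hc <- hclust(d, method = "single")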
Find optimal number of clusters
ASW <- sapply(ks, FUN = function(k) {
  fpc::cluster.stats(d, cutree(hc, k))$avg.silwidth
})
ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line()
The maximum is clearly at 4 clusters.
hc_4 <- cutree(hc, 4)
ggplot(shapes |> add_column(cluster = factor(hc_4)), 
       aes(x, y, color = cluster)) +
  geom_point()
Compare the clusterings with the ground truth using the adjusted Rand index (ARI, also called the corrected Rand index), the variation of information (VI) index, mutual information (MI), entropy and purity.
fpc::cluster.stats() computes ARI and VI as comparative measures. I define functions for the purity and the entropy of a clustering given the ground truth here:
purity <- function(cluster, truth, show_table = FALSE) {
  if (length(cluster) != length(truth))
    stop("Cluster vector and ground truth vectors are not of the same length!")
  
  # tabulate
  tbl <- table(cluster, truth)
  if(show_table)
    print(tbl)
  
  # find majority class
  majority <- apply(tbl, 1, max)
  sum(majority) / length(cluster)
}
entropy <- function(cluster, truth, show_table = FALSE) {
  if (length(cluster) != length(truth))
    stop("Cluster vector and ground truth vectors are not of the same length!")
  
  # calculate membership probability of cluster to class
  tbl <- table(cluster, truth)
  p <- sweep(tbl, 2, colSums(tbl), "/")
  
  if(show_table)
    print(p)
  # calculate cluster entropy
  e <- -p * log(p, 2)
  e <- rowSums(e, na.rm = TRUE)
  
  # weighted sum over clusters
  w <- table(cluster) / length(cluster)
  sum(w * e)
}
We calculate the measures and add for comparison two random “clusterings” with 4 and 6 clusters.
random_4 <- sample(1:4, nrow(shapes), replace = TRUE)
random_6 <- sample(1:6, nrow(shapes), replace = TRUE)
r <- rbind(
  truth = c(
    unlist(fpc::cluster.stats(d, truth, 
                              truth, compareonly = TRUE)),
    purity = purity(truth, truth),
    entropy = entropy(truth, truth)
  ),
  
  kmeans_7 = c(
    unlist(fpc::cluster.stats(d, km$cluster, 
                              truth, compareonly = TRUE)),
    purity = purity(km$cluster, truth),
    entropy = entropy(km$cluster, truth)
    ),
  hc_4 = c(
    unlist(fpc::cluster.stats(d, hc_4, 
                              truth, compareonly = TRUE)),
    purity = purity(hc_4, truth),
    entropy = entropy(hc_4, truth)
    ),
  random_4 = c(
    unlist(fpc::cluster.stats(d, random_4, 
                              truth, compareonly = TRUE)),
    purity = purity(random_4, truth),
    entropy = entropy(random_4, truth)
    ),
  random_6 = c(
    unlist(fpc::cluster.stats(d, random_6, 
                              truth, compareonly = TRUE)),
    purity = purity(random_6, truth),
    entropy = entropy(random_6, truth)
    )
  )
r
##          corrected.rand     vi purity entropy
## truth          1.000000 0.0000  1.000  0.0000
## kmeans_7       0.638229 0.5709  1.000  0.2088
## hc_4           1.000000 0.0000  1.000  0.0000
## random_4      -0.003235 2.6832  0.418  1.9895
## random_6      -0.002125 3.0763  0.418  1.7129
Notes:
- Comparing the ground truth to itself produces perfect scores.
- Hierarchical clustering found the perfect clustering.
- Entropy and purity are heavily impacted by the number of clusters (more clusters improve the metric) and by the fact that the mouth contains 41.8% of all data points, which automatically becomes the majority class for purity.
- The adjusted Rand index shows clearly that the random clusterings have no relationship with the ground truth (values very close to 0). This is a very helpful property and explains why the ARI is the most popular measure.
Read the manual page for fpc::cluster.stats() for an explanation of all the available indices.
7.6 More Clustering Algorithms*
Note: Some of these methods are covered in Chapter 8 of the textbook.
7.6.1 Partitioning Around Medoids (PAM)
PAM tries to solve the \(k\)-medoids problem. The problem is similar to \(k\)-means, but uses medoids instead of centroids to represent clusters. Like hierarchical clustering, it typically works with a precomputed distance matrix. An advantage is that you can use any distance metric, not just Euclidean distances. Note: The medoid is the most central data point in the middle of the cluster. PAM is a lot more computationally expensive than k-means.
library(cluster)
d <- dist(ruspini_scaled)
str(d)
##  'dist' num [1:2775] 2.29 2.18 1.6 1.88 2.69 ...
##  - attr(*, "Size")= int 75
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "Euclidean"
##  - attr(*, "call")= language dist(x = ruspini_scaled)
p <- pam(d, k = 4)
p
## Medoids:
##      ID   
## [1,]  1  1
## [2,] 55 55
## [3,] 22 22
## [4,] 19 19
## Clustering vector:
##  [1] 1 2 2 2 2 3 1 4 4 4 4 3 1 4 1 1 2 2 4 4 2 3 2 3 1 3 4 4
## [29] 3 3 4 1 1 2 2 1 2 1 1 2 4 1 3 3 1 4 1 1 3 1 2 4 2 4 2 1
## [57] 4 4 2 2 3 3 1 2 3 1 1 2 1 1 2 3 4 3 1
## Objective function:
##  build   swap 
## 0.4423 0.3187 
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective" 
## [5] "isolation"  "clusinfo"   "silinfo"    "diss"      
## [9] "call"
Extract the clustering and medoids for visualization.
ruspini_clustered <- ruspini_scaled |> 
  add_column(cluster = factor(p$cluster))
medoids <- as_tibble(ruspini_scaled[p$medoids, ], 
                     rownames = "cluster")
medoids
## # A tibble: 4 × 3
##   cluster      x      y
##   <chr>    <dbl>  <dbl>
## 1 1       -0.357  1.17 
## 2 2       -1.18  -0.555
## 3 3        0.463 -1.46 
## 4 4        1.45   0.554
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = medoids, aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)
Alternative visualization using fviz_cluster().
## __Note:__ `fviz_cluster` needs the original data.
fviz_cluster(c(p, list(data = ruspini_scaled)), 
             geom = "point", 
             ellipse.type = "norm")
7.6.2 Gaussian Mixture Models
library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
## 
##     map
Gaussian mixture models
assume that the data set is the result of drawing data from a set of
Gaussian distributions where each distribution represents a cluster.
Estimation algorithms try to identify the location parameters of the
distributions and thus can be used to find clusters. Mclust() uses
Bayesian Information Criterion (BIC) to find the number of clusters
(model selection). BIC uses the likelihood and a penalty term to guard
against overfitting.
m <- Mclust(ruspini_scaled)
summary(m)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 5
## components: 
## 
##  log-likelihood  n df    BIC    ICL
##          -91.26 75 16 -251.6 -251.7
## 
## Clustering table:
##  1  2  3  4  5 
## 23 20 15 14  3
plot(m, what = "classification")
Rerun with a fixed number of 4 clusters
m <- Mclust(ruspini_scaled, G=4)
summary(m)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 4
## components: 
## 
##  log-likelihood  n df    BIC    ICL
##          -101.6 75 13 -259.3 -259.3
## 
## Clustering table:
##  1  2  3  4 
## 23 20 15 17
plot(m, what = "classification")
7.6.3 Spectral Clustering
Spectral clustering works by embedding the data points of the partitioning problem into the subspace of the k largest eigenvectors of a normalized affinity/kernel matrix. It then uses a simple clustering method like k-means in that embedded space.
library("kernlab")
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:arulesSequences':
## 
##     size
## The following object is masked from 'package:scales':
## 
##     alpha
## The following object is masked from 'package:arules':
## 
##     size
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
cluster_spec <- specc(as.matrix(ruspini_scaled), centers = 4)
cluster_spec
## Spectral Clustering object of class "specc" 
## 
##  Cluster memberships: 
##  
## 4 3 3 3 3 2 4 1 1 1 1 2 4 1 4 4 3 3 1 1 3 2 3 2 4 2 1 1 2 2 1 4 4 3 3 4 3 4 4 3 1 4 2 2 4 1 4 4 2 4 3 1 3 1 3 4 1 1 3 3 2 2 4 3 2 4 4 3 4 4 3 2 1 2 4 
##  
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  20.8835033729148 
## 
## Centers:  
##         [,1]    [,2]
## [1,]  1.4194  0.4693
## [2,]  0.4607 -1.4912
## [3,] -1.1386 -0.5560
## [4,] -0.3595  1.1091
## 
## Cluster size:  
## [1] 17 15 20 23
## 
## Within-cluster sum of squares:  
## [1] 18.93 54.85 11.34 45.48
ggplot(ruspini_scaled |> 
         add_column(cluster = factor(cluster_spec)),
       aes(x, y, color = cluster)) + 
  geom_point()
7.6.4 Deep Clustering Methods
Deep clustering is often used for high-dimensional data. It uses a deep autoencoder to learn a cluster-friendly representation of the data before applying a standard clustering algorithm (e.g., k-means) on the embedded data. The autoencoder is typically implemented using keras and the autoencoder’s loss function is modified to include the clustering loss which will make the embedding more clusterable.
7.6.5 Fuzzy C-Means Clustering
Fuzzy c-means is the fuzzy clustering version of the k-means clustering problem. Each data point has a degree of membership in each cluster.
library("e1071")
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
## 
##     element
cluster_cmeans <- cmeans(as.matrix(ruspini_scaled), centers = 4)
cluster_cmeans
## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##         x       y
## 1 -1.1371 -0.5550
## 2 -0.3763  1.1143
## 3  0.4552 -1.4760
## 4  1.5048  0.5161
## 
## Memberships:
##               1         2         3         4
##  [1,] 0.0009643 9.977e-01 0.0004512 8.879e-04
##  [2,] 0.9148680 3.002e-02 0.0405062 1.461e-02
##  [3,] 0.8877281 4.917e-02 0.0431101 1.999e-02
##  [4,] 0.7683893 9.312e-02 0.0971203 4.137e-02
##  [5,] 0.8900960 3.663e-02 0.0549604 1.831e-02
##  [6,] 0.0038612 1.664e-03 0.9921767 2.298e-03
##  [7,] 0.0376056 9.294e-01 0.0133136 1.973e-02
##  [8,] 0.0032212 7.444e-03 0.0047479 9.846e-01
##  [9,] 0.0394179 1.267e-01 0.0461290 7.877e-01
## [10,] 0.0002500 5.072e-04 0.0004110 9.988e-01
## [11,] 0.0075741 1.388e-02 0.0135387 9.650e-01
## [12,] 0.0254632 9.526e-03 0.9531429 1.187e-02
## [13,] 0.0419832 9.187e-01 0.0155189 2.380e-02
## [14,] 0.1075095 1.739e-01 0.1773899 5.412e-01
## [15,] 0.0483850 9.103e-01 0.0168055 2.448e-02
## [16,] 0.0371539 9.245e-01 0.0146294 2.371e-02
## [17,] 0.9426524 2.538e-02 0.0220711 9.896e-03
## [18,] 0.8953110 5.308e-02 0.0336348 1.798e-02
## [19,] 0.0006092 1.324e-03 0.0009436 9.971e-01
## [20,] 0.0106794 1.929e-02 0.0192355 9.508e-01
## [21,] 0.9045073 4.507e-02 0.0339770 1.645e-02
## [22,] 0.0001096 5.052e-05 0.9997656 7.424e-05
## [23,] 0.9731699 9.545e-03 0.0127747 4.511e-03
## [24,] 0.0030791 1.396e-03 0.9934965 2.029e-03
## [25,] 0.0713882 7.016e-01 0.0509709 1.760e-01
## [26,] 0.0205931 1.087e-02 0.9503616 1.818e-02
## [27,] 0.0392620 9.618e-02 0.0536215 8.109e-01
## [28,] 0.1283082 2.177e-01 0.1837576 4.703e-01
## [29,] 0.0249349 1.136e-02 0.9472520 1.646e-02
## [30,] 0.0384838 2.343e-02 0.8921802 4.591e-02
## [31,] 0.0130801 2.683e-02 0.0205434 9.395e-01
## [32,] 0.0033849 9.924e-01 0.0014973 2.767e-03
## [33,] 0.0151633 9.597e-01 0.0078901 1.728e-02
## [34,] 0.9302684 2.248e-02 0.0358179 1.143e-02
## [35,] 0.9920941 3.143e-03 0.0034030 1.360e-03
## [36,] 0.0112519 9.693e-01 0.0059278 1.354e-02
## [37,] 0.9540725 2.256e-02 0.0155206 7.844e-03
## [38,] 0.0920675 7.514e-01 0.0519053 1.047e-01
## [39,] 0.0758512 8.622e-01 0.0256682 3.625e-02
## [40,] 0.9626044 1.710e-02 0.0138101 6.486e-03
## [41,] 0.0058692 1.130e-02 0.0099614 9.729e-01
## [42,] 0.0204611 9.395e-01 0.0114699 2.852e-02
## [43,] 0.0241504 1.011e-02 0.9523623 1.337e-02
## [44,] 0.0173389 9.017e-03 0.9591007 1.454e-02
## [45,] 0.0015600 9.964e-01 0.0007051 1.322e-03
## [46,] 0.0184319 3.390e-02 0.0318375 9.158e-01
## [47,] 0.0120649 9.754e-01 0.0047499 7.761e-03
## [48,] 0.0333622 8.915e-01 0.0199935 5.510e-02
## [49,] 0.0498123 1.727e-02 0.9125422 2.038e-02
## [50,] 0.0672233 7.551e-01 0.0448213 1.329e-01
## [51,] 0.8288352 9.817e-02 0.0452768 2.771e-02
## [52,] 0.0061528 1.483e-02 0.0087249 9.703e-01
## [53,] 0.9343415 2.687e-02 0.0272619 1.153e-02
## [54,] 0.0011602 2.456e-03 0.0018406 9.945e-01
## [55,] 0.9989320 4.474e-04 0.0004367 1.839e-04
## [56,] 0.0044839 9.885e-01 0.0022374 4.752e-03
## [57,] 0.0034560 8.073e-03 0.0050417 9.834e-01
## [58,] 0.0229230 4.091e-02 0.0405213 8.956e-01
## [59,] 0.8727657 4.260e-02 0.0633412 2.130e-02
## [60,] 0.9712833 1.357e-02 0.0102381 4.907e-03
## [61,] 0.0287389 1.083e-02 0.9470184 1.341e-02
## [62,] 0.0203875 1.111e-02 0.9492342 1.926e-02
## [63,] 0.0256390 9.467e-01 0.0103553 1.729e-02
## [64,] 0.8604239 5.522e-02 0.0593849 2.497e-02
## [65,] 0.0082410 3.347e-03 0.9839900 4.423e-03
## [66,] 0.0114703 9.753e-01 0.0048169 8.408e-03
## [67,] 0.0103266 9.787e-01 0.0041349 6.878e-03
## [68,] 0.9397726 2.104e-02 0.0290836 1.010e-02
## [69,] 0.0095409 9.763e-01 0.0046348 9.540e-03
## [70,] 0.0046268 9.889e-01 0.0021822 4.269e-03
## [71,] 0.8665154 3.837e-02 0.0740308 2.108e-02
## [72,] 0.0183428 1.016e-02 0.9537438 1.775e-02
## [73,] 0.0654674 1.100e-01 0.1188607 7.056e-01
## [74,] 0.0031669 1.347e-03 0.9936326 1.853e-03
## [75,] 0.0247839 9.272e-01 0.0139396 3.405e-02
## 
## Closest hard clustering:
##  [1] 2 1 1 1 1 3 2 4 4 4 4 3 2 4 2 2 1 1 4 4 1 3 1 3 2 3 4 4
## [29] 3 3 4 2 2 1 1 2 1 2 2 1 4 2 3 3 2 4 2 2 3 2 1 4 1 4 1 2
## [57] 4 4 1 1 3 3 2 1 3 2 2 1 2 2 1 3 4 3 2
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership" 
## [5] "iter"        "withinerror" "call"
Plot the memberships (shown as small pie charts).
library("scatterpie")
## scatterpie v0.2.6 Learn more at https://yulab-smu.top/
ggplot()  +
  geom_scatterpie(
    data = cbind(ruspini_scaled, cluster_cmeans$membership),
    aes(x = x, y = y), 
    cols = colnames(cluster_cmeans$membership), 
    legend_name = "Membership") + 
  coord_equal()
7.7 Scale Issues in Clustering*
To demonstrate this issue, I will use the unscaled Ruspini dataset and assume that we measure x in millimeters and y in meters. I do this by multiplying x by 1000.
ruspini_scale_issue <- ruspini |> mutate(x = x * 1000)
summary(ruspini_scale_issue)
##        x                y        
##  Min.   :  4000   Min.   :  4.0  
##  1st Qu.: 31500   1st Qu.: 56.5  
##  Median : 52000   Median : 96.0  
##  Mean   : 54880   Mean   : 92.0  
##  3rd Qu.: 76500   3rd Qu.:141.5  
##  Max.   :117000   Max.   :156.0
If we cluster the data now, then the algorithm fails!
km <- kmeans(ruspini_scale_issue, centers = 4)
ruspini_clustered <- ruspini_scale_issue |> 
  add_column(cluster = factor(km$cluster))
ggplot(ruspini_clustered, aes(x = x, y = y)) + 
  geom_point(aes(color = cluster))
The clusters form vertical bands across the whole plot. The reason is that the large numeric differences for the x-axis overpower the relatively small differences in the y-axis when distances are calculated. This issue can be avoided by scaling all variables first.
7.8 Outliers in Clustering*
Most clustering algorithms perform complete assignment (i.e., all data points need to be assigned to a cluster). Outliers will affect the clustering.
To show this effect, we add a clear outlier with a very large \(x\) value to the Ruspini dataset.
ruspini_outlier <- ruspini |> add_case(x=10000, y=100)
ggplot(ruspini_outlier, aes(x = x, y = y)) + 
  geom_point()
The outlier presents a problem for k-means, even if we scale the data. Scaling data with an outlier will make the direction of the outlier (in this case \(x\)) the most important.
ruspini_scaled_outlier <- ruspini_outlier |> scale_numeric()
km <- kmeans(ruspini_scaled_outlier, centers = 4, nstart = 10)
ruspini_scaled_outlier_km <- ruspini_scaled_outlier|>
  add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")
ggplot(ruspini_scaled_outlier_km, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = centroids, 
             aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)
Often, outliers become their own clusters. It is tempting to deal with this issue by adding additional clusters, one per outlier. But this does not work!
Here is the result with one additional cluster.
km <- kmeans(ruspini_scaled_outlier, centers = 5, nstart = 10)
ruspini_scaled_outlier_km <- ruspini_scaled_outlier|>
  add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")
ggplot(ruspini_scaled_outlier_km, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = centroids, 
             aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)
Here it seems to work, but this only happens because the clusters in the Ruspini dataset are nicely spread out over the \(y\) axis. The outlier suppresses the effect of variability in the \(x\) direction. The top two clusters in the following plot of the clustering without the outlier point show that the clustering just uses the \(y\) axis to cut the data and completely ignores \(x\).
ggplot(ruspini_scaled_outlier_km |> filter(row_number() <= n()-1), aes(x = x, y = y, color = cluster)) + 
  geom_point()
So this does not work! You need to identify and remove outliers before scaling the data. Methods to identify outliers are summary statistics, visual inspection, and outlier scores like the Local Outlier Factor (LOF) discussed in the following sections.
7.8.1 Visual Identification of Outliers
Outliers can be identified using summary statistics, histograms, scatterplots (pairs plots), boxplots, etc. We use here a pairs plot (the diagonal contains smoothed histograms). The outlier is visible as the single separate point in the scatter plot and as the long tail of the smoothed histogram for x. Remember, in scaled data, we expect most observations to fall in the range \([-3,3]\).
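A sketch of the pairs plot (an addition; it assumes GGally's ggpairs() applied to the data with the outlier):
library(GGally)
# Pairs plot with density estimates on the diagonal; the outlier appears as
# a single separate point and as a long tail in the x panels.
ggpairs(ruspini_outlier, progress = FALSE)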

The top left panel in the plot shows that there is an outlier in the x axis. The x value of regular points is way less than 2500 and we can use this information to identify outliers. We can find the row number by
which(ruspini_outlier$x > 2500)
## [1] 76
and then manually remove it before scaling the data.
7.8.2 Local Outlier Factor
The Local Outlier Factor (LOF) is related to concepts from DBSCAN and can help to identify potential outliers. LOF compares the density around a data point with the densities around its neighboring data points (you have to specify the neighborhood size \(k\)). If a data point has a much lower density than its neighbors, it has a high LOF score and is considered an outlier. The LOF value for a regular data point is close to 1; the larger the LOF value gets, the more likely the point is an outlier. It is useful to identify and remove strong outliers prior to clustering.
To calculate the LOF, a local neighborhood size (minPts in DBSCAN) for density estimation needs to be chosen. I use 10 here since I expect each of my clusters to have more than 10 points. Note that LOF uses distances, so we need to scale the data first.
ruspini_outlier_scaled <- ruspini_outlier |> scale_numeric()
lof <- lof(ruspini_outlier_scaled, minPts = 10)
lof
##  [1]  0.9286  1.3222  0.9521  1.1313  0.9664  1.0068  1.0342
##  [8]  0.9541  1.3508  0.9915  1.0356  0.9313  0.9289  1.5467
## [15]  1.0078  1.0820  1.1303  1.1708  0.9082  1.0356  1.1593
## [22]  0.9838  1.0139  0.9313  1.0623  1.1230  0.9915  1.8401
## [29]  1.6549  0.9828  1.1291  0.9570  1.0128  1.0483  0.9200
## [36]  1.0299  1.1362  1.0523  0.9999  1.1179  1.0078  1.0289
## [43]  1.2959  1.3537  1.0122  1.0080  1.0097  1.1418  1.0781
## [50]  1.3697  1.2170  1.2333  0.9845  1.0081  1.0028  0.9289
## [57]  1.0423  0.9915  0.9200  1.1800  1.0938  0.9468  1.0533
## [64]  1.0159  0.9451  0.9567  1.0003  0.9667  1.2781  1.0128
## [71]  1.0785  1.1230  1.8383  0.9626  0.9935 55.3714
ruspini_outlier_scaled  <- ruspini_outlier_scaled  |> add_column(lof = lof)
ggplot(ruspini_outlier_scaled, 
       aes(x, y, color = lof)) +
  geom_point() + 
  scale_color_gradient(low = "gray", high = "red")
Plot the points sorted by increasing LOF and look for a knee.
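A sketch of such a plot (this code is an addition; it sorts the LOF values and marks the threshold chosen below):
# Sort the LOF scores and look for a knee; the dashed line marks a threshold of 2.
ggplot(tibble(index = seq_along(lof), lof = sort(lof)), aes(index, lof)) +
  geom_line() +
  geom_hline(yintercept = 2, color = "red", linetype = 2)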

We choose here a threshold above 2.
ggplot(ruspini_outlier_scaled |> add_column(outlier = lof >= 2), 
       aes(x, y, color = outlier)) +
  geom_point()
You should analyze the found outliers. They are often interesting and important data points.
To model the regular data, perform clustering on the regular data points without outliers. Note that the data needs to be rescaled after the outlier is removed. This will remove the scaling issues caused by the outlier.
ruspini_scaled_clean <- ruspini_outlier_scaled |> 
  filter(lof < 2) |> 
  select(x, y) |> 
  scale_numeric()
km <- kmeans(ruspini_scaled_clean, centers = 4, nstart = 10)
ruspini_scaled_clean_km <- ruspini_scaled_clean|>
  add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")
ggplot(ruspini_scaled_clean_km, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = centroids, 
             aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)
You may have to repeat the identification, removal and rescaling steps several times to find outliers of different scale.
There are many other outlier removal strategies available. See, e.g., package outliers.
7.9 Exercises*
We will again use the Palmer penguin data for the exercises.
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <chr>   <chr>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## 6 Adelie  Torgersen           39.3          20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>
Create an R markdown file with the code and discussion for the following.
- What features do you use for clustering? What about missing values? Discuss your answers. Do you need to scale the data before clustering? Why?
- What distance measure do you use to reflect similarities between penguins? See Measures of Similarity and Dissimilarity in Chapter 2.
- Apply k-means clustering. Use an appropriate method to determine the number of clusters. Compare the clustering using unscaled data and scaled data. What is the difference? Visualize and describe the results.
- Apply hierarchical clustering. Create a dendrogram and discuss what it means.
- Apply DBSCAN. How do you choose the parameters? How well does it work?