7 Cluster Analysis

This chapter introduces cluster analysis using k-means, hierarchical clustering, and DBSCAN. We will discuss how to choose the number of clusters and how to evaluate the quality of clusterings. In addition, we will introduce more clustering algorithms and discuss how clustering is influenced by outliers.

The corresponding chapter of the data mining textbook is available online: Chapter 7: Cluster Analysis: Basic Concepts and Algorithms.

Packages Used in this Chapter

pkgs <- c("cluster", "dbscan", "e1071", "factoextra", "fpc", 
          "GGally", "kernlab", "mclust", "mlbench", "scatterpie", 
          "seriation", "tidyverse")
  
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The packages used for this chapter are: cluster, dbscan, e1071, factoextra, fpc, GGally, kernlab, mclust, mlbench, scatterpie, seriation, and tidyverse.

7.1 Overview

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Clustering is also called unsupervised learning because it tries to directly learn the structure of the data and does not rely on the availability of a correct answer or class label as supervised learning does. Clustering is often used for exploratory analysis or to preprocess data by grouping.

You can read the free sample chapter from the textbook (Tan, Steinbach, and Kumar 2005): Chapter 7. Cluster Analysis: Basic Concepts and Algorithms

7.1.1 Data Preparation

We will use here a small and very clean toy dataset called Ruspini which is included in the R package cluster.

data(ruspini, package = "cluster")

The Ruspini data set consists of 75 points in four groups and is popular for illustrating clustering techniques. It is a very simple data set with well-separated clusters. The original dataset has the points ordered by group. We can shuffle the data (rows) using sample_frac(), which by default samples 100% of the rows.

ruspini <- as_tibble(ruspini) |> 
  sample_frac()
ruspini
## # A tibble: 75 × 2
##        x     y
##    <int> <int>
##  1   115   117
##  2   111   126
##  3    85    96
##  4    12    88
##  5    36    72
##  6    28   147
##  7    24    58
##  8    99   128
##  9    46   142
## 10    38   151
## # ℹ 65 more rows

7.1.2 Data cleaning

ggplot(ruspini, aes(x = x, y = y)) + geom_point()
summary(ruspini)
##        x               y        
##  Min.   :  4.0   Min.   :  4.0  
##  1st Qu.: 31.5   1st Qu.: 56.5  
##  Median : 52.0   Median : 96.0  
##  Mean   : 54.9   Mean   : 92.0  
##  3rd Qu.: 76.5   3rd Qu.:141.5  
##  Max.   :117.0   Max.   :156.0

For most clustering algorithms it is necessary to handle missing values and outliers (e.g., by removing the affected observations). For details, see the section on outliers in clustering below. This data set has no missing values or strong outliers and looks like it has some very clear groups.

7.1.3 Scale data

Clustering algorithms use distances, and the variables with the largest value range will dominate the distance calculation. The summary above shows that this is not an issue for the Ruspini dataset, with both x and y being roughly between 0 and 150. Most data analysts will still scale each column in the data to zero mean and unit standard deviation (z-scores).

Note: The standard scale() function scales a whole data matrix so we implement a function for a single vector and apply it to all numeric columns.

## I use this till tidyverse implements a scale function
scale_numeric <- function(x) {
  x |> mutate(across(where(is.numeric), 
                     function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)))
}
ruspini_scaled <- ruspini |> 
  scale_numeric()
summary(ruspini_scaled)
##        x                 y          
##  Min.   :-1.6681   Min.   :-1.8074  
##  1st Qu.:-0.7665   1st Qu.:-0.7295  
##  Median :-0.0944   Median : 0.0816  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7088   3rd Qu.: 1.0158  
##  Max.   : 2.0366   Max.   : 1.3136

After scaling, most z-scores will fall in the range \([-3,3]\) (z-scores are measured in standard deviations from the mean), where \(0\) means average.
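
Since ruspini contains only numeric columns, an equivalent result can also be obtained with base R's scale() followed by converting the matrix back into a tibble (a minimal alternative to the helper function above; the same pattern is used later in this chapter):

ruspini_scaled_base <- ruspini |> 
  scale() |> 
  as_tibble()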

7.2 K-means

k-means implicitly assumes Euclidean distances. We use \(k = 4\) clusters and run the algorithm 10 times with randomly initialized centroids. The best result is returned.

km <- kmeans(ruspini_scaled, centers = 4, nstart = 10)
km
## K-means clustering with 4 clusters of sizes 23, 17, 15, 20
## 
## Cluster means:
##         x       y
## 1 -0.3595  1.1091
## 2  1.4194  0.4693
## 3  0.4607 -1.4912
## 4 -1.1386 -0.5560
## 
## Clustering vector:
##  [1] 2 2 2 4 4 1 4 2 1 1 1 1 4 3 3 2 4 3 2 1 1 1 2 2 3 2 1 3
## [29] 3 2 1 3 3 4 2 1 3 4 4 3 1 4 1 3 2 4 3 1 2 1 1 2 2 4 4 2
## [57] 1 4 3 4 1 1 2 4 4 4 1 1 3 3 4 1 1 4 4
## 
## Within cluster sum of squares by cluster:
## [1] 2.659 3.641 1.082 2.705
##  (between_SS / total_SS =  93.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"       
## [4] "withinss"     "tot.withinss" "betweenss"   
## [7] "size"         "iter"         "ifault"

km is an R object implemented as a list.

str(km)
## List of 9
##  $ cluster     : int [1:75] 2 2 2 4 4 1 4 2 1 1 ...
##  $ centers     : num [1:4, 1:2] -0.36 1.419 0.461 -1.139 1.109 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "1" "2" "3" "4"
##   .. ..$ : chr [1:2] "x" "y"
##  $ totss       : num 148
##  $ withinss    : num [1:4] 2.66 3.64 1.08 2.71
##  $ tot.withinss: num 10.1
##  $ betweenss   : num 138
##  $ size        : int [1:4] 23 17 15 20
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The clustering vector is just a list element containing the cluster assignment for each data row and can be accessed using km$cluster. I add the cluster assignment as a column to the original dataset (I make it a factor since it represents a nominal label).

ruspini_clustered <- ruspini |> 
  add_column(cluster = factor(km$cluster))
ruspini_clustered
## # A tibble: 75 × 3
##        x     y cluster
##    <int> <int> <fct>  
##  1   115   117 2      
##  2   111   126 2      
##  3    85    96 2      
##  4    12    88 4      
##  5    36    72 4      
##  6    28   147 1      
##  7    24    58 4      
##  8    99   128 2      
##  9    46   142 1      
## 10    38   151 1      
## # ℹ 65 more rows
ggplot(ruspini_clustered, aes(x = x, y = y)) + 
  geom_point(aes(color = cluster))

Add the centroids to the plot. The centroids are scaled, so we need to unscale them to plot them over the original data. The second geom_point does not use the original data but instead uses the centroids as its dataset.

unscale <- function(x, original_data) {
  if (ncol(x) != ncol(original_data))
    stop("Function needs matching columns!")
  x * matrix(apply(original_data, MARGIN = 2, sd, na.rm = TRUE), 
             byrow = TRUE, nrow = nrow(x), ncol = ncol(x)) + 
    matrix(apply(original_data, MARGIN = 2, mean, na.rm = TRUE), 
           byrow = TRUE, nrow = nrow(x), ncol = ncol(x))
}

centroids <- km$centers |> 
  unscale(original_data = ruspini) |> 
  as_tibble(rownames = "cluster")
centroids
## # A tibble: 4 × 3
##   cluster     x     y
##   <chr>   <dbl> <dbl>
## 1 1        43.9 146. 
## 2 2        98.2 115. 
## 3 3        68.9  19.4
## 4 4        20.2  65.0
ggplot(ruspini_clustered, aes(x = x, y = y)) + 
  geom_point(aes(color = cluster)) +
  geom_point(data = centroids, aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)

The factoextra package also provides a good visualization with object labels and ellipses for clusters.

library(factoextra)
fviz_cluster(km, data = ruspini_scaled, centroids = TRUE, 
             repel = TRUE, ellipse.type = "norm")

7.2.1 Inspect Clusters

We inspect the clusters created by the 4-cluster k-means solution. The following code can be adapted to be used for other clustering methods.

Inspect the centroids with horizontal bar charts organized by cluster. To group the plots by cluster, we have to change the data format to the “long”-format using a pivot operation. I use colors to match the clusters in the scatter plots.

ggplot(pivot_longer(centroids, 
                    cols = c(x, y), 
                    names_to = "feature"),
  #aes(x = feature, y = value, fill = cluster)) +
  aes(x = value, y = feature, fill = cluster)) +
  geom_bar(stat = "identity") +
  facet_grid(cols = vars(cluster))

7.2.2 Extract a Single Cluster

All you need to do is filter the rows corresponding to the cluster index. The next example calculates summary statistics and then plots all data points of cluster 1.

cluster1 <- ruspini_clustered |> 
  filter(cluster == 1)
cluster1
## # A tibble: 23 × 3
##        x     y cluster
##    <int> <int> <fct>  
##  1    28   147 1      
##  2    46   142 1      
##  3    38   151 1      
##  4    38   145 1      
##  5    50   142 1      
##  6    33   154 1      
##  7    55   155 1      
##  8    52   152 1      
##  9    32   149 1      
## 10    54   124 1      
## # ℹ 13 more rows
summary(cluster1)
##        x              y       cluster
##  Min.   :28.0   Min.   :124   1:23   
##  1st Qu.:36.5   1st Qu.:142   2: 0   
##  Median :44.0   Median :147   3: 0   
##  Mean   :43.9   Mean   :146   4: 0   
##  3rd Qu.:51.0   3rd Qu.:152          
##  Max.   :63.0   Max.   :156
ggplot(cluster1, aes(x = x, y = y)) + geom_point() +
  coord_cartesian(xlim = c(0, 120), ylim = c(0, 160))

What happens if we try to cluster with 8 centers?

fviz_cluster(kmeans(ruspini_scaled, centers = 8), data = ruspini_scaled,
  centroids = TRUE,  geom = "point", ellipse.type = "norm")

7.3 Agglomerative Hierarchical Clustering

Hierarchical clustering starts with a distance matrix. dist() defaults to method = "euclidean".

Notes:

  • Distance matrices become very large quickly (the space complexity is \(O(n^2)\) where \(n\) is the number of data points). It is only possible to calculate and store the matrix for small to medium data sets (maybe a few hundred thousand data points) in main memory. If your data is too large, then you can use sampling to reduce the number of points to cluster (see the sketch after this list).
  • The data needs to be scaled since we compute distances.
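
As a minimal sketch of the sampling idea mentioned above (the sample size of 50 is only an illustrative choice; the Ruspini data is small enough that no sampling is actually needed):

ruspini_scaled_sample <- ruspini_scaled |> 
  slice_sample(n = 50)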

7.3.1 Creating a Dendrogram

d <- dist(ruspini_scaled)

hclust() implements agglomerative hierarchical clustering. The linkage function defines how the distance between two clusters (sets of points) is calculated. We specify the complete-link method, where the distance between two groups is measured by the distance between the two points that are farthest apart.

hc <- hclust(d, method = "complete")

Hierarchical clustering does not return cluster assignments but a dendrogram object which shows how the objects on the x-axis are successively joined together when going up along the y-axis. The y-axis represents the distance at which groups of points are joined together. The standard plot function displays the dendrogram.

plot(hc)

The package factoextra provides a ggplot function to plot dendrograms.
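For example, fviz_dend() (also used below to color the clusters) plots the dendrogram created above:

fviz_dend(hc)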

More plotting options for dendrograms, including plotting parts of large dendrograms can be found here.

7.3.2 Extracting a Partitional Clustering

Partitional clusterings with cluster assignments can be extracted from a dendrogram by cutting the dendrogram horizontally using cutree(). Here we cut it into four parts and add the cluster id to the data.

clusters <- cutree(hc, k = 4)
cluster_complete <- ruspini_scaled |>
  add_column(cluster = factor(clusters))
cluster_complete
## # A tibble: 75 × 3
##         x       y cluster
##     <dbl>   <dbl> <fct>  
##  1  1.97   0.513  1      
##  2  1.84   0.698  1      
##  3  0.987  0.0816 1      
##  4 -1.41  -0.0827 2      
##  5 -0.619 -0.411  2      
##  6 -0.881  1.13   3      
##  7 -1.01  -0.699  2      
##  8  1.45   0.739  1      
##  9 -0.291  1.03   3      
## 10 -0.553  1.21   3      
## # ℹ 65 more rows

The 4 different clusters can be shown in the dendrogram using different colors.

fviz_dend(hc, k = 4)

We can also color in the scatterplot.

ggplot(cluster_complete, aes(x, y, color = cluster)) +
  geom_point()

The package factoextra provides an alternative visualization. Since it needs a partition object as the first argument, we create one as a simple list.

fviz_cluster(list(data = ruspini_scaled, 
                  cluster = cutree(hc, k = 4)), 
             geom = "point")

Here are the results if we use the single-link method to join clusters. Single-link measures the distance between two groups by the distance between the two closest points from the two groups.

hc_single <- hclust(d, method = "single")
fviz_dend(hc_single, k = 4)
fviz_cluster(list(data = ruspini_scaled, 
                  cluster = cutree(hc_single, k = 4)), 
             geom = "point")

The most important difference between complete-link and single-link is that complete-link prefers globular clusters and single-link shows chaining (see the staircase pattern in the dendrogram).
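
One way to quantify how well each dendrogram preserves the original distances is the cophenetic correlation, shown here as a small sketch using base R's cophenetic() (higher values mean the dendrogram distorts the pairwise distances less):

cor(d, cophenetic(hc))
cor(d, cophenetic(hc_single))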

7.4 DBSCAN

DBSCAN stands for “Density-Based Spatial Clustering of Applications with Noise.” It groups together points that are closely packed together and treats points in low-density regions as outliers. DBSCAN is implemented in package dbscan.

library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram

7.4.1 DBSCAN Parameters

DBSCAN has two parameters that interact with each other. Changing one typically means that the other one also has to be adjusted.

minPts defines how many points are needed in the epsilon neighborhood to make a point a core point for a cluster. It is often used as a smoothing parameter, where larger values smooth the density estimation more but also ignore smaller structures with fewer than minPts points.

eps is the radius of the epsilon neighborhood around a point in which the number of points is counted.

Users typically first select minPts. We use here minPts = 4. To decide on epsilon, the knee in the kNN distance plot is often used. All points are sorted by their kNN distance. The idea is that points with a low kNN distance are located in a dense area which should become a cluster. Points with a large kNN distance are in a low density area and most likely represent outliers and noise.

kNNdistplot(ruspini_scaled, minPts = 4)
abline(h = .32, col = "red")

Note that minPts contains the point itself, while the k-nearest neighbor distance calculation does not. The plot therefore uses k = minPts - 1.

The knee is visible around eps = .32 (shown by the manually added red line). The points to the left of the intersection of the k-nearest neighbor distance line and the red line are the points that will be core points during clustering at the specified parameters.

7.4.2 Cluster using DBSCAN

Clustering with dbscan() returns a dbscan object.

db <- dbscan(ruspini_scaled, eps = .32, minPts = 4)
db
## DBSCAN clustering for 75 objects.
## Parameters: eps = 0.32, minPts = 4
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 4 cluster(s) and 5 noise points.
## 
##  0  1  2  3  4 
##  5 12 20 23 15 
## 
## Available fields: cluster, eps, minPts, metric,
##                   borderPoints
str(db)
## List of 5
##  $ cluster     : int [1:75] 1 1 0 2 2 3 2 1 3 3 ...
##  $ eps         : num 0.32
##  $ minPts      : num 4
##  $ metric      : chr "euclidean"
##  $ borderPoints: logi TRUE
##  - attr(*, "class")= chr [1:2] "dbscan_fast" "dbscan"

The cluster element is the cluster assignment with cluster labels. The special cluster label 0 means that the point is noise and not assigned to a cluster.

ggplot(ruspini_scaled |> add_column(cluster = factor(db$cluster)),
  aes(x, y, color = cluster)) + geom_point()

DBSCAN found 5 noise points. fviz_cluster() can be used to create a ggplot visualization.

fviz_cluster(db, ruspini_scaled, geom = "point")

Different values for minPts and eps can lead to vastly different clusterings. Often, a misspecification leads to all points being noise points or a single cluster. Also, if the clusters are not well separated, then DBSCAN has a hard time splitting them.
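
For example (the parameter values below are only illustrative choices to demonstrate misspecification), a very small eps typically turns most points into noise, while a very large eps typically merges everything into a single cluster:

dbscan(ruspini_scaled, eps = .1, minPts = 4)
dbscan(ruspini_scaled, eps = 1, minPts = 4)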

7.5 Cluster Evaluation

7.5.1 Unsupervised Cluster Evaluation

This is also often called internal cluster evaluation since it does not use extra external labels. The two most popular quality metrics are the within-cluster sum of squares (WCSS) used as the optimization objective by \(k\)-means and the average silhouette width. Look at within.cluster.ss and avg.silwidth below.

##library(fpc)

Notes:

  • I do not load fpc since it would mask the dbscan() function from package dbscan.
  • The clustering (second argument below) has to be supplied as a vector of integers (cluster IDs) and cannot be a factor (use as.integer() to convert a factor to IDs).

fpc::cluster.stats(d, km$cluster)
## $n
## [1] 75
## 
## $cluster.number
## [1] 4
## 
## $cluster.size
## [1] 23 17 15 20
## 
## $min.cluster.size
## [1] 15
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.1591 1.4627 0.8359 1.1193
## 
## $average.distance
## [1] 0.4286 0.5806 0.3564 0.4824
## 
## $median.distance
## [1] 0.3934 0.5024 0.3380 0.4492
## 
## $separation
## [1] 0.7676 0.7676 1.1577 1.1577
## 
## $average.toother
## [1] 2.149 2.293 2.308 2.157
## 
## $separation.matrix
##        [,1]   [,2]  [,3]  [,4]
## [1,] 0.0000 0.7676 1.958 1.220
## [2,] 0.7676 0.0000 1.308 1.340
## [3,] 1.9577 1.3084 0.000 1.158
## [4,] 1.2199 1.3397 1.158 0.000
## 
## $ave.between.matrix
##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.000 1.925 2.750 1.887
## [2,] 1.925 0.000 2.220 2.772
## [3,] 2.750 2.220 0.000 1.874
## [4,] 1.887 2.772 1.874 0.000
## 
## $average.between
## [1] 2.219
## 
## $average.within
## [1] 0.463
## 
## $n.between
## [1] 2091
## 
## $n.within
## [1] 684
## 
## $max.diameter
## [1] 1.463
## 
## $min.separation
## [1] 0.7676
## 
## $within.cluster.ss
## [1] 10.09
## 
## $clus.avg.silwidths
##      1      2      3      4 
## 0.7455 0.6813 0.8074 0.7211 
## 
## $avg.silwidth
## [1] 0.7368
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.8416
## 
## $dunn
## [1] 0.5248
## 
## $dunn2
## [1] 3.228
## 
## $entropy
## [1] 1.373
## 
## $wb.ratio
## [1] 0.2086
## 
## $ch
## [1] 323.6
## 
## $cwidegap
## [1] 0.3153 0.4150 0.2352 0.2612
## 
## $widestgap
## [1] 0.415
## 
## $sindex
## [1] 0.8583
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

Read ? cluster.stats for an explanation of all the available indices.

sapply(
  list(
    km = km$cluster,
    hc_compl = cutree(hc, k = 4),
    hc_single = cutree(hc_single, k = 4)
  ),
  FUN = function(x)
    fpc::cluster.stats(d, x))[c("within.cluster.ss", "avg.silwidth"), ]
##                   km     hc_compl hc_single
## within.cluster.ss 10.09  10.09    10.09    
## avg.silwidth      0.7368 0.7368   0.7368

Next, we look at the silhouette using a silhouette plot.

library(cluster)
## 
## Attaching package: 'cluster'
## The following object is masked _by_ '.GlobalEnv':
## 
##     ruspini
plot(silhouette(km$cluster, d))

Note: The silhouette plot does not show correctly in RStudio if you have too many objects (bars are missing). It will work when you open a new plotting device with windows(), x11() or quartz().

A ggplot visualization can be created using factoextra:

fviz_silhouette(silhouette(km$cluster, d))
##   cluster size ave.sil.width
## 1       1   23          0.75
## 2       2   17          0.68
## 3       3   15          0.81
## 4       4   20          0.72

7.5.2 Unsupervised Cluster Evaluation using the Proximity Matrix

ggplot(ruspini_scaled, 
       aes(x, y, color = factor(km$cluster))) + 
  geom_point()
d <- dist(ruspini_scaled)

Inspect the distance matrix between the first 5 objects.

as.matrix(d)[1:5, 1:5]
##        1      2     3      4      5
## 1 0.0000 0.2266 1.074 3.4289 2.7498
## 2 0.2266 0.0000 1.052 3.3381 2.6972
## 3 1.0739 1.0517 0.000 2.3989 1.6803
## 4 3.4289 3.3381 2.399 0.0000 0.8527
## 5 2.7498 2.6972 1.680 0.8527 0.0000

A false-color image visualizes each value in the matrix as a pixel with the color representing the value.

Rows and columns are the objects as they are ordered in the data set. The diagonal represents the distance between an object and itself and has by definition a distance of 0 (dark line). Visualizing the unordered distance matrix does not show much structure, but we can reorder the matrix (rows and columns) using the k-means cluster labels from cluster 1 to 4. A clear block structure representing the clusters becomes visible.

pimage(d, order=order(km$cluster), col = bluered(100))

The plot function dissplot() in package seriation rearranges the matrix and adds lines and cluster labels. In the lower half of the plot, it shows the average dissimilarities between clusters. The function organizes the objects by cluster and then reorders clusters and objects within clusters so that more similar objects are closer together.

dissplot(d, labels = km$cluster, 
         options = list(main = "k-means with k=4"))

The reordering by dissplot makes the misspecification of k visible as blocks.

dissplot(d, 
         labels = kmeans(ruspini_scaled, centers = 3)$cluster, 
         col = bluered(100))
dissplot(d, 
         labels = kmeans(ruspini_scaled, centers = 9)$cluster, 
         col = bluered(100))

A similar visualization of the distance matrix can also be created using factoextra, as sketched below.
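
A minimal sketch using fviz_dist() (with order = TRUE it displays an ordered dissimilarity image comparable to the plots above):

fviz_dist(d, order = TRUE)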

7.5.3 Determining the Correct Number of Clusters

ggplot(ruspini_scaled, aes(x, y)) + geom_point()
## We will use different methods and try 2 to 10 clusters.
set.seed(1234)
ks <- 2:10

7.5.3.1 Elbow Method: Within-Cluster Sum of Squares

Calculate the within-cluster sum of squares for different numbers of clusters and look for the knee or elbow in the plot. (nstart = 5 just repeats k-means 5 times and returns the best solution)

WCSS <- sapply(ks, FUN = function(k) {
  kmeans(ruspini_scaled, centers = k, nstart = 5)$tot.withinss
  })

ggplot(tibble(ks, WCSS), aes(ks, WCSS)) + 
  geom_line() +
  geom_vline(xintercept = 4, color = "red", linetype = 2)

7.5.3.2 Average Silhouette Width

Plot the average silhouette width for different numbers of clusters and look for the maximum in the plot.

ASW <- sapply(ks, FUN=function(k) {
  fpc::cluster.stats(d, 
                     kmeans(ruspini_scaled, 
                            centers = k, 
                            nstart = 5)$cluster)$avg.silwidth
  })

best_k <- ks[which.max(ASW)]
best_k
## [1] 4
ggplot(tibble(ks, ASW), aes(ks, ASW)) + 
  geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)

7.5.3.3 Dunn Index

Use the Dunn index (another internal measure, given by the minimum separation divided by the maximum diameter) and look for the maximum in the plot.

DI <- sapply(ks, FUN = function(k) {
  fpc::cluster.stats(d, 
                     kmeans(ruspini_scaled, centers = k, 
                            nstart = 5)$cluster)$dunn
})

best_k <- ks[which.max(DI)]
ggplot(tibble(ks, DI), aes(ks, DI)) + 
  geom_line() +
  geom_vline(xintercept = best_k, color = "red", linetype = 2)

7.5.3.4 Gap Statistic

The gap statistic compares the change in within-cluster dispersion with that expected from a null model (see ? clusGap). The default method is to choose the smallest k such that its value Gap(k) is not more than 1 standard error away from the first local maximum.

library(cluster)
k <- clusGap(ruspini_scaled, 
             FUN = kmeans,  
             nstart = 10, 
             K.max = 10)
k
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = ruspini_scaled, FUNcluster = kmeans, K.max = 10, nstart = 10)
## B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##        logW E.logW      gap  SE.sim
##  [1,] 3.498  3.467 -0.03080 0.03570
##  [2,] 3.074  3.150  0.07625 0.03744
##  [3,] 2.678  2.903  0.22466 0.03796
##  [4,] 2.106  2.704  0.59710 0.03633
##  [5,] 1.987  2.570  0.58271 0.03474
##  [6,] 1.864  2.451  0.58713 0.03651
##  [7,] 1.732  2.348  0.61558 0.03954
##  [8,] 1.640  2.256  0.61570 0.04132
##  [9,] 1.561  2.172  0.61129 0.04090
## [10,] 1.513  2.093  0.57987 0.03934
plot(k)
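
The chosen number of clusters can also be extracted programmatically with maxSE() from package cluster, using the gap and SE.sim columns shown in the table above (a small sketch):

maxSE(k$Tab[, "gap"], k$Tab[, "SE.sim"], method = "firstSEmax")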

Note: these methods can also be used for hierarchical clustering.

There have been many other methods and indices proposed to determine the number of clusters. See, e.g., package NbClust.
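
As a rough sketch of how NbClust might be used (assuming the NbClust package is installed; it is not part of the package list at the beginning of this chapter), it computes many indices at once and suggests the number of clusters preferred by the majority of them:

library(NbClust)
NbClust(as.matrix(ruspini_scaled), min.nc = 2, max.nc = 10, method = "kmeans")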

7.5.4 Clustering Tendency

Most clustering algorithms will always produce a clustering, even if the data does not contain a cluster structure. It is typically good to check cluster tendency before attempting to cluster the data.

We use again the smiley data.

library(mlbench)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)$x
colnames(shapes) <- c("x", "y")
shapes <- as_tibble(shapes)

7.5.4.1 Scatter plots

The first step is visual inspection using scatter plots.

ggplot(shapes, aes(x = x, y = y)) + geom_point()

Cluster tendency is typically indicated by several separated point clouds. Often an appropriate number of clusters can also be obtained visually by counting the number of point clouds. We see four clusters, but the mouth is not convex/spherical and thus will pose a problem for algorithms like k-means.

If the data has more than two features then you can use a pairs plot (scatterplot matrix) or look at a scatterplot of the first two principal components using PCA.

7.5.4.2 Visual Analysis for Cluster Tendency Assessment (VAT)

VAT reorders the objects to show potential clustering tendency as a block structure (dark blocks along the main diagonal). We scale the data before using Euclidean distance.

library(seriation)

d_shapes <- dist(scale(shapes))
VAT(d_shapes, col = bluered(100))

iVAT replaces the direct distances with path-based distances (for each pair of objects, the smallest possible maximum step over all paths connecting them) to make the block structure better visible.

iVAT(d_shapes, col = bluered(100))

7.5.4.3 Hopkins statistic

factoextra can also create a VAT plot and calculate the Hopkins statistic to assess clustering tendency. For the Hopkins statistic, a sample of size \(n\) is drawn from the data and its nearest-neighbor distances are compared with those of a simulated dataset drawn from a random uniform distribution (see the detailed explanation). A value greater than 0.5 usually indicates a clustering tendency.

get_clust_tendency(shapes, n = 10)
## $hopkins_stat
## [1] 0.9074
## 
## $plot

Both plots show a strong cluster structure with 4 clusters.

7.5.4.4 Data Without Clustering Tendency

data_random <- tibble(x = runif(500), y = runif(500))
ggplot(data_random, aes(x, y)) + geom_point()

No point clouds are visible, just noise.

d_random <- dist(data_random)
VAT(d_random, col = bluered(100))
iVAT(d_random, col = bluered(100))
get_clust_tendency(data_random, n = 10, graph = FALSE)
## $hopkins_stat
## [1] 0.4642
## 
## $plot
## NULL

There is very little clustering structure visible, indicating low clustering tendency, and clustering should not be performed on this data. However, k-means can be used to partition the data into \(k\) regions of roughly equivalent size. This can be used as a data-driven discretization of the space.

7.5.4.5 k-means on Data Without Clustering Tendency

What happens if we perform k-means on data that has no inherent clustering structure?

km <- kmeans(data_random, centers = 4)

random_clustered<- data_random |> 
  add_column(cluster = factor(km$cluster))
ggplot(random_clustered, aes(x = x, y = y, color = cluster)) + 
  geom_point()

k-means discretizes the space into similarly sized regions.

7.5.5 Supervised Measures of Cluster Validity

This is also called external cluster validation since it uses ground truth information. That is, the user has an idea of how the data should be grouped. This could be a known class label that is not provided to the clustering algorithm.

We use an artificial data set with known groups.

library(mlbench)
set.seed(1234)
shapes <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)
plot(shapes)

Prepare data

truth <- as.integer(shapes$class)
shapes <- shapes$x
colnames(shapes) <- c("x", "y")

shapes <- shapes |> scale() |> as_tibble()

ggplot(shapes, aes(x, y)) + geom_point()

Find the optimal number of clusters for k-means.

ks <- 2:20

Use the within-cluster sum of squares (look for the knee).

WCSS <- sapply(ks, FUN = function(k) {
  kmeans(shapes, centers = k, nstart = 10)$tot.withinss
})

ggplot(tibble(ks, WCSS), aes(ks, WCSS)) + geom_line()

It looks like it could be 7 clusters.

km <- kmeans(shapes, centers = 7, nstart = 10)

ggplot(shapes |> add_column(cluster = factor(km$cluster)), 
       aes(x, y, color = cluster)) +
  geom_point()

Hierarchical clustering: We use single-link because the mouth is non-convex and chaining may help.

d <- dist(shapes)
hc <- hclust(d, method = "single")

Find optimal number of clusters

ASW <- sapply(ks, FUN = function(k) {
  fpc::cluster.stats(d, cutree(hc, k))$avg.silwidth
})

ggplot(tibble(ks, ASW), aes(ks, ASW)) + geom_line()

The maximum is clearly at 4 clusters.

hc_4 <- cutree(hc, 4)

ggplot(shapes |> add_column(cluster = factor(hc_4)), 
       aes(x, y, color = cluster)) +
  geom_point()

Compare the clusterings with the ground truth using the corrected (= adjusted) Rand index (ARI), the variation of information (VI) index, entropy, and purity.

cluster.stats() computes ARI and VI as comparative measures. I define functions for entropy and purity here:

entropy <- function(cluster, truth) {
  k <- max(cluster, truth)
  cluster <- factor(cluster, levels = 1:k)
  truth <- factor(truth, levels = 1:k)
  w <- table(cluster)/length(cluster)

  cnts <- sapply(split(truth, cluster), table)
  p <- sweep(cnts, 1, rowSums(cnts), "/")
  p[is.nan(p)] <- 0
  e <- -p * log(p, 2)

  sum(w * rowSums(e, na.rm = TRUE))
}

purity <- function(cluster, truth) {
  k <- max(cluster, truth)
  cluster <- factor(cluster, levels = 1:k)
  truth <- factor(truth, levels = 1:k)
  w <- table(cluster)/length(cluster)

  cnts <- sapply(split(truth, cluster), table)
  p <- sweep(cnts, 1, rowSums(cnts), "/")
  p[is.nan(p)] <- 0

  sum(w * apply(p, 1, max))
}

Calculate the measures (for comparison, we also use random “clusterings” with 4 and 6 clusters).

random_4 <- sample(1:4, nrow(shapes), replace = TRUE)
random_6 <- sample(1:6, nrow(shapes), replace = TRUE)

r <- rbind(
  kmeans_7 = c(
    unlist(fpc::cluster.stats(d, km$cluster, 
                              truth, compareonly = TRUE)),
    entropy = entropy(km$cluster, truth),
    purity = purity(km$cluster, truth)
    ),
  hc_4 = c(
    unlist(fpc::cluster.stats(d, hc_4, 
                              truth, compareonly = TRUE)),
    entropy = entropy(hc_4, truth),
    purity = purity(hc_4, truth)
    ),
  random_4 = c(
    unlist(fpc::cluster.stats(d, random_4, 
                              truth, compareonly = TRUE)),
    entropy = entropy(random_4, truth),
    purity = purity(random_4, truth)
    ),
  random_6 = c(
    unlist(fpc::cluster.stats(d, random_6, 
                              truth, compareonly = TRUE)),
    entropy = entropy(random_6, truth),
    purity = purity(random_6, truth)
    )
  )
r
##          corrected.rand     vi entropy purity
## kmeans_7       0.638229 0.5709  0.2286 0.4639
## hc_4           1.000000 0.0000  0.0000 1.0000
## random_4      -0.003235 2.6832  1.9883 0.2878
## random_6      -0.002125 3.0763  1.7281 0.1436

Notes:

  • Hierarchical clustering found the perfect clustering.
  • Entropy and purity are heavily impacted by the number of clusters (more clusters improve the metric).
  • The corrected Rand index clearly shows that the random clusterings have no relationship with the ground truth (values very close to 0). This is a very helpful property.

Read ? cluster.stats for an explanation of all the available indices.

7.6 More Clustering Algorithms*

Note: Some of these methods are covered in Chapter 8 of the textbook.

7.6.1 Partitioning Around Medoids (PAM)

PAM tries to solve the \(k\)-medoids problem. The problem is similar to \(k\)-means, but it uses medoids instead of centroids to represent clusters. Like hierarchical clustering, it typically works with a precomputed distance matrix. An advantage is that you can use any distance metric, not just Euclidean distance. Note: The medoid is the most central data point of a cluster.

library(cluster)

d <- dist(ruspini_scaled)
str(d)
##  'dist' num [1:2775] 0.227 1.074 3.429 2.75 2.918 ...
##  - attr(*, "Size")= int 75
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "Euclidean"
##  - attr(*, "call")= language dist(x = ruspini_scaled)
p <- pam(d, k = 4)
p
## Medoids:
##      ID   
## [1,] 49 49
## [2,] 46 46
## [3,] 41 41
## [4,] 44 44
## Clustering vector:
##  [1] 1 1 1 2 2 3 2 1 3 3 3 3 2 4 4 1 2 4 1 3 3 3 1 1 4 1 3 4
## [29] 4 1 3 4 4 2 1 3 4 2 2 4 3 2 3 4 1 2 4 3 1 3 3 1 1 2 2 1
## [57] 3 2 4 2 3 3 1 2 2 2 3 3 4 4 2 3 3 2 2
## Objective function:
##  build   swap 
## 0.4423 0.3187 
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective" 
## [5] "isolation"  "clusinfo"   "silinfo"    "diss"      
## [9] "call"
ruspini_clustered <- ruspini_scaled |> 
  add_column(cluster = factor(p$cluster))

medoids <- as_tibble(ruspini_scaled[p$medoids, ], 
                     rownames = "cluster")
medoids
## # A tibble: 4 × 3
##   cluster      x      y
##   <chr>    <dbl>  <dbl>
## 1 1        1.45   0.554
## 2 2       -1.18  -0.555
## 3 3       -0.357  1.17 
## 4 4        0.463 -1.46
ggplot(ruspini_clustered, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = medoids, aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)
## __Note:__ `fviz_cluster` needs the original data.
fviz_cluster(c(p, list(data = ruspini_scaled)), 
             geom = "point", 
             ellipse.type = "norm")

7.6.2 Gaussian Mixture Models

library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
## 
##     map

Gaussian mixture models assume that the data set is the result of drawing data from a set of Gaussian distributions where each distribution represents a cluster. Estimation algorithms try to identify the location parameters of the distributions and thus can be used to find clusters. Mclust() uses Bayesian Information Criterion (BIC) to find the number of clusters (model selection). BIC uses the likelihood and a penalty term to guard against overfitting.

m <- Mclust(ruspini_scaled)
summary(m)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 5
## components: 
## 
##  log-likelihood  n df    BIC    ICL
##          -91.26 75 16 -251.6 -251.7
## 
## Clustering table:
##  1  2  3  4  5 
## 14  3 20 23 15
plot(m, what = "classification")
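
The BIC values used for model selection can be inspected with another standard plot type of the fitted model object:

plot(m, what = "BIC")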

Rerun with a fixed number of 4 clusters

m <- Mclust(ruspini_scaled, G=4)
summary(m)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 4
## components: 
## 
##  log-likelihood  n df    BIC    ICL
##          -101.6 75 13 -259.3 -259.3
## 
## Clustering table:
##  1  2  3  4 
## 17 20 23 15
plot(m, what = "classification")

7.6.3 Spectral clustering

Spectral clustering works by embedding the data points of the partitioning problem into the subspace of the k largest eigenvectors of a normalized affinity/kernel matrix. It then uses a simple clustering method like k-means in this new space.

library("kernlab")
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:arulesSequences':
## 
##     size
## The following object is masked from 'package:scales':
## 
##     alpha
## The following object is masked from 'package:arules':
## 
##     size
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
cluster_spec <- specc(as.matrix(ruspini_scaled), centers = 4)
cluster_spec
## Spectral Clustering object of class "specc" 
## 
##  Cluster memberships: 
##  
## 3 3 3 4 4 1 4 3 1 1 1 1 4 2 2 3 4 2 3 1 1 1 3 3 2 3 1 2 2 3 1 2 2 4 3 1 2 4 4 2 1 4 1 2 3 4 2 1 3 1 1 3 3 4 4 3 1 4 2 4 1 1 3 4 4 4 1 1 2 2 4 1 1 4 4 
##  
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  22.6352409453921 
## 
## Centers:  
##         [,1]    [,2]
## [1,] -0.3595  1.1091
## [2,]  0.4607 -1.4912
## [3,]  1.4194  0.4693
## [4,] -1.1386 -0.5560
## 
## Cluster size:  
## [1] 23 15 17 20
## 
## Within-cluster sum of squares:  
## [1] 52.001 49.505 15.880  9.855
ggplot(ruspini_scaled |> 
         add_column(cluster = factor(cluster_spec)),
       aes(x, y, color = cluster)) + 
  geom_point()

7.6.4 Fuzzy C-Means Clustering

Fuzzy c-means is the fuzzy clustering version of the k-means clustering problem. Each data point has a degree of membership for each cluster.

library("e1071")

cluster_cmeans <- cmeans(as.matrix(ruspini_scaled), centers = 4)
cluster_cmeans
## Fuzzy c-means clustering with 4 clusters
## 
## Cluster centers:
##         x       y
## 1 -0.3763  1.1143
## 2 -1.1371 -0.5550
## 3  0.4552 -1.4760
## 4  1.5048  0.5161
## 
## Memberships:
##               1         2         3         4
##  [1,] 3.390e-02 0.0184315 0.0318371 9.158e-01
##  [2,] 2.683e-02 0.0130798 0.0205431 9.395e-01
##  [3,] 1.100e-01 0.0654667 0.1188610 7.056e-01
##  [4,] 9.817e-02 0.8288408 0.0452756 2.771e-02
##  [5,] 9.312e-02 0.7683796 0.0971244 4.138e-02
##  [6,] 8.622e-01 0.0758548 0.0256694 3.626e-02
##  [7,] 9.546e-03 0.9731667 0.0127762 4.511e-03
##  [8,] 1.483e-02 0.0061528 0.0087250 9.703e-01
##  [9,] 9.889e-01 0.0046259 0.0021818 4.268e-03
## [10,] 9.753e-01 0.0114717 0.0048175 8.409e-03
## [11,] 9.787e-01 0.0103280 0.0041355 6.879e-03
## [12,] 9.597e-01 0.0151615 0.0078892 1.728e-02
## [13,] 2.687e-02 0.9343464 0.0272600 1.153e-02
## [14,] 2.343e-02 0.0384835 0.8921799 4.591e-02
## [15,] 1.016e-02 0.0183426 0.9537436 1.775e-02
## [16,] 2.457e-03 0.0011603 0.0018407 9.945e-01
## [17,] 3.663e-02 0.8900877 0.0549646 1.831e-02
## [18,] 1.136e-02 0.0249347 0.9472522 1.646e-02
## [19,] 1.739e-01 0.1075081 0.1773901 5.412e-01
## [20,] 9.245e-01 0.0371563 0.0146304 2.371e-02
## [21,] 8.916e-01 0.0333605 0.0199925 5.510e-02
## [22,] 9.396e-01 0.0204596 0.0114691 2.852e-02
## [23,] 4.091e-02 0.0229225 0.0405208 8.957e-01
## [24,] 7.444e-03 0.0032213 0.0047480 9.846e-01
## [25,] 9.526e-03 0.0254625 0.9531437 1.187e-02
## [26,] 2.177e-01 0.1283064 0.1837577 4.703e-01
## [27,] 9.187e-01 0.0419859 0.0155199 2.380e-02
## [28,] 1.087e-02 0.0205930 0.9503614 1.818e-02
## [29,] 1.347e-03 0.0031668 0.9936328 1.853e-03
## [30,] 5.072e-04 0.0002500 0.0004109 9.988e-01
## [31,] 7.514e-01 0.0920641 0.0519039 1.047e-01
## [32,] 9.017e-03 0.0173387 0.9591007 1.454e-02
## [33,] 1.664e-03 0.0038610 0.9921769 2.298e-03
## [34,] 3.001e-02 0.9148709 0.0405049 1.461e-02
## [35,] 1.130e-02 0.0058690 0.0099612 9.729e-01
## [36,] 9.103e-01 0.0483879 0.0168066 2.448e-02
## [37,] 1.727e-02 0.0498112 0.9125435 2.038e-02
## [38,] 5.521e-02 0.8604295 0.0593827 2.497e-02
## [39,] 5.307e-02 0.8953172 0.0336329 1.798e-02
## [40,] 1.083e-02 0.0287381 0.9470194 1.341e-02
## [41,] 9.977e-01 0.0009642 0.0004512 8.878e-04
## [42,] 2.538e-02 0.9426475 0.0220730 9.897e-03
## [43,] 7.551e-01 0.0672204 0.0448198 1.329e-01
## [44,] 5.053e-05 0.0001096 0.9997656 7.424e-05
## [45,] 9.619e-02 0.0392619 0.0536219 8.109e-01
## [46,] 4.470e-04 0.9989329 0.0004363 1.838e-04
## [47,] 3.346e-03 0.0082407 0.9839904 4.422e-03
## [48,] 9.923e-01 0.0033856 0.0014976 2.768e-03
## [49,] 1.324e-03 0.0006092 0.0009437 9.971e-01
## [50,] 7.016e-01 0.0713856 0.0509694 1.760e-01
## [51,] 9.467e-01 0.0256411 0.0103562 1.729e-02
## [52,] 1.387e-02 0.0075739 0.0135384 9.650e-01
## [53,] 8.074e-03 0.0034561 0.0050418 9.834e-01
## [54,] 2.248e-02 0.9302629 0.0358208 1.143e-02
## [55,] 4.917e-02 0.8877345 0.0431078 1.999e-02
## [56,] 1.267e-01 0.0394178 0.0461292 7.877e-01
## [57,] 9.754e-01 0.0120663 0.0047505 7.762e-03
## [58,] 2.256e-02 0.9540764 0.0155193 7.844e-03
## [59,] 1.396e-03 0.0030791 0.9934965 2.029e-03
## [60,] 1.357e-02 0.9712826 0.0102384 4.907e-03
## [61,] 9.293e-01 0.0376080 0.0133145 1.973e-02
## [62,] 9.272e-01 0.0247817 0.0139384 3.405e-02
## [63,] 1.929e-02 0.0106791 0.0192351 9.508e-01
## [64,] 3.837e-02 0.8665080 0.0740350 2.108e-02
## [65,] 3.143e-03 0.9920954 0.0034025 1.359e-03
## [66,] 1.710e-02 0.9626089 0.0138085 6.485e-03
## [67,] 9.763e-01 0.0095410 0.0046349 9.540e-03
## [68,] 9.885e-01 0.0044831 0.0022370 4.751e-03
## [69,] 1.011e-02 0.0241498 0.9523630 1.337e-02
## [70,] 1.111e-02 0.0203874 0.9492340 1.926e-02
## [71,] 4.507e-02 0.9045022 0.0339790 1.645e-02
## [72,] 9.964e-01 0.0015597 0.0007050 1.321e-03
## [73,] 9.693e-01 0.0112508 0.0059273 1.354e-02
## [74,] 4.260e-02 0.8727568 0.0633457 2.130e-02
## [75,] 2.104e-02 0.9397664 0.0290866 1.010e-02
## 
## Closest hard clustering:
##  [1] 4 4 4 2 2 1 2 4 1 1 1 1 2 3 3 4 2 3 4 1 1 1 4 4 3 4 1 3
## [29] 3 4 1 3 3 2 4 1 3 2 2 3 1 2 1 3 4 2 3 1 4 1 1 4 4 2 2 4
## [57] 1 2 3 2 1 1 4 2 2 2 1 1 3 3 2 1 1 2 2
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership" 
## [5] "iter"        "withinerror" "call"

Plot membership (shown as small pie charts)

library("scatterpie")
## scatterpie v0.2.4 Learn more at https://yulab-smu.top/
## 
## Please cite:
## 
## D Wang, G Chen, L Li, S Wen, Z Xie, X Luo, L
## Zhan, S Xu, J Li, R Wang, Q Wang, G Yu. Reducing language
## barriers, promoting information absorption, and
## communication using fanyi. Chinese Medical Journal. 2024,
## 137(16):1950-1956. doi: 10.1097/CM9.0000000000003242
ggplot()  +
  geom_scatterpie(
    data = cbind(ruspini_scaled, cluster_cmeans$membership),
    aes(x = x, y = y), 
    cols = colnames(cluster_cmeans$membership), 
    legend_name = "Membership") + 
  coord_equal()

7.7 Outliers in Clustering*

Most clustering algorithms perform complete assignment (i.e., all data points need to be assigned to a cluster). Outliers will affect the clustering. It is useful to identify strong outliers and remove them prior to clustering. A density-based method to identify outliers is LOF (Local Outlier Factor). It is related to DBSCAN and compares the density around a point with the densities around its neighbors (you have to specify the neighborhood size \(k\)). The LOF value for a regular data point is close to 1. The larger the LOF value gets, the more likely the point is an outlier.

Add a clear outlier to the scaled Ruspini dataset that is 10 standard deviations above the average for the x axis.

ruspini_scaled_outlier <- ruspini_scaled |> add_case(x=10,y=0)

7.7.1 Visual inspection of the data

Outliers can be identified using summary statistics, histograms, scatterplots (pairs plots), boxplots, etc. Here we use a pairs plot (the diagonal contains smoothed histograms). The outlier is visible as the single separate point in the scatter plot and as the long tail of the smoothed histogram for x (we would expect most observations to fall in the range \([-3, 3]\) for normalized data).

library("GGally")
ggpairs(ruspini_scaled_outlier, progress = FALSE)

The outlier is a problem for k-means

km <- kmeans(ruspini_scaled_outlier, centers = 4, nstart = 10)
ruspini_scaled_outlier_km <- ruspini_scaled_outlier|>
  add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")

ggplot(ruspini_scaled_outlier_km, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = centroids, 
             aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)

This problem can be fixed by increasing the number of clusters and removing small clusters in a post-processing step or by identifying and removing outliers before clustering.
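
A minimal sketch of the post-processing idea (the choices of 5 clusters and a minimum cluster size of 3 are only illustrative): cluster with a larger k so the outlier ends up in its own small cluster and then drop very small clusters.

km5 <- kmeans(ruspini_scaled_outlier, centers = 5, nstart = 10)
small_clusters <- which(km5$size < 3)
ruspini_scaled_outlier |>
  add_column(cluster = factor(km5$cluster)) |>
  filter(!(cluster %in% small_clusters))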

7.7.2 Local Outlier Factor (LOF)

The Local Outlier Factor is related to the density concepts used by DBSCAN and can help to identify potential outliers. Calculate the LOF (I choose a local neighborhood size of 10 for density estimation).

lof <- lof(ruspini_scaled_outlier, minPts= 10)
lof
##  [1]  1.0842  1.0062  1.3682  1.2124  1.0830  1.1750  0.9778
##  [8]  0.9982  0.9716  0.9653  0.9903  1.0409  1.0509  1.2518
## [15]  1.0262  0.9377  1.0198  1.0238  1.5716  1.0680  1.2088
## [22]  1.0702  1.1375  1.0022  0.9957  1.5976  1.0340  1.0708
## [29]  1.0145  0.9385  1.6258  0.9758  0.9972  1.0828  1.0206
## [36]  1.0396  1.0214  1.2090  1.0881  0.9809  0.9319  1.0127
## [43]  1.3973  0.9939  1.1603  0.9411  0.9928  0.9427  0.9329
## [50]  1.4943  1.0213  1.0055  0.9920  0.9536  1.0852  1.1429
## [57]  0.9815  1.0200  0.9864  0.9175  1.0340  1.1045  1.0041
## [64]  1.0413  0.9825  1.0393  0.9850  1.0187  0.9625  1.0593
## [71]  1.0008  0.9336  1.0515  1.0234  0.9767 17.8995
ggplot(ruspini_scaled_outlier |> add_column(lof = lof), 
       aes(x, y, color = lof)) +
  geom_point() + 
  scale_color_gradient(low = "gray", high = "red")

Plot the points sorted by increasing LOF and look for a knee.

ggplot(tibble(index = seq_len(length(lof)), lof = sort(lof)), 
       aes(index, lof)) +
  geom_line() +
  geom_hline(yintercept = 1, color = "red", linetype = 2)

Choose a threshold above 1.

ggplot(ruspini_scaled_outlier |> add_column(outlier = lof >= 2), 
       aes(x, y, color = outlier)) +
  geom_point()

Analyze the found outliers (they might be interesting data points) and then cluster the data without them.

ruspini_scaled_clean <- ruspini_scaled_outlier  |> filter(lof < 2)

km <- kmeans(ruspini_scaled_clean, centers = 4, nstart = 10)
ruspini_scaled_clean_km <- ruspini_scaled_clean|>
  add_column(cluster = factor(km$cluster))
centroids <- as_tibble(km$centers, rownames = "cluster")

ggplot(ruspini_scaled_clean_km, aes(x = x, y = y, color = cluster)) + 
  geom_point() +
  geom_point(data = centroids, 
             aes(x = x, y = y, color = cluster), 
             shape = 3, size = 10)

There are many other outlier removal strategies available. See, e.g., package outliers.
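
As a small univariate sketch of one such strategy (assuming the outliers package is installed; it is not part of this chapter's package list), Grubbs' test checks whether the most extreme value of a single variable is an outlier:

library(outliers)
grubbs.test(ruspini_scaled_outlier$x)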

7.8 Exercises*

We will again use the Palmer penguin data for the exercises.

library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <chr>   <chr>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## 6 Adelie  Torgersen           39.3          20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>

Create an R Markdown file with the code and discussion for the exercises below.

  1. What features do you use for clustering? What about missing values? Discuss your answers. Do you need to scale the data before clustering? Why?
  2. What distance measure do you use to reflect similarities between penguins? See Measures of Similarity and Dissimilarity in Chapter 2.
  3. Apply k-means clustering. Use an appropriate method to determine the number of clusters. Compare the clustering using unscaled data and scaled data. What is the difference? Visualize and describe the results.
  4. Apply hierarchical clustering. Create a dendrogram and discuss what it means.
  5. Apply DBSCAN. How do you choose the parameters? How well does it work?