A Comparison of Seriation Methods

Author

Michael Hahsler

Introduction

This document compares the seriation methods available in package seriation using the popular Iris data set and randomize the order of the objects.

library("seriation")

data("iris")
x <- as.matrix(iris[sample(nrow(iris)), -5])
d <- dist(x)

Distance seriation

We first register more seriation methods. Some of these methods require the installation of more packages.

register_DendSer()
register_optics()
register_smacof()

The following methods will be used (a few slow methods are skipped).

methods <- sort(list_seriation_methods("dist"))

# skip slow methods
methods <- setdiff(methods, c("BBURCG", "BBWRCG", "Enumerate", "GSA"))
methods

 [1] "ARSA"           "DendSer"        "DendSer_ARc"    "DendSer_BAR"   
 [5] "DendSer_LPL"    "DendSer_PL"     "GW"             "GW_average"    
 [9] "GW_complete"    "GW_single"      "GW_ward"        "HC"            
[13] "HC_average"     "HC_complete"    "HC_single"      "HC_ward"       
[17] "Identity"       "isomap"         "isoMDS"         "MDS"           
[21] "MDS_angle"      "MDS_smacof"     "metaMDS"        "monoMDS"       
[25] "OLO"            "OLO_average"    "OLO_complete"   "OLO_single"    
[29] "OLO_ward"       "optics"         "QAP_2SUM"       "QAP_BAR"       
[33] "QAP_Inertia"    "QAP_LS"         "R2E"            "Random"        
[37] "Reverse"        "Sammon_mapping" "SGD"            "Spectral"      
[41] "Spectral_norm"  "SPIN_NH"        "SPIN_STS"       "TSP"           
[45] "VAT"

Details about the method can be found in the manual page for seriate().

We use a loop to run the function seriate() with each method and calculate criterion measures which indicate how good the order is.

orders <- list()
criterion <- list()
for (m in methods) {
  cat(m)
  tm <- system.time(orders[[m]] <- seriate(d, method = m))
  criterion[[m]] <- data.frame(time = tm[1]+tm[2], rbind(criterion(d, orders[[m]]))) 
  cat(" took", tm[1]+tm[2], "sec.\n")
}

ARSA took 1.008 sec.
DendSer took 1.144 sec.
DendSer_ARc took 0.924 sec.
DendSer_BAR took 1.152 sec.
DendSer_LPL took 1.029 sec.
DendSer_PL took 1.04 sec.
GW took 0.158 sec.
GW_average took 0.157 sec.
GW_complete took 0.159 sec.
GW_single took 0.159 sec.
GW_ward took 0.158 sec.
HC took 0.163 sec.
HC_average took 0.158 sec.
HC_complete took 0.158 sec.
HC_single took 0.157 sec.
HC_ward took 0.159 sec.
Identity took 0.157 sec.
isomap

Registered S3 method overwritten by 'vegan':
  method         from 
  reorder.hclust gclus

 took 0.325 sec.
isoMDS took 0.491 sec.
MDS took 0.199 sec.
MDS_angle took 0.206 sec.
MDS_smacof took 0.398 sec.
metaMDS took 2.353 sec.
monoMDS took 0.175 sec.
OLO took 0.165 sec.
OLO_average took 0.17 sec.
OLO_complete took 0.17 sec.
OLO_single took 0.163 sec.
OLO_ward took 0.167 sec.
optics took 0.183 sec.
QAP_2SUM took 0.275 sec.
QAP_BAR took 0.205 sec.
QAP_Inertia took 0.202 sec.
QAP_LS took 0.207 sec.
R2E took 0.238 sec.
Random took 0.172 sec.
Reverse took 0.176 sec.
Sammon_mapping took 0.472 sec.
SGD took 156.591 sec.
Spectral took 0.233 sec.
Spectral_norm took 0.253 sec.
SPIN_NH took 4.092 sec.
SPIN_STS took 2.458 sec.
TSP took 0.175 sec.
VAT took 0.168 sec.

criterion <- do.call(rbind, criterion)

We align the seriation orders. The reason is that an order 1, 2, 3 and 3, 2, 1 are equivalent and just an artifact of the algorithm. Aligning will reverse some orders so they are better aligned. Then we sort the orders from best to worst according to a popular seriation criterion measure called Gradient_weighted.

orders <- ser_align(orders)
best_to_worse <- order(criterion[["Gradient_weighted"]], decreasing = TRUE)

orders <- orders[best_to_worse]
criterion <- criterion[best_to_worse, ]

Comparison between methods

We can compare the seriation methods by how similar the orders are that they produce (measured using Spearman). The following code calculates distances between orders and then performs hierarchical clustering.

dst <- ser_dist(orders) 
hc <- permute(hclust(dst), order = "OLO", dist = dst)
plot(hc)

The reordered dendrogram clearly shows a group of methods based on hierarchical clustering focused on path length and another group that tries to optimize the other seriation measures.

Here is a table to compare the seriation methods on different criterion measures. Use the interactive table to sort the methods given different measures. Note that some are maximized and some should be minimized. Details about the measures can be found in the manual page for criterion().

library(DT)
datatable(round(criterion, 2), extensions = "FixedColumns",
    options = list(paging = TRUE, searching = TRUE, info = FALSE,
      sort = TRUE, scrollX = TRUE, fixedColumns = list(leftColumns = 1))) %>%
    formatRound(columns = colnames(criterion) , mark = "", digits=1)

Show entries

Search:

	time	X2SUM	AR_deviations	AR_events	ARc	BAR	Gradient_raw	Gradient_weighted	Inertia	Lazy_path_length	Least_squares	LS	MDS_stress	ME	Moore_stress	Neumann_stress	Path_length	RGAR	Rho	smacof_stress0
ARSA	1.0	17811232.6	9438.6	54823.0	2243448.6	56363.0	992260.0	1772128.3	356949760.5	6508.6	76487634.7	4486897.3	0.1	7268.3	1039.1	480.3	86.1	0.1	0.9	0.1
QAP_LS	0.2	17811232.6	9438.6	54823.0	2243448.6	56363.0	992260.0	1772128.3	356949760.5	6508.6	76487634.7	4486897.3	0.2	7268.3	1039.1	480.3	86.1	0.1	0.9	0.2
metaMDS	2.4	17813070.0	9535.6	54694.0	2243481.1	56459.7	992515.0	1772030.7	356956407.5	6604.7	76487764.7	4486962.3	0.2	7262.9	1064.9	502.0	88.3	0.1	0.9	0.2
isoMDS	0.5	17813380.5	9578.1	54655.0	2243500.5	56425.3	992593.0	1771972.6	356964521.1	6841.2	76487842.3	4487001.0	0.1	7259.4	1076.2	515.8	90.4	0.1	0.9	0.1
MDS_smacof	0.4	17812731.4	9471.2	55069.0	2243518.2	56310.2	991764.0	1771919.7	356954170.3	6526.6	76487912.8	4487036.3	0.1	7274.2	981.9	456.7	85.2	0.1	0.9	0.1
QAP_Inertia	0.2	17816748.4	9681.5	54397.0	2243561.5	56529.6	993114.0	1771789.8	356979957.4	6386.8	76488086.0	4487122.9	0.2	7260.9	1076.7	510.0	88.9	0.1	0.9	0.2
SGD	156.6	17818092.7	9765.0	54219.0	2243659.9	56521.6	993468.0	1771494.5	356950919.2	6647.7	76488479.6	4487319.7	0.2	7262.6	1063.3	503.4	88.5	0.1	0.9	0.2
Spectral	0.2	17821593.3	9887.4	54924.0	2243682.3	56610.0	992058.0	1771427.2	356910342.6	6987.1	76488569.5	4487364.7	0.2	7253.7	1111.7	538.8	91.3	0.1	0.9	0.2
Sammon_mapping	0.5	17803289.9	9704.7	61996.0	2244220.0	55760.1	977903.0	1769814.2	356624718.4	6477.7	76490720.1	4488440.0	0.1	7292.6	854.1	396.5	81.3	0.1	0.9	0.1
QAP_2SUM	0.3	17800666.3	9735.0	62308.0	2244230.8	55820.4	977282.0	1769781.8	356638514.4	6608.7	76490763.2	4488461.5	0.2	7283.4	881.9	419.9	84.7	0.1	0.9	0.2

Previous1 2 3 4 5Next

Visualize the results

Plot the reordered dissimilarity matrices. Dark blocks along the main diagonal mean that the order reveals a “cluster” of similar objects. The Iris dataset contains three species, but two of them are very similar, so we expect to see one smaller block and one larger block.

for (n in names(orders))
  pimage(d, orders[[n]], main = n , key = FALSE)

Matrix seriation

Matrix seriation reorders rows and columns of a data matrix. We perform the same steps as for distances in the previous section.

methods <- sort(list_seriation_methods("matrix"))

# AOE if for correlation matrices only
methods <- setdiff(methods, c("AOE"))
methods

 [1] "BEA"       "BEA_TSP"   "CA"        "Heatmap"   "Identity"  "LLE"      
 [7] "Mean"      "PCA"       "PCA_angle" "Random"    "Reverse"

Performing seriation.

orders <- list()
criterion <- list()
for (m in methods) {
  cat(m)
  tm <- system.time(orders[[m]] <- seriate(x, method = m))
  criterion[[m]] <- data.frame(time = tm[1]+tm[2], rbind(criterion(x, orders[[m]])))
  cat(" took", tm[1]+tm[2], "sec.\n")
}

BEA took 1.766 sec.
BEA_TSP took 1.691 sec.
CA took 0.137 sec.
Heatmap took 0.39 sec.
Identity took 0.128 sec.
LLE took 0.385 sec.
Mean took 0.146 sec.
PCA took 0.133 sec.
PCA_angle took 0.129 sec.
Random took 0.127 sec.
Reverse took 0.128 sec.

criterion <- do.call(rbind, criterion)

Comparison between methods

datatable(round(criterion, 2), extensions = "FixedColumns",
    options = list(paging = TRUE, searching = TRUE, info = FALSE,
      sort = TRUE, scrollX = TRUE, fixedColumns = list(leftColumns = 1))) %>%
    formatRound(columns = colnames(criterion) , mark = "", digits = 1)

Show entries

Search:

	time	Cor_R	ME	Moore_stress	Neumann_stress
BEA	1.8	0.1	16517.7	19784.5	6666.0
BEA_TSP	1.7	-0.1	16523.1	19774.0	6644.4
CA	0.1	0.2	16401.6	20347.3	6934.2
Heatmap	0.4	-0.0	15130.2	12980.9	4395.2
Identity	0.1	-0.0	14092.4	22001.0	8684.8
LLE	0.4	-0.1	15111.5	13074.1	4486.0
Mean	0.1	-0.1	15106.8	12999.3	4533.2
PCA	0.1	-0.1	15120.5	13052.2	4484.8
PCA_angle	0.1	0.0	14291.7	16317.3	5564.0
Random	0.1	0.0	14478.0	16792.4	7122.0

Previous1 2Next

Visualize the results

best_to_worse <- order(criterion[["Moore_stress"]], decreasing = FALSE)

orders <- orders[best_to_worse]
criterion <- criterion[best_to_worse, ]

for (n in names(orders))
  pimage(x, orders[[n]], main = n , key = FALSE)