[Illustration: the same set of characters can be grouped in different ways, e.g. into females vs. males, or into Simpson's family vs. school characters.]
\(K\)-means only works for numerical data. Hence, if our data contains non-numerical columns, we either need to discard these or represent them by numerical values. The latter may be dubious, however, as it introduces a distance between the categories.
E.g. if the variable colour in a dataset has three levels, green, red and yellow, and we assign the numerical values green = 1, red = 2 and yellow = 3, we suddenly impose that green and yellow are twice as far apart as red is from either of the other two levels. This may have unwanted side-effects.
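If a categorical variable has to be kept, a safer option is to recode it as dummy (0/1) variables instead of a single numeric code. A minimal sketch (the data frame d and its values are made up for illustration):

d <- data.frame(colour = factor(c("green", "red", "yellow", "red")))
# model.matrix() expands the factor into one 0/1 indicator column per level
model.matrix(~ colour - 1, data = d)

With this coding the Euclidean distance between any two different colours is the same (\(\sqrt{2}\)), so no artificial ordering is imposed.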
Here is the same data analysed with \(K\)-means for \(K=2\), \(K=3\) and \(K=4\).
\(C_1, \ldots, C_K\) is a partitioning of \(\{1, \ldots, n\}\).
For instance, if the \(i\)th observation is in the \(k\)th cluster, then \(i \in C_k\).
Using different starting values
Use kmeans, see ?kmeans.
Important arguments:
- x: the data (as a matrix, data.frame or tibble)
- centers: if a single number, this is \(K\); if a matrix, it contains the starting centers to be used
- nstart: the number of re-runs of the algorithm, to decrease the effect of unfortunate initial centers (set nstart to e.g. 10 or 50); see the small sketch below

To evaluate a clustering we need the overall mean and the cluster means: \[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \quad\text{and}\quad \bar{x}_k = |C_k|^{-1}\sum_{i\in C_k} x_i \]
Total sum of squares and within sums of squares for cluster \(k\): \[ SS_{TOT} = \sum_{i=1}^n (x_i-\bar{x})^2 \quad\text{and}\quad SS_{W_k} = \sum_{i\in C_k} (x_i-\bar{x}_k)^2 \] Within sums of squares and between sums of squares: \[ SS_W = \sum_{k=1}^K SS_{W_k} \quad\text{and}\quad SS_B = SS_{TOT} - SS_W \]
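A small sketch of how these quantities appear in the output of kmeans (the data x is simulated here purely for illustration):

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)      # 100 observations, 2 variables
km <- kmeans(x, centers = 3, nstart = 10)  # K = 3 with 10 random restarts
km$totss                                   # SS_TOT
km$tot.withinss                            # SS_W = sum(km$withinss)
km$betweenss                               # SS_B = SS_TOT - SS_W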
By increasing \(K\), \(SS_W\) can be made arbitrarily small and \(SS_B\) correspondingly large, so neither can be used directly to choose \(K\).
CH Index: Calinski and Harabasz (1974): “A dendrite method for cluster analysis”. \[ CH(K) = \frac{B(K)/(K - 1)}{W(K)/(n - K)} = \frac{SS_B/(K-1)}{SS_W/(n-K)} \] where \(B(K)\) and \(W(K)\) are between-sum-of-squares and within-sum-of-squares, respectively, for \(K\) clusters.
Choose \(K\) so that \(CH(K)\) is as large as possible.
This is similar to looking for the “elbow” effect in a plot of the drop in \(SS_W\).
iris_noclass <- iris %>% as_tibble() %>% select(-Species)  # drop the class label before clustering
iris %>% # plot the original data, coloured by the true species
ggplot(aes(x = Petal.Length, y = Petal.Width)) +
geom_point(aes(colour = Species))
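# K-means for K = 2, ..., 7, each with nstart = 10 random restarts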
i2 <- iris_noclass %>% kmeans(2, nstart = 10)
i3 <- iris_noclass %>% kmeans(3, nstart = 10)
i4 <- iris_noclass %>% kmeans(4, nstart = 10)
i5 <- iris_noclass %>% kmeans(5, nstart = 10)
i6 <- iris_noclass %>% kmeans(6, nstart = 10)
i7 <- iris_noclass %>% kmeans(7, nstart = 10)
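# collect SS_B, SS_W, the CH index and the relative drop in SS_W for each K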
iris_kmeans <- tibble(K = 2:7,
n = nrow(iris_noclass),
SS_B = c(i2$betweenss, i3$betweenss, i4$betweenss, i5$betweenss, i6$betweenss, i7$betweenss),
SS_W = c(i2$tot.withinss, i3$tot.withinss, i4$tot.withinss, i5$tot.withinss, i6$tot.withinss, i7$tot.withinss),
CH = (SS_B/(K-1))/(SS_W/(n-K)),
SS_ratio = SS_B/(SS_B+SS_W),
SS_W_drop = c(NA,(SS_W[-length(SS_W)] - SS_W[-1])/SS_W[-length(SS_W)])
)
ggplot(iris_kmeans, aes(x = K, y = CH)) + geom_line()
(iris_drop <- ggplot(iris_kmeans, aes(x = K, y = SS_W)) + geom_line())
iris_drop + geom_label(data = iris_kmeans %>% filter(K>2), aes(label = paste0(round(SS_W_drop*100,2), "%")))
Other clustering approaches in R:

- pam in the cluster package
- library(mclust); vignette("mclust")
- library(dbscan)

To determine which objects are alike, and which are different, we need a (dis)similarity measure.
Let \(x_i\) and \(x_j\) be the \(i\)th and \(j\)th observation, respectively. A dissimilarity measure, \(d\), must satisfy \(d_{ij} \ge 0\), \(d_{ii} = 0\) and \(d_{ij} = d_{ji}\) (symmetry).
(data <- data.frame(x1 = c(0,1,3), x2=c(0,1,4)))
x1 x2
1 0 0
2 1 1
3 3 4
(res <- dist(data, method="euclidean", diag=TRUE, upper=TRUE))
1 2 3
1 0.000000 1.414214 5.000000
2 1.414214 0.000000 3.605551
3 5.000000 3.605551 0.000000
class(res)
[1] "dist"
To get access to more (dis)similarity measures in R, the proxy package can be attached.
library(proxy)
summary(pr_DB)
* Similarity measures:
Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
Stiles, Tanimoto, Tschuprow, Yule, Yule2
* Distance measures:
Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker
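For instance, with proxy attached, dist() accepts any distance from the registry above; a small sketch with the toy data from before:

dist(data, method = "supremum")  # supremum (maximum) of the coordinate-wise absolute differences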
Single linkage (nearest neighbour): the minimal distance between the observations from different clusters: \[ d_\text{Single linkage}(C_k, C_{k'}) = \min_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]
Complete linkage (most remote neighbour): the maximal distance between the observations from different clusters: \[ d_\text{Complete linkage}(C_k, C_{k'}) = \max_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]
Average linkage: average distance between all pairs (one in each cluster) \[ d_\text{Average linkage}(C_k, C_{k'}) = \frac{1}{|C_k|{\cdot}|C_{k'}|}\sum_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]
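A minimal sketch of how the three linkages can be compared in R (using the iris_noclass tibble from above; hclust() takes a dist object):

d_iris <- dist(iris_noclass)                     # Euclidean dissimilarities
hc_single   <- hclust(d_iris, method = "single")
hc_complete <- hclust(d_iris, method = "complete")
hc_average  <- hclust(d_iris, method = "average")
plot(hc_average)                                 # dendrogram for average linkage
cutree(hc_average, k = 3)                        # cluster labels when cutting into 3 clusters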
In R, hierarchical clustering is done with hclust and the tree is cut into clusters with cutree (as in the sketch above).

If we combine PCA with other techniques we often obtain better results. This is due to the sphering of the data done by PCA. E.g. for classification/regression trees the splits are always perpendicular to the axes (the variables in the data). However, splitting may be more efficient along a sloped line through the data points. The PCA-rotated data is therefore at times better to use, both for classification and for clustering.
Crabs
The dataset data(crabs, package = "MASS") contains five morphological measurements on 200 crabs of the species Leptograpsus variegatus. There are 50 female and 50 male crabs of each of two colours, blue and orange, i.e. four groups of 50 crabs each.
First we do a PCA on the numerical data
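For example (a sketch; the object names and the choice to scale are ours, not from the source):

data(crabs, package = "MASS")
crabs_num <- crabs %>% select(FL, RW, CL, CW, BD)   # the five measurements
crabs_pca <- crabs_num %>% prcomp(scale. = TRUE)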
Then we join the PCA rotated data with the sample information
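Continuing the sketch, the scores can be combined with the sample information, here as a combined label sp_sex (matching the tables further below):

crabs_aug <- crabs %>%
  select(sp, sex) %>%
  mutate(sp_sex = paste(sp, sex, sep = "_")) %>%
  bind_cols(as_tibble(crabs_pca$x))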
Then we make a plot of the first three PCs
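One possibility for looking at the first three PCs (colouring by the four groups is an assumption about what the original plot showed):

pairs(crabs_pca$x[, 1:3], col = as.integer(factor(crabs_aug$sp_sex)), pch = 19)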
The \(K\)-means algorithm can then better identify the clusters in the data.
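A sketch of the clustering step, here run on all five PCA scores (which PCs to use is a choice the source does not show):

crabs_km <- kmeans(crabs_pca$x, centers = 4, nstart = 10)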
If we compare to the true structure in the data:
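A cross-tabulation like the ones below can be produced along these lines (cluster labels are arbitrary, so the exact tables will differ between runs):

table(cluster = crabs_km$cluster, sp_sex = crabs_aug$sp_sex)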
sp_sex
cluster B_F B_M O_F O_M
1 14 17 17 14
2 17 10 4 6
3 18 16 14 19
4 1 7 15 11
sp_sex
cluster B_F B_M O_F O_M
1 3 7 3 50
2 1 0 46 0
3 46 7 1 0
4 0 36 0 0
K-means: kmeans in R (a model-based alternative is mclust).

Generally, clustering involves important choices with many options and large implications:
Dendrograms:
Only the vertical distances are important; the horizontal ordering is (almost) arbitrary, as there exist \(2^{n-1}\) possible orderings/permutations of the leaves (the one shown is chosen to be pretty).
Play around with dendrograms using library(dendextend)
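For instance (a small sketch using the hclust object hc_average from the iris example above; color_branches() is one of the dendextend helpers):

library(dendextend)
dend <- as.dendrogram(hc_average)
plot(color_branches(dend, k = 3))   # colour the branches of a 3-cluster solution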
?kmeans
?cutree
plot(hc) and use rect.hclust
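For example, with the hclust object from the iris sketch above:

plot(hc_average)
rect.hclust(hc_average, k = 3, border = "red")   # draw boxes around the 3 clusters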