Torben Tvedebrink
May, 2019
(Introductory example figures: the same characters grouped in two ways – as Females | Males, and as Simpson's | School characters.)
\(K\)-means only works for numerical data. Hence, if our data contains non-numerical columns, we either need to discard these or represent them by numerical values. However, the latter may be dubious, as we introduce a distance between categories. E.g. if the variable `colour` in a dataset contains three levels, `green`, `red` and `yellow`, then assigning numerical values to these, e.g. `green = 1`, `red = 2` and `yellow = 3`, suddenly imposes that the distance between `green` and `yellow` is twice the distance between `red` and either of the other two levels. This may have unwanted side-effects.
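A small sketch of the problem (the `colour` variable below is made up for illustration): coding the three levels as 1, 2 and 3 and computing distances imposes a spacing the categories do not actually have.

colour <- factor(c("green", "red", "yellow"))
colour_num <- as.numeric(colour)  # green = 1, red = 2, yellow = 3
dist(colour_num)                  # green-red = 1, red-yellow = 1, but green-yellow = 2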
Here is the same data analysed with \(K\)-means for \(K=2\), \(K=3\) and \(K=4\).
\(C_1, \ldots, C_K\) is a partitioning of \(\{1, \ldots, n\}\).
For instance, if the \(i\)th observation is in the \(k\)th cluster, then \(i \in C_k\).
Using different starting values
Use `kmeans`, see `?kmeans`.

Important arguments (a short usage sketch follows the list):

- `x`: the data (as `matrix`, `data.frame` or `tibble`)
- `centers`: if a number, this is \(K\); if a matrix, the rows are used as starting centers
- `nstart`: the number of re-runs of the algorithm to reduce the effect of unfortunate initial centers (set `nstart` to `10` or `50`)
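A minimal usage sketch of both forms of `centers` (the toy data and starting values below are made up):

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)       # toy data: 100 points in two dimensions

km1 <- kmeans(x, centers = 3, nstart = 10)  # centers as a number: K = 3, 10 restarts
start <- x[c(1, 50, 100), ]                 # centers as a matrix: one row per cluster
km2 <- kmeans(x, centers = start)

km1$cluster                                 # cluster membership for each observation
km1$tot.withinss                            # total within-cluster sum of squares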
Define the overall mean and the mean of cluster \(k\):
\[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \quad\text{and}\quad \bar{x}_k = |C_k|^{-1}\sum_{i\in C_k} x_i \]
Total sum of squares and within sums of squares for cluster \(k\): \[ SS_{TOT} = \sum_{i=1}^n (x_i-\bar{x})^2 \quad\text{and}\quad SS_{W_k} = \sum_{i\in C_k} (x_i-\bar{x}_k)^2 \] Within sums of squares and between sums of squares: \[ SS_W = \sum_{k=1}^K SS_{W_k} \quad\text{and}\quad SS_B = SS_{TOT} - SS_W \]
\(SS_W\) can be made arbitrarily small (it decreases as \(K\) grows) and \(SS_B\) increases with \(K\), so neither can be used directly to choose \(K\).
CH Index: Calinski and Harabasz (1974): “A dendrite method for cluster analysis”. \[ CH(K) = \frac{B(K)/(K - 1)}{W(K)/(n - K)} = \frac{SS_B/(K-1)}{SS_W/(n-K)} \] where \(B(K)\) and \(W(K)\) are between-sum-of-squares and within-sum-of-squares, respectively, for \(K\) clusters.
Choose \(K\) such that \(CH(K)\) is as large as possible.
This is similar to looking for the “elbow” effect in a plot of the drop in \(SS_W\).
iris_noclass <- iris %>% as_tibble() %>% select(-Species)  # drop the class label

# Plot the petal measurements coloured by the true species
iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(colour = Species))
# i_k <- map(2:7, ~ kmeans(x = iris_noclass, centers = .x, nstart = 10))
# Run K-means for K = 2, ..., 7 with 10 random restarts each
i2 <- iris_noclass %>% kmeans(2, nstart = 10)
i3 <- iris_noclass %>% kmeans(3, nstart = 10)
i4 <- iris_noclass %>% kmeans(4, nstart = 10)
i5 <- iris_noclass %>% kmeans(5, nstart = 10)
i6 <- iris_noclass %>% kmeans(6, nstart = 10)
i7 <- iris_noclass %>% kmeans(7, nstart = 10)
# Summarise each fit: between/within SS, CH index and relative drop in SS_W
iris_kmeans <- tibble(K = 2:7,
n = nrow(iris_noclass),
SS_B = c(i2$betweenss, i3$betweenss, i4$betweenss, i5$betweenss, i6$betweenss, i7$betweenss),
SS_W = c(i2$tot.withinss, i3$tot.withinss, i4$tot.withinss, i5$tot.withinss, i6$tot.withinss, i7$tot.withinss),
CH = (SS_B/(K-1))/(SS_W/(n-K)),
SS_ratio = SS_B/(SS_B+SS_W),
SS_W_drop = c(NA,(SS_W[-length(SS_W)] - SS_W[-1])/SS_W[-length(SS_W)])
)
ggplot(iris_kmeans, aes(x = K, y = CH)) + geom_line()
(iris_drop <- ggplot(iris_kmeans, aes(x = K, y = SS_W)) + geom_line())
iris_drop + geom_label(data = iris_kmeans %>% filter(K>2), aes(label = paste0(round(SS_W_drop*100,2), "%")))
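The six separate calls above can also be collapsed via the commented-out `purrr::map` idea; a sketch (assuming `purrr` and the tidyverse packages used above are loaded) that builds the same summary table:

library(purrr)
Ks <- 2:7
i_k <- map(Ks, ~ kmeans(x = iris_noclass, centers = .x, nstart = 10))
iris_kmeans_map <- tibble(K = Ks,
                          n = nrow(iris_noclass),
                          SS_B = map_dbl(i_k, "betweenss"),
                          SS_W = map_dbl(i_k, "tot.withinss"),
                          CH = (SS_B/(K-1))/(SS_W/(n-K)))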
Other clustering functions in R (a `pam` sketch follows below):

- `pam` (\(K\)-medoids) in the `cluster` package
- `library(mclust); vignette("mclust")` for model-based clustering
- `library(dbscan)` for density-based clustering
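As a quick illustration of the \(K\)-medoids alternative, a sketch of `pam` on the iris measurements from earlier (assuming the `cluster` package is installed):

library(cluster)
iris_pam <- pam(iris_noclass, k = 3)      # K-medoids with K = 3
iris_pam$medoids                          # the three medoid observations
table(iris_pam$clustering, iris$Species)  # compare clusters to the true species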
To determine which objects are alike – and which are different – we need a (dis)similarity measure.

Let \(x_i\) and \(x_j\) be the \(i\)th and \(j\)th observation, respectively. A dissimilarity measure, \(d\), must satisfy:

- \(d(x_i, x_j) \ge 0\) (non-negativity)
- \(d(x_i, x_i) = 0\)
- \(d(x_i, x_j) = d(x_j, x_i)\) (symmetry)
(data <- data.frame(x1 = c(0,1,3), x2=c(0,1,4)))
x1 x2
1 0 0
2 1 1
3 3 4
(res <- dist(data, method="euclidean", diag=TRUE, upper=TRUE))
1 2 3
1 0.000000 1.414214 5.000000
2 1.414214 0.000000 3.605551
3 5.000000 3.605551 0.000000
class(res)
[1] "dist"
To extend the set of (dis)similarity measures available in R, the `proxy` package can be attached.
library(proxy)
summary(pr_DB)
* Similarity measures:
Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
Stiles, Tanimoto, Tschuprow, Yule, Yule2
* Distance measures:
Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker
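A brief sketch of using the attached measures: `proxy` masks `dist` and also provides `simil`, so the measures listed above can be requested by name (reusing the small `data` example from before):

dist(data, method = "Manhattan")   # a distance measure from the list above
simil(data, method = "cosine")     # a similarity measure from the list above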
Single linkage (nearest neighbour): the minimal distance between the observations from different clusters: \[ d_\text{Single linkage}(C_k, C_{k'}) = \min_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]
Complete linkage (most remote neighbour): the maximal distance between the observations from different clusters: \[ d_\text{Complete linkage}(C_k, C_{k'}) = \max_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]
Average linkage: average distance between all pairs (one in each cluster) \[ d_\text{Average linkage}(C_k, C_{k'}) = \frac{1}{|C_k|{\cdot}|C_{k'}|}\sum_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]
In R, `hclust` fits the hierarchical clustering and `cutree` cuts the resulting tree into a chosen number of clusters (see the sketch below).
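A minimal sketch tying these together on the iris measurements from earlier: compute a distance matrix, fit hierarchical clusterings with the three linkages, and cut one tree into three clusters.

d <- dist(iris_noclass)                 # Euclidean distances
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
clusters <- cutree(hc_complete, k = 3)  # cut the complete-linkage tree into K = 3 clusters
table(clusters, iris$Species)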
If we combine PCA with other techniques we often obtain better results. This is due to the sphering of the data done by PCA. E.g. for classification/regression trees the splits are always perpendicular to the axes (the variables in the data). However, splitting may be more efficient if done using a sloped line through the data points. The PCA-rotated data is therefore at times better to use – both for classification and for clustering.
The dataset data(crabs, package = "MASS")
contains five measurements done on 200 crabs of the species Leptograpsus variegatus. There are 50 female and 50 male crabs from each of two colours: blue and orange. Hence four groups of 50 crabs each.
First we do a PCA on the numerical data, then join the PCA-rotated data with the sample information, and then plot the first three PCs. The \(K\)-means algorithm can then better identify the clusters in the data (a sketch of the workflow follows).
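A sketch of this workflow (the preprocessing details – whether to scale the measurements and how many PCs to pass to \(K\)-means – are assumptions and may differ from the original analysis):

data(crabs, package = "MASS")
library(dplyr)

crabs_meas <- crabs %>% select(FL, RW, CL, CW, BD)   # the five measurements

crabs_pca <- prcomp(crabs_meas, scale. = TRUE)       # PCA on the numerical data

crabs_rot <- bind_cols(crabs %>% select(sp, sex),    # join rotated data with sample info
                       as_tibble(crabs_pca$x)) %>%
  mutate(sp_sex = paste(sp, sex, sep = "_"))

pairs(crabs_pca$x[, 1:3], col = factor(crabs_rot$sp_sex))  # plot of the first three PCs

km_raw <- kmeans(crabs_meas, centers = 4, nstart = 10)              # K-means on raw data
km_pca <- crabs_rot %>% select(PC1:PC3) %>% kmeans(4, nstart = 10)  # K-means on PCA scores

table(cluster = km_raw$cluster, sp_sex = crabs_rot$sp_sex)
table(cluster = km_pca$cluster, sp_sex = crabs_rot$sp_sex)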
If we compare to the true structure in the data, the results for the raw data are not that impressive:
sp_sex
cluster B_F B_M O_F O_M
1 17 10 4 6
2 18 16 14 19
3 1 7 15 11
4 14 17 17 14
Repeating the analysis on the PCA-rotated data, the results are much better:
sp_sex
cluster B_F B_M O_F O_M
1 46 7 1 0
2 0 36 0 0
3 1 0 46 0
4 3 7 3 50
K-means (see also `mclust` for model-based clustering).

Generally, these methods involve important choices with many options and large implications: the number of clusters, the dissimilarity measure and the linkage.
Dendrograms:
Only the vertical distances are important – the horizontal ordering is (almost) arbitrary, as there exist \(2^{n-1}\) orderings/permutations (the one shown is chosen to be pretty).
Play around with dendrograms using library(dendextend)
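For example, a small sketch colouring the branches by cluster (assuming `dendextend` is installed and reusing the iris clustering from above):

library(dendextend)
hc <- hclust(dist(iris_noclass), method = "complete")
dend <- color_branches(as.dendrogram(hc), k = 3)  # colour branches by K = 3 clusters
plot(dend)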
Useful help pages and functions:

- `?kmeans`
- `?cutree`
- `plot(hc)` and use `rect.hclust` (a sketch follows below)
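A sketch of the last point, highlighting \(K = 3\) clusters on a dendrogram (again using the iris measurements):

hc <- hclust(dist(iris_noclass), method = "complete")
plot(hc, labels = FALSE)
rect.hclust(hc, k = 3, border = 2:4)  # draw boxes around the K = 3 clusters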