Clustering

Torben Tvedebrink

May, 2019

Introduction

Terminology

Unsupervised learning

Clustering

PCA vs clustering

Clustering for market segmentation

Clustering is subjective

Females Males

Clustering is subjective

Simpson’s School characters

Two clustering methods

\(K\)-means

\(K\)-means only works for numerical data. Hence, if our data contain non-numerical columns, we either need to discard these or represent them by numerical values. However, the latter may be dubious, as it introduces a distance between categories.

E.g. if the variable colour in a dataset contains three levels (green, red and yellow) and we assign numerical values to these, e.g. green = 1, red = 2 and yellow = 3, then we suddenly impose that the distance between green and yellow is twice the distance between red and either of the other two levels. This may have unwanted side-effects.
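
A quick sketch of the imposed distances, using base R's dist() on the arbitrary codes:

# Arbitrary numerical codes impose distances between the colour levels
colour_coded <- c(green = 1, red = 2, yellow = 3)
dist(colour_coded)
# the green-yellow distance is 2, i.e. twice the green-red and red-yellow distances of 1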

\(K\)-means

Here is the same data analysed with \(K\)-means for \(K=2\), \(K=3\) and \(K=4\).

A partitioning

\(C_1, \ldots, C_K\) is a partitioning of \(\{1, \ldots, n\}\): the clusters are disjoint and every observation belongs to exactly one of them.

For instance, if the \(i\)th observation is in the \(k\)th cluster, then \(i \in C_k\).

Idea behind \(K\)-means

Within-cluster variation

\(K\)-means clustering algorithm

  1. Randomly assign a number, from \(1\) to \(K\), to each of the observations (initial cluster assignments for the observations)
  2. Iterate until the cluster assignments stop changing:
    • For each of the \(K\) clusters, compute the cluster centroid. The \(k\)th cluster centroid is the vector of the \(p\) feature means for the observations in the \(k\)th cluster.
    • Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance); a minimal sketch of this loop is given below.
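
A minimal sketch of the algorithm in base R, for illustration only (use kmeans() in practice; empty clusters are not handled here):

kmeans_sketch <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  cluster <- sample(1:K, nrow(x), replace = TRUE)  # step 1: random initial assignment
  for (iter in 1:max_iter) {
    # step 2a: the K cluster centroids (columns) as the p feature means per cluster
    centroids <- sapply(1:K, function(k) colMeans(x[cluster == k, , drop = FALSE]))
    # step 2b: squared Euclidean distance from every observation to every centroid
    d <- sapply(1:K, function(k) colSums((t(x) - centroids[, k])^2))
    new_cluster <- max.col(-d)               # closest centroid per observation
    if (all(new_cluster == cluster)) break   # assignments stopped changing
    cluster <- new_cluster
  }
  cluster
}

kmeans_sketch(iris[, 1:4], K = 3)

The sketch only returns a vector of cluster labels; kmeans() additionally returns the centroids and the sums of squares used on the following slides.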

Properties of the algorithm

Example

Example

Using different starting values

In R

Use kmeans, see ?kmeans.

Important arguments:

  • centers: the number of clusters \(K\) (or a matrix of initial cluster centres)
  • nstart: how many random initial assignments to try (the best solution is kept)
  • iter.max: the maximum number of iterations
  • algorithm: "Hartigan-Wong" (the default), "Lloyd", "Forgy" or "MacQueen"
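
A small illustrative call, showing the components used on the next slides (x and km are just example names):

set.seed(1)
x  <- matrix(rnorm(100 * 2), ncol = 2)
km <- kmeans(x, centers = 3, nstart = 20, iter.max = 50)
km$cluster       # cluster assignment for each observation
km$centers       # the K cluster centroids
km$tot.withinss  # within sum of squares, SS_W
km$betweenss     # between sum of squares, SS_B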

Sums of squares

\[ \bar{x} = n^{-1}\sum_{i=1}^n x_i \quad\text{and}\quad \bar{x}_k = |C_k|^{-1}\sum_{i\in C_k} x_i \]

Total sum of squares and within sums of squares for cluster \(k\): \[ SS_{TOT} = \sum_{i=1}^n (x_i-\bar{x})^2 \quad\text{and}\quad SS_{W_k} = \sum_{i\in C_k} (x_i-\bar{x}_k)^2 \] Within sums of squares and between sums of squares: \[ SS_W = \sum_{k=1}^K SS_{W_k} \quad\text{and}\quad SS_B = SS_{TOT} - SS_W \]

\(SS_W\) can be made arbitrarily small and \(SS_B\) increases with \(K\).

Choosing \(K\)
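
One criterion for choosing \(K\) (used in the iris example below and in Exercise 1) is the Calinski and Harabasz index, which balances \(SS_B\) against \(SS_W\); larger values indicate a better clustering: \[ CH(K) = \frac{SS_B/(K-1)}{SS_W/(n-K)} \]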

Example (iris)

library(tidyverse)  # dplyr, ggplot2, tibble (and purrr for the commented map() call)

iris_noclass <- iris %>% as_tibble() %>% select(-Species)

iris %>% # full data (not iris_noclass), so points can be coloured by the true Species
  ggplot(aes(x = Petal.Length, y = Petal.Width)) + 
  geom_point(aes(colour = Species))

# i_k <- map(2:7, ~ kmeans(x = iris_noclass, centers = .x, nstart = 10))

i2 <- iris_noclass %>% kmeans(2, nstart = 10)
i3 <- iris_noclass %>% kmeans(3, nstart = 10)
i4 <- iris_noclass %>% kmeans(4, nstart = 10)
i5 <- iris_noclass %>% kmeans(5, nstart = 10)
i6 <- iris_noclass %>% kmeans(6, nstart = 10)
i7 <- iris_noclass %>% kmeans(7, nstart = 10)

iris_kmeans <- tibble(K = 2:7,
                      n = nrow(iris_noclass),
                      SS_B = c(i2$betweenss, i3$betweenss, i4$betweenss, i5$betweenss, i6$betweenss, i7$betweenss),
                      SS_W = c(i2$tot.withinss, i3$tot.withinss, i4$tot.withinss, i5$tot.withinss, i6$tot.withinss, i7$tot.withinss),
                      CH = (SS_B/(K-1))/(SS_W/(n-K)),
                      SS_ratio = SS_B/(SS_B+SS_W),
                      SS_W_drop = c(NA,(SS_W[-length(SS_W)] - SS_W[-1])/SS_W[-length(SS_W)])
                      )

ggplot(iris_kmeans, aes(x = K, y = CH)) + geom_line()

(iris_drop <- ggplot(iris_kmeans, aes(x = K, y = SS_W)) + geom_line())

iris_drop + geom_label(data = iris_kmeans %>% filter(K>2), aes(label = paste0(round(SS_W_drop*100,2), "%")))

Alternatives

Hierarchical clustering

Hierarchical clustering

Example 1

Example 2

Example 2 - Cont’d

What is Dissimilarity

To determine which objects are alike, and which are different, we need a (dis)similarity measure.

Let \(x_i\) and \(x_j\) be the \(i\)th and \(j\)th observation, respectively. A dissimilarity measure, \(d\), must satisfy:

  • \(d(x_i, x_j) \ge 0\) (non-negativity)
  • \(d(x_i, x_i) = 0\)
  • \(d(x_i, x_j) = d(x_j, x_i)\) (symmetry)

Pairwise dissimilarities in R

(data <- data.frame(x1 = c(0,1,3), x2=c(0,1,4)))
  x1 x2
1  0  0
2  1  1
3  3  4
(res <- dist(data, method="euclidean", diag=TRUE, upper=TRUE))
         1        2        3
1 0.000000 1.414214 5.000000
2 1.414214 0.000000 3.605551
3 5.000000 3.605551 0.000000
class(res)
[1] "dist"

To extend the set of dissimilarity measures available in R, the proxy package can be attached.

library(proxy)
summary(pr_DB)
* Similarity measures:
Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
Stiles, Tanimoto, Tschuprow, Yule, Yule2

* Distance measures:
Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker

(Dis)similarity between groups

Single linkage

Single linkage (nearest neighbour): the minimal distance between the observations from different clusters: \[ d_\text{Single linkage}(C_k, C_{k'}) = \min_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]

Concept

Example

Complete linkage

Complete linkage (most remote neighbour): the maximal distance between the observations from different clusters: \[ d_\text{Complete linkage}(C_k, C_{k'}) = \max_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]

Concept

Example

Average linkage

Average linkage: average distance between all pairs (one in each cluster) \[ d_\text{Average linkage}(C_k, C_{k'}) = \frac{1}{|C_k|{\cdot}|C_{k'}|}\sum_{i\in C_k, i'\in C_{k'}} d_{ii'}. \]

Concept

Example

Algorithm

  1. Begin with \(n\) observations and a measure (such as Euclidean distance) of all the \(\binom{n}{2} = n(n - 1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
  2. For \(i = n, n-1, \ldots, 2\):
    • Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
    • Compute the new pairwise inter-cluster dissimilarities among the \(i-1\) remaining clusters.

In R
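
A minimal sketch using hclust() and cutree() on the iris data (iris_noclass is reused from the \(K\)-means example; the input is the dissimilarity matrix from dist()):

d  <- dist(iris_noclass)              # pairwise Euclidean dissimilarities
hc <- hclust(d, method = "complete")  # also: "single", "average"
plot(hc)                              # the dendrogram
cutree(hc, k = 3)                     # cut the dendrogram into K = 3 clusters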

Combine PCA and clustering

If we combine PCA with other techniques we often obtain better results. This is due to the sphering of the data done by PCA. E.g. for classification/regression trees the splits are always perpendicular to the axes (the variables in the data). However, splitting may be more efficient if done along a sloped line through the data points. The PCA-rotated data are therefore at times better to use, both for classification and for clustering.

Example: Crabs

Crabs

Crabs

The dataset data(crabs, package = "MASS") contains five measurements done on 200 crabs of the species Leptograpsus variegatus. There are 50 female and 50 male crabs from each of two colours: blue and orange. Hence four groups of 50 crabs each.

PCA

First we do a PCA on the numerical data

Then we join the PCA rotated data with the sample information

Then we make a plot of the first three PCs
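
A sketch of these three steps (assuming the tidyverse is attached; the object names crabs_pca and crabs_scores are just examples):

data(crabs, package = "MASS")

# PCA on the five numerical measurements
crabs_pca <- crabs %>% select(FL, RW, CL, CW, BD) %>% prcomp(scale. = TRUE)

# Join the rotated data (the scores) with the sample information
crabs_scores <- as_tibble(crabs_pca$x) %>%
  bind_cols(crabs %>% select(sp, sex)) %>%
  mutate(sp_sex = paste(sp, sex, sep = "_"))

# Plot pairs of the first three PCs, e.g. PC2 against PC3
ggplot(crabs_scores, aes(x = PC2, y = PC3, colour = sp_sex)) + geom_point()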

\(K\)-means clustering

The \(K\)-means algorithm can then better identify the clusters in the data.

If we compare to the true structure in the data, the results for the raw data are not that impressive.

       sp_sex
cluster B_F B_M O_F O_M
      1  17  10   4   6
      2  18  16  14  19
      3   1   7  15  11
      4  14  17  17  14

Repeating the analysis on the PCA-rotated scores, the results are much better (a code sketch is given after the table).

       sp_sex
cluster B_F B_M O_F O_M
      1  46   7   1   0
      2   0  36   0   0
      3   1   0  46   0
      4   3   7   3  50
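
A sketch of how the two comparisons could be produced (it reuses crabs_scores from the PCA sketch above; PC2 and PC3 are used for the rotated data since PC1 mainly captures overall size, and the exact choice on the original slides may differ):

# K-means on the raw measurements
raw_km <- crabs %>% select(FL, RW, CL, CW, BD) %>% kmeans(centers = 4, nstart = 10)
table(cluster = raw_km$cluster, sp_sex = crabs_scores$sp_sex)

# K-means on selected PCA scores
pca_km <- crabs_scores %>% select(PC2, PC3) %>% kmeans(centers = 4, nstart = 10)
table(cluster = pca_km$cluster, sp_sex = crabs_scores$sp_sex)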

Caveats

Caveat

\(K\)-means

hclust with complete linkage

hclust with average linkage

hclust with single linkage

Caveats

K-means:

Generally, important choices with many options and large implications:

Dendrograms:

Only the vertical distances are important; the horizontal ordering is (almost) arbitrary, as there exist \(2^{n-1}\) orderings/permutations (chosen to be pretty).

Play around with dendrograms using library(dendextend)
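
For example, a small sketch (color_branches() is one of many dendextend helpers):

library(dendextend)

dend <- iris_noclass %>% dist() %>% hclust() %>% as.dendrogram()

# colour the branches according to a cut into 3 clusters, then plot
dend %>% color_branches(k = 3) %>% plot()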

Exercises

  1. Do the example in ?kmeans
    • Optional: Implement the Calinski and Harabasz index and find the best number of clusters
  2. Do the example in ?cutree
    • Do plot(hc) and use rect.hclust
  3. Do 10.5.1 “\(K\)-means Clustering” (Lab 2) in ISLR (An Introduction to Statistical Learning, p. 404)
  4. Do 10.5.2 “Hierarchical Clustering” (Lab 2) in ISLR (An Introduction to Statistical Learning, p. 406)
    • The last part on Correlation-based distance is optional