$K$-means

$CH$-index

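Implement the $CH$-index in R: \[ CH(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)} = \frac{SS_B/(K-1)}{SS_W/(n-K)} \] Here $B(K) = SS_B$ is the between-cluster sum of squares, $W(K) = SS_W$ is the within-cluster sum of squares, $n$ is the number of observations and $K$ is the number of clusters.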

Construct the function such that it takes a kmeans-object as input and returns the $CH$-index:

ch_index <- function(kmeans_obj){
  ...
  ch
}
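One possible way to fill in the skeleton above, offered as a minimal sketch rather than the intended solution. It assumes the components that `stats::kmeans` stores in its return value (`betweenss`, `tot.withinss`, `centers` and `cluster`), taking $SS_B$ to be `betweenss` and $SS_W$ to be `tot.withinss`.

```r
# Sketch of the CH-index for an object returned by stats::kmeans().
# Assumption: SS_B = betweenss and SS_W = tot.withinss.
ch_index <- function(kmeans_obj) {
  K <- nrow(kmeans_obj$centers)     # number of clusters
  n <- length(kmeans_obj$cluster)   # number of observations
  ch <- (kmeans_obj$betweenss / (K - 1)) /
    (kmeans_obj$tot.withinss / (n - K))
  ch
}

# Usage on the numeric columns of iris
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
ch_index(km)
```

Note that the index is undefined for $K = 1$, since the numerator divides by $K - 1$.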

`iris` data

Make a $K$-means analysis of the iris dataset where you discard the Species column.

Identify the appropriate $K$ using the ch_index function from above.
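A hedged sketch of how the search over $K$ could look. The range `2:10`, the seed and `nstart = 20` are arbitrary choices for illustration, and the `ch_index` definition included here is one possible implementation (assuming the components of a `stats::kmeans` object), repeated so the block runs on its own.

```r
# One possible CH-index implementation (assumes stats::kmeans components).
ch_index <- function(kmeans_obj) {
  K <- nrow(kmeans_obj$centers)
  n <- length(kmeans_obj$cluster)
  (kmeans_obj$betweenss / (K - 1)) / (kmeans_obj$tot.withinss / (n - K))
}

X <- iris[, 1:4]   # iris without the Species column

set.seed(1)
Ks <- 2:10         # candidate numbers of clusters (arbitrary range)
ch_values <- sapply(Ks, function(K) {
  ch_index(kmeans(X, centers = K, nstart = 20, iter.max = 50))
})
names(ch_values) <- Ks
ch_values

# The K with the largest CH-index is the suggested choice
best_K <- Ks[which.max(ch_values)]
best_K
```

Plotting `ch_values` against `Ks` is often more informative than only reading off the maximum.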

For $K=3$, compare the clustering with the Species column in iris. How well did you do?
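The comparison can be made with a cross-tabulation of cluster labels against species. The sketch below also computes a simple purity score (the share of observations that belong to the majority species of their cluster); this summary is an assumption added for illustration, not something the exercise prescribes.

```r
set.seed(1)
km3 <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Rows: K-means clusters (arbitrary labels); columns: true species
confusion <- table(cluster = km3$cluster, species = iris$Species)
confusion

# Share of observations in the majority species of their cluster
purity <- sum(apply(confusion, 1, max)) / sum(confusion)
purity
```

Because cluster numbers are arbitrary, the table should be read row by row rather than along the diagonal.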

`crabs` data

Recall the crabs data from the MASS package used in the PCA analysis (load in R: data(crabs, package = "MASS")).

Construct a new “class” which is the species (colour) and sex: crabs %>% unite(col = sp_sex, sp, sex, sep = "_")
Make a $K$-means cluster analysis with $K = 4$ on the variables FL:BD and compare with the sp_sex variable.
Repeat the PCA from the lecture. Make a $K$-means analysis using only the 2nd and 3rd principal components, e.g. princomp(crabs %>% select(FL:BD), cor = TRUE)$scores or prcomp(crabs %>% select(FL:BD), scale. = TRUE)$x.
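The three steps above could be sketched as follows. To keep the block free of non-base dependencies it builds the combined label with `paste()` instead of `unite()`; the seed and `nstart = 20` are arbitrary choices.

```r
data(crabs, package = "MASS")

# Base-R equivalent of unite(): combined species/sex label
sp_sex <- paste(crabs$sp, crabs$sex, sep = "_")

# The five morphological measurements FL:BD
X <- crabs[, c("FL", "RW", "CL", "CW", "BD")]

# K-means with K = 4 on the raw measurements
set.seed(1)
km_raw <- kmeans(X, centers = 4, nstart = 20)
table(cluster = km_raw$cluster, sp_sex = sp_sex)

# K-means with K = 4 on the 2nd and 3rd principal components only
pc <- prcomp(X, scale. = TRUE)$x
km_pc <- kmeans(pc[, 2:3], centers = 4, nstart = 20)
table(cluster = km_pc$cluster, sp_sex = sp_sex)
```

Comparing the two tables shows whether dropping the first principal component changes how well the clusters match `sp_sex`.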

Hierarchical clustering

Boys names

Load day05_boys_names.csv. NB! To prevent the names from being converted to factors, use stringsAsFactors = FALSE in read.csv, or use readr::read_csv, which never converts strings to factors.

library(readr)
boys_names <- read_csv("day05_boys_names.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   origin = col_character()
## )

boys_names %>% head(n = 3)

## # A tibble: 3 x 2
##   name   origin 
##   <chr>  <chr>  
## 1 Petros Greek  
## 2 Peter  English
## 3 Piotr  Polish

Use the adist-function to determine the Levenshtein distance between the names.
Use agnes and/or hclust to make hierarchical clusters of the names.
Use cutree to partition the names into three clusters.
Is this the same partition as the one obtained using $k$-medoids (pam)? See ?pam for the types of arguments it accepts.
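The four steps above might look like the sketch below. The vector `nms` is a made-up stand-in for the name column of day05_boys_names.csv (only the first three names appear in the data preview), and average linkage is an arbitrary choice.

```r
library(cluster)  # provides pam() and agnes()

# Hypothetical stand-in for boys_names$name
nms <- c("Petros", "Peter", "Piotr", "Pedro",
         "Johannes", "John", "Juan", "Jan",
         "Mikael", "Michael", "Miguel", "Michal")

# Levenshtein (generalized edit) distance between all pairs of names
d <- adist(nms)
dimnames(d) <- list(nms, nms)

# Hierarchical clustering, cut into three groups
hc <- hclust(as.dist(d), method = "average")
hc_groups <- cutree(hc, k = 3)

# k-medoids on the same dissimilarities
pam_fit <- pam(as.dist(d), k = 3, diss = TRUE)
pam_groups <- pam_fit$clustering

# Cross-tabulate the two partitions (labels are arbitrary)
table(hclust = hc_groups, pam = pam_groups)
```

`plot(hc)` draws the dendrogram with the names as leaf labels, because the distance matrix carries them as dimnames.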

Ancestry informative markers

Load the day05_aims_freq.csv data set. It contains allele frequencies for a number of human populations genotyped at 128 AIMs (Ancestry Informative Markers).

AIMs are Single Nucleotide Polymorphism (SNP) markers, i.e. genetic markers where only a single DNA base (A, C, T or G) varies between individual genomes.

A SNP is typically named rs123456, where rs is short for “Reference SNP cluster ID” and 123456 can be linked to a specific position in the genome using a database (dbSNP).

Typically SNPs can only attain two states, e.g. A and T. AIMs can be used to estimate the geographic origin of a DNA profile.

aims_freq <- read_csv("day05_aims_freq.csv")

library(ggplot2)
ggplot() + borders("world", colour="gray50", fill="gray50") + 
  geom_text(data = aims_freq, aes(x = long, y = lat, label = pop))

Hint: It is often a good idea to preserve the observations' names as the row names of a data.frame. For the following analyses, use a data.frame (rather than a tibble) with the population names as row names, so that e.g. the dendrograms are labelled with the appropriate names:

aims_freqs <- aims_freq %>% select(starts_with("rs")) %>% as.data.frame()
rownames(aims_freqs) <- aims_freq$pop

PCA

Make a PCA of the aims_freqs data.
Plot the results in the PC1 and PC2 plane.
How many PCs would be sufficient to use in this case?
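A sketch of the workflow for these three steps. Since the CSV file is not reproduced here, `toy_freqs` is a randomly generated stand-in with hypothetical population and marker names; with the real data, `aims_freqs` would take its place. Leaving the variables unscaled is an assumption, motivated by all frequencies lying on the same 0 to 1 scale.

```r
# Synthetic stand-in for aims_freqs: 12 hypothetical populations,
# 20 hypothetical markers, frequencies drawn uniformly at random.
set.seed(1)
toy_freqs <- as.data.frame(matrix(
  runif(12 * 20), nrow = 12,
  dimnames = list(paste0("pop", 1:12), paste0("rs", 1:20))
))

pca <- prcomp(toy_freqs)

# Proportion and cumulative proportion of variance explained per PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
cum_var

# Plots are only drawn in an interactive session
if (interactive()) {
  plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
  text(pca$x[, 1:2], labels = rownames(toy_freqs))
  screeplot(pca, type = "lines")
}
```

The cumulative proportions, together with the scree plot, give a basis for deciding how many PCs to keep.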

Hierarchical clustering

Make a hierarchical cluster analysis of the allele frequencies in day05_aims_freq.csv, using only the columns that contain allele frequencies.

Which dist method and linkage gives the best concordance with the geographical grouping?
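One way to organise the comparison is to loop over combinations of distance and linkage and score each against a known grouping. The sketch below uses synthetic data with hypothetical region labels, base R `dist` methods only, and a purity score as the concordance measure; all of these are assumptions made for illustration.

```r
# Synthetic stand-in: 3 hypothetical regions with 4 populations each,
# 20 markers, frequencies centred on a region-specific value.
set.seed(1)
region <- rep(c("A", "B", "C"), each = 4)
centre <- c(A = 0.2, B = 0.5, C = 0.8)[region]
toy_freqs <- matrix(
  pmin(pmax(rnorm(12 * 20, mean = centre, sd = 0.1), 0), 1),
  nrow = 12,
  dimnames = list(paste0(region, rep(1:4, times = 3)), paste0("rs", 1:20))
)

# All combinations of distance method and linkage to try
grid <- expand.grid(
  dist = c("euclidean", "manhattan", "maximum"),
  linkage = c("single", "complete", "average", "ward.D2"),
  stringsAsFactors = FALSE
)

# Purity: share of populations in the majority region of their cluster
grid$purity <- mapply(function(dm, lk) {
  hc <- hclust(dist(toy_freqs, method = dm), method = lk)
  tab <- table(cutree(hc, k = 3), region)
  sum(apply(tab, 1, max)) / sum(tab)
}, grid$dist, grid$linkage, USE.NAMES = FALSE)

grid[order(-grid$purity), ]
```

With the real data, the region labels would have to come from the geographic grouping of the populations, and inspecting the dendrograms remains useful alongside any single score.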

Recall that the proxy package gives a lot more dist-methods:

library(proxy)
summary(pr_DB)

## * Similarity measures:
## Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
## eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
## Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
## Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
## Stiles, Tanimoto, Tschuprow, Yule, Yule2
## 
## * Distance measures:
## Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
## fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
## Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker

Clustering exercises

Torben Tvedebrink

May, 2019
