\(K\)-means

\(CH\)-index

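Implement in R the \(CH\)-index: \[ CH(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)} = \frac{SS_B/(K-1)}{SS_W/(n-K)}\] where \(B(K) = SS_B\) is the between-cluster sum of squares, \(W(K) = SS_W\) is the total within-cluster sum of squares, \(n\) is the number of observations and \(K\) is the number of clusters.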

Construct the function such that it takes a kmeans-object as input and returns the \(CH\)-index:

ch_index <- function(kmeans_obj){
  ...
  ch
}
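One possible way to fill in the skeleton above is sketched here. It assumes the input comes from stats::kmeans(), whose return value stores the between-cluster sum of squares in betweenss and the total within-cluster sum of squares in tot.withinss. The index is undefined for \(K = 1\) because of the division by \(K-1\).

```r
# Sketch of the CH-index for an object returned by stats::kmeans().
ch_index <- function(kmeans_obj) {
  n <- length(kmeans_obj$cluster)   # number of observations
  K <- nrow(kmeans_obj$centers)     # number of clusters
  ss_b <- kmeans_obj$betweenss      # between-cluster sum of squares, B(K)
  ss_w <- kmeans_obj$tot.withinss   # total within-cluster sum of squares, W(K)
  ch <- (ss_b / (K - 1)) / (ss_w / (n - K))
  ch
}

# Usage on the numeric columns of iris (seed fixed because kmeans starts randomly)
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
ch_index(km)
```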

iris data

Make a \(K\)-means analysis of the iris dataset where you discard the Species column.

Identify the appropriate \(K\) using the ch_index function from above.

For \(K=3\), compare the clustering with the Species column in iris. How well did you do?
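A minimal sketch of the three steps above, computing the \(CH\)-index inline so that the block runs on its own. The range \(K = 2, \dots, 8\), the seed, nstart = 20 and the use of unscaled variables are assumptions made for illustration. Purity is only one of several possible agreement summaries.

```r
X <- iris[, 1:4]                 # discard the Species column
set.seed(1)                      # kmeans uses random starting centers
Ks <- 2:8                        # candidate numbers of clusters (assumed range)

fits <- lapply(Ks, function(k) kmeans(X, centers = k, nstart = 20))

# CH-index for each fit: (SS_B / (K - 1)) / (SS_W / (n - K))
ch <- sapply(fits, function(f) {
  n <- length(f$cluster)
  K <- nrow(f$centers)
  (f$betweenss / (K - 1)) / (f$tot.withinss / (n - K))
})
names(ch) <- Ks
ch
best_K <- Ks[which.max(ch)]      # K with the largest CH-index

# Compare the K = 3 clustering with the true species
km3 <- fits[[which(Ks == 3)]]
tab <- table(cluster = km3$cluster, species = iris$Species)
tab

# Purity: share of observations that fall in the majority species of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

Cluster labels are arbitrary, so the rows of the table should be read up to a permutation of the labels.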

crabs data

Recall the crabs data from the MASS package used in the PCA analysis (load in R: data(crabs, package = "MASS")).

  • Construct a new “class” which is the species (colour) and sex: crabs %>% unite(col = sp_sex, sp, sex, sep = "_")
  • Make a \(K\)-means cluster analysis with \(K = 4\) on the variables FL:BD and compare with the sp_sex variable.
  • Repeat the PCA analysis from the lecture. Make a \(K\)-means analysis using only the 2nd and 3rd principal components, e.g. princomp(crabs %>% select(FL:BD), cor = TRUE)$scores or prcomp(crabs %>% select(FL:BD), scale. = TRUE)$x.
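The tasks above could be approached roughly as follows. This sketch uses base R instead of the tidyverse so that it only depends on MASS: paste() stands in for unite(), and the columns FL:BD are selected by name. The seed and nstart = 20 are assumptions.

```r
data(crabs, package = "MASS")

# Combined class of species (colour) and sex, e.g. "B_M"; base-R analogue of unite()
sp_sex <- paste(crabs$sp, crabs$sex, sep = "_")

# The five morphological measurements FL:BD
X <- crabs[, c("FL", "RW", "CL", "CW", "BD")]

set.seed(1)

# K-means with K = 4 on the raw measurements
km_raw <- kmeans(X, centers = 4, nstart = 20)
tab_raw <- table(cluster = km_raw$cluster, sp_sex)
tab_raw

# K-means with K = 4 on the 2nd and 3rd principal components only
pca <- prcomp(X, scale. = TRUE)
km_pc <- kmeans(pca$x[, 2:3], centers = 4, nstart = 20)
tab_pc <- table(cluster = km_pc$cluster, sp_sex)
tab_pc
```

Comparing the two tables shows how much of the sp_sex structure each clustering recovers.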

Hierarchical clustering

Boys names

Load day05_boys_names.csv. NB! To avoid the names being converted into factors, use stringsAsFactors = FALSE in read.csv, or use readr::read_csv, which does not convert strings to factors.

boys_names <- read_csv("day05_boys_names.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   origin = col_character()
## )
boys_names %>% head(n = 3)
## # A tibble: 3 x 2
##   name   origin 
##   <chr>  <chr>  
## 1 Petros Greek  
## 2 Peter  English
## 3 Piotr  Polish
  1. Use the adist-function to determine the Levenshtein distance between the names.
  2. Use agnes and/or hclust to make hierarchical clusters of the names.
  3. Use cutree to partition the names into three clusters.
  4. Is this the same partition as the one obtained using \(k\)-medoids (pam)? See ?pam for the types of arguments.
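The four steps above can be sketched as follows. Since the CSV file is not available here, the vector nms is a small illustrative stand-in: only the first three names appear in the data preview, the rest are assumed for the example. Average linkage is likewise an arbitrary choice.

```r
library(cluster)   # provides pam() and agnes()

# Illustrative stand-in for boys_names$name (only the first three are from the data preview)
nms <- c("Petros", "Peter", "Piotr", "Pedro",
         "Johan", "John", "Juan", "Giovanni",
         "Mikael", "Michael", "Miguel", "Michal")

# 1. Levenshtein distances between all pairs of names
d_mat <- adist(nms)
dimnames(d_mat) <- list(nms, nms)
d <- as.dist(d_mat)

# 2. Hierarchical clustering (average linkage chosen for illustration)
hc <- hclust(d, method = "average")

# 3. Cut the tree into three clusters
cl_hc <- cutree(hc, k = 3)

# 4. k-medoids on the same dissimilarities
pm <- pam(d, k = 3, diss = TRUE)

# Cross-tabulate the two partitions (labels are arbitrary)
tab <- table(hclust = cl_hc, pam = pm$clustering)
tab
```

If every row and every column of the table has exactly one non-zero cell, the two methods give the same partition.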

Ancestry informative markers

Load the day05_aims_freq.csv data set. It contains allele frequencies for a number of human populations genotyped at 128 AIMs (Ancestry Informative Markers).

AIMs are Single Nucleotide Polymorphism (SNP) markers, i.e. genetic markers where only a single DNA base (A, C, T or G) varies between individual genomes.

A SNP is typically named rs123456, where rs is short for “Reference SNP cluster ID” and 123456 can be linked to a specific position in the genome using a database (dbSNP).

Typically SNPs can only attain two states, e.g. A and T. AIMs can be used to estimate the geographic origin of a DNA profile.

aims_freq <- read_csv("day05_aims_freq.csv")

library(ggplot2)
ggplot() + borders("world", colour="gray50", fill="gray50") + 
  geom_text(data = aims_freq, aes(x = long, y = lat, label = pop))

Hint: It is often a good idea to preserve the observations’ names by storing them as row names. Hence, for the next analyses, use a data.frame (rather than a tibble) with the population names as row names (this implies that e.g. the dendrograms will be labelled with the population names):

aims_freqs <- aims_freq %>% select(starts_with("rs")) %>% as.data.frame()
rownames(aims_freqs) <- aims_freq$pop

PCA

  • Make a PCA of the aims_freqs data.
  • Plot the results in the PC1 and PC2 plane.
  • How many PCs would be sufficient to use in this case?
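One common way to judge how many PCs are sufficient is the cumulative proportion of variance explained. The sketch below uses the built-in USArrests data purely as a stand-in for aims_freqs (the CSV is not available here). The 90% threshold and scale. = TRUE are assumptions; for allele frequencies, which already share a common scale, leaving the data unscaled may be equally defensible.

```r
X <- USArrests                       # stand-in; replace with aims_freqs

pca <- prcomp(X, scale. = TRUE)

# Proportion of the total variance explained by each PC, and its running sum
var_expl <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_expl)
round(cum_var, 3)

# Smallest number of PCs reaching the (assumed) 90% threshold
n_pc <- which(cum_var >= 0.9)[1]
n_pc

# Coordinates in the PC1-PC2 plane; row names carry the observation labels
head(pca$x[, 1:2])
```

The same numbers are reported by summary(pca), and screeplot(pca) shows them graphically.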

Hierarchical clustering

Make a hierarchical cluster analysis of the allele frequencies in aims_freq.csv. That is, select the columns with allele frequencies.

Which combination of dist method and linkage gives the best concordance with the geographical grouping?
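A systematic comparison could loop over distance methods and linkages and score each resulting partition against a known grouping. The sketch below uses iris with Species as a stand-in for the allele frequencies and a geographical grouping; the factor groups, the purity() helper and the chosen methods are all assumptions for illustration, since the data set itself only provides pop, long and lat.

```r
X <- iris[, 1:4]                 # stand-in; replace with aims_freqs
groups <- iris$Species           # stand-in for a (hypothetical) geographical grouping factor

# Hypothetical helper: share of observations in the majority group of their cluster
purity <- function(cl, g) {
  tab <- table(cl, g)
  sum(apply(tab, 1, max)) / length(cl)
}

dist_methods <- c("euclidean", "manhattan", "maximum", "canberra")
linkages <- c("single", "complete", "average", "ward.D2")

# All combinations of distance method and linkage
res <- expand.grid(dist = dist_methods, linkage = linkages,
                   stringsAsFactors = FALSE)

res$purity <- mapply(function(dm, lk) {
  hc <- hclust(dist(X, method = dm), method = lk)
  cl <- cutree(hc, k = nlevels(groups))   # as many clusters as known groups
  purity(cl, groups)
}, res$dist, res$linkage)

# Best-scoring combinations first
res[order(-res$purity), ]
```

Purity is a crude score, so inspecting the dendrograms of the top combinations remains worthwhile.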

Recall that the proxy package gives a lot more dist-methods:

library(proxy)
summary(pr_DB)
## * Similarity measures:
## Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
## eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
## Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
## Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
## Stiles, Tanimoto, Tschuprow, Yule, Yule2
## 
## * Distance measures:
## Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
## fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
## Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker