Implement the \(CH\)-index in R: \[ CH(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)} = \frac{SS_B/(K-1)}{SS_W/(n-K)} \]
Construct the function so that it takes a kmeans object as input and returns the \(CH\)-index:
ch_index <- function(kmeans_obj){
...
ch
}
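One possible way to fill in the skeleton, as a minimal sketch: it assumes that \(SS_B\) and \(SS_W\) correspond to the betweenss and tot.withinss components of the object returned by stats::kmeans, and that \(n\) is the sum of the cluster sizes. The name ch_index_sketch is only illustrative and not part of the exercise.

```r
# Sketch of the CH-index for an object returned by stats::kmeans().
# Assumption: SS_B = betweenss and SS_W = tot.withinss.
ch_index_sketch <- function(kmeans_obj) {
  K <- length(kmeans_obj$size)   # number of clusters
  n <- sum(kmeans_obj$size)      # number of observations
  ss_b <- kmeans_obj$betweenss
  ss_w <- kmeans_obj$tot.withinss
  (ss_b / (K - 1)) / (ss_w / (n - K))
}

# Usage on a built-in data set (numeric columns of iris only)
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
ch <- ch_index_sketch(km)
ch
```

Reading K and n from the size component keeps the function dependent on the kmeans object alone, so the original data does not have to be passed in.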
iris data
Make a \(K\)-means analysis of the iris dataset where you discard the Species column.
Identify the appropriate \(K\) using the ch_index function from above.
For \(K=3\), compare the clustering with the Species column in iris. How well did you do?
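The three iris steps could be approached along the following lines. This is a sketch under assumptions, not a reference solution: the range \(K = 2, \dots, 8\), the seed and nstart = 25 are arbitrary choices, and the \(CH\)-index is computed inline from the kmeans components so that the block runs on its own.

```r
# Numeric columns only: the Species column is discarded
x <- iris[, setdiff(names(iris), "Species")]

set.seed(1)
ks <- 2:8  # candidate numbers of clusters (arbitrary range)

# CH-index for each K, computed from betweenss and tot.withinss
ch_values <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (nrow(x) - k))
})
names(ch_values) <- ks
ch_values

# The K with the largest CH-index is the candidate choice
best_k <- ks[which.max(ch_values)]

# For K = 3, cross-tabulate the clusters against the known species
km3 <- kmeans(x, centers = 3, nstart = 25)
confusion <- table(cluster = km3$cluster, species = iris$Species)
confusion
```

Cluster labels are arbitrary, so the cross-table should be read by matching each cluster to the species it mostly contains rather than by comparing label values.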
crabs data
Recall the crabs data from the MASS package used in the PCA analysis (load in R: data(crabs, package = "MASS")).
Combine sp and sex into a single variable sp_sex: crabs %>% unite(col = sp_sex, sp, sex, sep = "_")
Cluster the crabs using the variables FL:BD and compare with the sp_sex variable.
The principal component scores can be obtained with princomp(crabs %>% select(FL:BD), cor = TRUE)$scores or prcomp(crabs %>% select(FL:BD), scale. = TRUE)$x.
Load day05_boys_names.csv. NB! To avoid the names being converted into factors, use stringsAsFactors = FALSE in read.csv, or use readr::read_csv, which does not do this!
boys_names <- read_csv("day05_boys_names.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## origin = col_character()
## )
boys_names %>% head(n = 3)
## # A tibble: 3 x 2
## name origin
## <chr> <chr>
## 1 Petros Greek
## 2 Peter English
## 3 Piotr Polish
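The edit-distance clustering asked for in this exercise can be sketched on a small toy vector. The vector below uses the three names shown in the output above plus three additions that are purely illustrative and not taken from the data file; the average linkage and the choice of two clusters are likewise assumptions made to keep the toy example readable.

```r
# Three names from the output above plus three illustrative (hypothetical) additions
nms <- c("Petros", "Peter", "Piotr", "Johan", "John", "Juan")

# Levenshtein distances via utils::adist(), labelled so dendrograms show the names
d_mat <- adist(nms)
dimnames(d_mat) <- list(nms, nms)
d <- as.dist(d_mat)

# Hierarchical clustering, then a partition into two clusters
hc <- hclust(d, method = "average")
hc_clusters <- cutree(hc, k = 2)

# Partitioning around medoids on the same dissimilarities
pam_fit <- cluster::pam(d, k = 2)
pam_clusters <- pam_fit$clustering

hc_clusters
pam_clusters
```

Both hclust and pam accept a dist object directly, so the distance matrix only has to be computed once; plot(hc) then draws the dendrogram with the names as labels.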
Use the adist function to determine the Levenshtein distance between the names.
Use agnes and/or hclust to make hierarchical clusters of the names.
Use cutree to partition the names into three clusters.
Also try pam. See ?pam for the type of arguments.
Load the day05_aims_freq.csv data set. This contains allele frequencies for a number of human populations genotyped at 128 AIMs (Ancestry Informative Markers).
AIMs are Single Nucleotide Polymorphism (SNP) markers, i.e. genetic markers where only a single DNA base (A, C, T or G) varies between individual genomes.
A SNP is typically named rs123456, where rs is short for “Reference SNP cluster ID” and 123456 can be linked to a specific position in the genome using a database (dbSNP).
Typically SNPs can only attain two states, e.g. A and T. AIMs can be used to estimate the geographic origin of a DNA profile.
aims_freq <- read_csv("day05_aims_freq.csv")
library(ggplot2)
ggplot() + borders("world", colour = "gray50", fill = "gray50") +
  geom_text(data = aims_freq, aes(x = long, y = lat, label = pop))
Hint: It is often a good idea to preserve the observations' names by using a data.frame with row names. Hence, for the next analyses, use a data.frame (rather than a tibble) with the population names as rownames (so that e.g. the dendrograms get the appropriate labels):
aims_freqs <- aims_freq %>% select(starts_with("rs")) %>% as.data.frame()
rownames(aims_freqs) <- aims_freq$pop
aims_freqs data.
Make a hierarchical cluster analysis of the allele frequencies in day05_aims_freq.csv. That is, select the columns with allele frequencies.
Which dist method and linkage give the best concordance with the geographical grouping?
Recall that the proxy package provides many more dist methods:
library(proxy)
summary(pr_DB)
## * Similarity measures:
## Braun-Blanquet, Chi-squared, correlation, cosine, Cramer, Dice,
## eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
## Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
## Pearson, Phi, Phi-squared, Russel, simple matching, Simpson,
## Stiles, Tanimoto, Tschuprow, Yule, Yule2
##
## * Distance measures:
## Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean,
## fJaccard, Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis,
## Manhattan, Minkowski, Podani, Soergel, supremum, Wave, Whittaker
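Comparing distance methods and linkages can be organised as a small grid search. The sketch below runs on simulated toy frequencies, not on the course data: the six populations, the two regions "A" and "B", the marker names and the agreement score are all hypothetical choices for illustration, and only the distance methods built into stats::dist are used.

```r
set.seed(42)

# Hypothetical toy data: 6 populations x 10 markers (NOT the course data)
region <- rep(c("A", "B"), each = 3)
base_freq <- rbind(A = runif(10, 0.1, 0.4), B = runif(10, 0.6, 0.9))
toy_freqs <- base_freq[region, ] + matrix(rnorm(60, sd = 0.03), nrow = 6)
toy_freqs <- pmin(pmax(toy_freqs, 0), 1)   # keep frequencies inside [0, 1]
rownames(toy_freqs) <- paste0(region, 1:3)
colnames(toy_freqs) <- paste0("rs", 1:10)

# All combinations of distance method and linkage to try
grid <- expand.grid(dist_method = c("euclidean", "manhattan", "canberra"),
                    linkage = c("single", "complete", "average"),
                    stringsAsFactors = FALSE)

# Agreement between a two-cluster cut and the known regions;
# cluster labels are arbitrary, so the better of the two labelings is taken
grid$agreement <- mapply(function(dm, lk) {
  hc <- hclust(dist(toy_freqs, method = dm), method = lk)
  cl <- cutree(hc, k = 2)
  same <- mean(cl == as.integer(factor(region)))
  max(same, 1 - same)
}, grid$dist_method, grid$linkage, USE.NAMES = FALSE)

grid
```

With the real data, toy_freqs would be replaced by the rownamed aims_freqs data.frame, and the label-swapping trick only works for two groups, so a cross-table against the geographical grouping is the more general check.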