library(tidyverse)
theme_set(theme_bw())

1 Principal Component Analysis

In high-dimensional problems it may be hard to obtain insights from pairwise plots of the different variables in the data. There are \({p \choose 2} = p(p-1)/2\) combinations (when ignoring trivial flips of the axes). Hence, even for moderate \(p\) this is an intractable number of plots.

A solution is to rotate the data to obtain views of the data which contain as much information as possible. One criterion for “information” is variability.

Recall that for two random variables \(X\) and \(Y\), we know that the linear combination \(Z = aX+bY\) for scalars \(a\) and \(b\) has variance given by: \[\mathbb{V}(Z) = \mathbb{V}(aX + bY) = a^2\mathbb{V}(X) + b^2\mathbb{V}(Y) + 2ab\mathbb{C}(X,Y),\] where \(\mathbb{V}(X)\) is the variance of \(X\) and \(\mathbb{C}(X,Y)\) the covariance of \(X\) and \(Y\).
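As a quick numerical check (a small sketch with simulated data and arbitrary values of \(a\) and \(b\)), the sample variance of \(aX + bY\) agrees with this formula:

## Numerical check of the variance formula for a linear combination
set.seed(1)
x <- rnorm(1e5)
y <- 0.5 * x + rnorm(1e5)
a <- 0.6; b <- 0.8                                    # note that a^2 + b^2 = 1
var(a * x + b * y)                                    # empirical variance of the combination
a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)   # the formula above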

We can choose \(a\) and \(b\) arbitrarily, but we would like the variance of \(aX+bY\) to be maximal. For this to be well-defined we must constrain \(a\) and \(b\) by requiring \(a^2 + b^2 = 1\); otherwise the variance could be made arbitrarily large. The values of \(a\) and \(b\) then define the rotation of \(X\) and \(Y\) that holds the maximum variance.

This idea is the basis of PCA: rotate the data such that the resulting components have maximum variance. For each subsequent component to differ from the previous ones, we impose the requirement that it is uncorrelated with all previous components.

1.1 Illustrative example (in two dimensions)

The data shown below gives an example of how the intrinsic dimension of a data set may be smaller than the actual number of columns in the data.

ISLR Fig 6.14

We see that the first and second PC necessarily need to be perpendicular to each other in this simple two-dimensional case in order to be uncorrelated. Below, the green line shows PC1, whereas the dashed lines show the perpendicular distances from each observation to PC1. From the definition of PC2, these distances (computed with sign) are therefore the PC2 scores.
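The Advertising data from ISLR is not shipped with R, but the same point can be illustrated with simulated stand-ins for Population and Ad Spending (a small sketch): the signed perpendicular distance from a centred observation to the PC1 line is exactly its PC2 score.

## Sketch with simulated 2-D data: signed distances to PC1 equal the PC2 scores
set.seed(1)
pop <- rnorm(100, mean = 30, sd = 10)        # stand-in for Population
ad  <- 0.5 * pop + rnorm(100, sd = 3)        # stand-in for Ad Spending
Z   <- scale(cbind(pop, ad), scale = FALSE)  # centre the data
pca <- prcomp(Z)
d   <- Z %*% pca$rotation[, 2]               # signed perpendicular distance to the PC1 line
all.equal(as.numeric(d), pca$x[, 2])         # TRUE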

ISLR Fig 6.15

In the plot below we see how PC1 loads onto the underlying variables. Because PC1 is given by \(a_{1,1}\)Population + \(a_{1,2}\)Ad Spending, we expect this linear relationship between PC1 and the variables. However, for higher-dimensional data the loadings \(a_{k, j}\) may be zero for various combinations of \(k\) (PC) and \(j\) (variable).

ISLR Fig 6.16

In the plot below we see that PC2 is less correlated with the variables, which also suggests that PC1 captures most of the variability.

ISLR Fig 6.17

1.2 PCA is found by eigenvalues

The general problem of finding the PCA for an \(n\times p\) data matrix \(X\) can be solved by finding the eigenvalues of the covariance (or correlation) matrix. Let \(\Sigma\) denote the covariance matrix of \(X\). That is, \(\mathbb{V}(X) = \Sigma\).

For a linear combination \(Xa\), where \(a\) is a \(p\times 1\) vector, the variance is \(a^\top\Sigma a\). Hence, we want to maximize this variance with respect to \(a\) while having \(a^\top a = 1\) (as argued for above). Differentiating the Lagrangian \(a^\top\Sigma a - \lambda(a^\top a - 1)\) with respect to \(a\) and setting the derivative to zero gives \[(\Sigma - \lambda I)a = 0,\] where \(\lambda\) is the Lagrange multiplier. Hence, the loading vectors \(a\) are eigenvectors of \(\Sigma\), and the corresponding eigenvalues \(\lambda = a^\top\Sigma a\) are the variances of the PCs. The PCs are unique up to a flip of sign.

For data where the variables are measured at different scales it is typically a good idea to scale the variables first. This corresponds to using the correlation matrix instead.
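As a small sketch, we can verify this directly in R: the eigenvectors of the correlation matrix are the PC loadings and the square roots of the eigenvalues are the PC standard deviations, matching prcomp() up to a flip of sign (here using the built-in USArrests data, analysed in more detail below).

## PCA by hand via the eigendecomposition of the correlation matrix
eig <- eigen(cor(USArrests))
eig$vectors                                 # loadings (columns), unique up to sign
sqrt(eig$values)                            # standard deviations of the PCs
scores <- scale(USArrests) %*% eig$vectors  # the rotated data (PC scores)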

1.3 The best approximating low-dimensional linear surface

PCA can also be viewed as finding the low-dimensional linear surface that best approximates the data, in the sense of minimising the average squared (Euclidean) distance from the observations to the surface.

ISLR Fig 10.2
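A small sketch of this view, again using the built-in USArrests data: reconstructing the scaled data from only the first two PCs gives the best two-dimensional linear approximation in the squared-error sense.

## Rank-2 reconstruction of the scaled data from the first two PCs
X    <- scale(USArrests)
pca  <- prcomp(X)
Xhat <- pca$x[, 1:2] %*% t(pca$rotation[, 1:2])  # best two-dimensional linear approximation
mean((X - Xhat)^2)                               # residual mean squared error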

2 Examples

2.1 PCA analysis of the USArrests data

First we may make a pairs plot of the data

pairs(USArrests)

GGally::ggpairs(USArrests)

We see that Murder, Assault and Rape are somewhat correlated with each other, and that there is more scatter when plotting UrbanPop against the three measures of crime.

We also clearly see that the variables are measured on different scales. Hence, it is advisable to use the correlation matrix rather than the covariance matrix for the PCA rotation.

head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
summary(USArrests)
     Murder          Assault         UrbanPop          Rape      
 Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
 1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
 Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
 Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
 3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
 Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00  

Hence, we use scale. = TRUE in the call to prcomp() to standardise the variables (that is, use the correlation matrix) rather than the unit-dependent covariances.

USArrests_pca <- prcomp(USArrests, scale. = TRUE)
USArrests_pca
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
plot(USArrests_pca)

After the PCA rotation, we can yet again make a pairs plot

GGally::ggpairs(USArrests_pca$x %>% as_tibble())

We notice that the pairwise correlations of the PCs are zero.
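A quick check (a small sketch) confirms this up to numerical precision:

round(cor(USArrests_pca$x), 10)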

The cumulative proportion of variance explained:

cumsum(USArrests_pca$sdev^2)/sum(USArrests_pca$sdev^2)
[1] 0.6200604 0.8675017 0.9566425 1.0000000
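As a small sketch, the per-component proportion of variance explained (PVE) and its cumulative version can also be plotted with ggplot2 (loaded via the tidyverse above):

## Per-component and cumulative proportion of variance explained
pve <- USArrests_pca$sdev^2 / sum(USArrests_pca$sdev^2)
tibble(PC = seq_along(pve), PVE = pve, Cumulative = cumsum(pve)) %>%
  pivot_longer(-PC) %>%
  ggplot(aes(PC, value, colour = name)) +
  geom_point() +
  geom_line() +
  labs(y = "Proportion of variance explained", colour = NULL)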

2.1.1 Biplots

A biplot shows both the observations (PC scores) and the variables (loadings) in the same plot.

par(mfrow = c(2,2), mar = c(1,1,1,1)*4)
biplot(USArrests_pca, scale = 0, choices = c(1,2))
biplot(USArrests_pca, scale = 0, choices = c(1,3))
biplot(USArrests_pca, scale = 0, choices = c(2,3))

## For comparison: PCA without scaling (i.e. based on the covariance matrix)
USArrests_pca_ <- prcomp(USArrests)
biplot(USArrests_pca_, scale = 0, choices = c(1,2))

Note that in ISLR the sign of the PCs is flipped to obtain an easier interpretation - since the PCs are only unique up to sign, this is merely a matter of presentation.

USArrests_pca$x <- -USArrests_pca$x                # flip the scores ...
USArrests_pca$rotation <- -USArrests_pca$rotation  # ... and the loadings, to keep them consistent

2.2 Crabs

In the MASS package we have the data set crabs that contains measurements on a specific species of crab (Leptograpsus).

Crabs

The data contain measurements on 50 male and 50 female crabs of each of two colour forms: blue and orange.

ccol <- function(sp) ifelse(sp == "B", "#0f7fa9", "#fa8d0f")  # colour by colour form
cpch <- function(sx) 1 + 15 * (sx == "M")                     # plotting symbol by sex

data(crabs, package = "MASS")
pairs(select(crabs, FL:BD), col = ccol(crabs$sp), pch = cpch(crabs$sex))

Let us consider a PCA rotation

(crab_pca <- prcomp(select(crabs, FL:BD), scale. = TRUE))
Standard deviations (1, .., p=5):
[1] 2.18834065 0.38946785 0.21594669 0.10552420 0.04137243

Rotation (n x k) = (5 x 5):
         PC1        PC2         PC3          PC4         PC5
FL 0.4520437  0.1375813  0.53076841  0.696923372  0.09649156
RW 0.4280774 -0.8981307 -0.01197915 -0.083703203 -0.05441759
CL 0.4531910  0.2682381 -0.30968155 -0.001444633 -0.79168267
CW 0.4511127  0.1805959 -0.65256956  0.089187816  0.57452672
BD 0.4511336  0.2643219  0.44316103 -0.706636423  0.17574331
crabs_pca <- predict(crab_pca) %>% as_tibble() ## Rotate the data

Plot the results

par(mfrow = c(2,2), mar = rep(4,4))
plot(crab_pca, main = "Screeplot for correlation matrix")
plot(PC1 ~ PC2, data = crabs_pca, col = ccol(crabs$sp), pch = cpch(crabs$sex))
plot(PC1 ~ PC3, data = crabs_pca, col = ccol(crabs$sp), pch = cpch(crabs$sex))
plot(PC2 ~ PC3, data = crabs_pca, col = ccol(crabs$sp), pch = cpch(crabs$sex))

2.3 \(t\)-SNE - an alternative to PCA for visualisation

One of the more recent alternatives to PCA is the so-called \(t\)-SNE method. In R it is implemented in the Rtsne package. In contrast to PCA, the \(t\)-SNE method works by finding a 2- or 3-dimensional (highly) non-linear mapping that preserves local structure in the higher-dimensional distances. There are several tuning parameters to take into account (most notably the perplexity). See Google Brain’s “How to Use t-SNE Effectively” for a great introduction.
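A minimal sketch on the crabs measurements, assuming the Rtsne package is installed (the perplexity value and the random seed are arbitrary choices):

## t-SNE on the scaled crabs measurements (results vary between runs)
library(Rtsne)
set.seed(1)
crabs_tsne <- Rtsne(scale(select(crabs, FL:BD)), perplexity = 30, check_duplicates = FALSE)
plot(crabs_tsne$Y, col = ccol(crabs$sp), pch = cpch(crabs$sex),
     xlab = "t-SNE 1", ylab = "t-SNE 2")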

2.3.1 PCA vs. t-SNE

Although both PCA and t-SNE have their own advantages and disadvantages, some key differences between PCA and t-SNE can be noted as follows:

  • \(t\)-SNE is computationally expensive and can take several hours on million-sample data sets where PCA will finish in seconds or minutes.
  • PCA is a deterministic, linear-algebraic technique, whereas t-SNE is a probabilistic one.
    • Linear dimensionality-reduction algorithms such as PCA concentrate on placing dissimilar data points far apart in the low-dimensional representation. However, to represent high-dimensional data on a low-dimensional, non-linear manifold, it is essential that similar data points are placed close together, which is something t-SNE does but PCA does not.
  • In \(t\)-SNE, different runs with the same hyper-parameters may produce different results; hence multiple plots should be inspected before making any assessment, while this is not the case with PCA.
  • Since PCA is a linear algorithm, it cannot capture complex non-linear relationships between features, while \(t\)-SNE is designed to capture exactly that.