This dataset contains the weight and two size measurements of 20 potatoes. Weight is in grams; sizes are in millimetres. The two sizes are length, the longest length across a potato, and width, the shortest length across a potato.
potatoes <- read.table("./data/potatoes.txt", header = TRUE)
head(potatoes, 4)
## weight length width
## 1 22 38 29
## 2 41 46 34
## 3 24 40 31
## 4 16 33 28
We are interested in modelling how weight changes when length and width change. We say that weight is the response variable while length and width are explanatory variables. A commonly used name for this setting is multiple regression. If only one variable is included as explanatory, the setting is called linear regression (or simple linear regression).
Initially, we plot weight against length and width to get an overview of the data, see Fig. (fig:linear:potato1). The relationship between weight and length seems to follow an approximately straight line. The relationship between weight and width, on the other hand, appears to have a curvature, which can perhaps be captured by a parabola. In both plots, however, it is clear that the relationships are not exact: points scatter around the straight line and the parabola.
pota <- lm(weight ~ length + width + I(width^2), data = potatoes)
tidy(pota)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.84 7.26 1.22 0.241
## 2 length 0.831 0.135 6.16 0.0000137
## 3 width -2.62 0.570 -4.59 0.000299
## 4 I(width^2) 0.0681 0.0121 5.62 0.0000386
Refer to the parameters generically as \(\beta_0, \dots, \beta_3\). Then the above corresponds to the model \[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i ,
\] where \(x_1\) is length, \(x_2\) is width and \(x_3\) is width squared.
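As a quick sanity check (a sketch; the length and width values below are made up for illustration), the fitted model can be used to predict the weight of a new potato. Note that predict() constructs the width-squared term automatically:

```r
# predict weight (grams) for a hypothetical potato with
# length 40 mm and width 30 mm (illustrative values only)
new_potato <- data.frame(length = 40, width = 30)
predict(pota, newdata = new_potato)
```

Approximately the same number can be obtained by hand from the rounded estimates as \(8.84 + 0.831 \cdot 40 - 2.62 \cdot 30 + 0.0681 \cdot 30^2\).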
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid, a form of vitamin C (coded as VC). In the following we shall also regard dose as a factor with three levels:
ToothGrowth <- ToothGrowth %>%
mutate(dosef = factor(dose, levels = c(0.5, 1, 2),
labels = c("LO", "ME", "HI")))
len | supp | dose | dosef |
---|---|---|---|
4.2 | VC | 0.5 | LO |
11.5 | VC | 0.5 | LO |
7.3 | VC | 0.5 | LO |
5.8 | VC | 0.5 | LO |
6.4 | VC | 0.5 | LO |
10.0 | VC | 0.5 | LO |
11.2 | VC | 0.5 | LO |
11.2 | VC | 0.5 | LO |
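To verify that the recoding worked as intended, a quick cross-tabulation of dosef against dose can be used (a sanity check, not part of the analysis itself):

```r
# each dose level should map to exactly one dosef level
table(ToothGrowth$dosef, ToothGrowth$dose)
```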
ggplot(ToothGrowth, aes(dose, len)) + geom_point() + facet_grid(~ supp)
The average length in each of the \(2 \times 3 = 6\) groups is:
toothInt <- ToothGrowth %>% group_by(dose, supp) %>% summarise(val = mean(len))
toothInt
## # A tibble: 6 x 3
## # Groups: dose [3]
## dose supp val
## <dbl> <fct> <dbl>
## 1 0.5 OJ 13.2
## 2 0.5 VC 7.98
## 3 1 OJ 22.7
## 4 1 VC 16.8
## 5 2 OJ 26.1
## 6 2 VC 26.1
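The same averages can also be arranged as a two-way table, which makes it easy to compare OJ and VC at each dose (an alternative view of toothInt):

```r
# mean tooth length by dose (rows) and delivery method (columns)
with(ToothGrowth, tapply(len, list(dose, supp), mean))
```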
An interaction plot is a graphical display of the group averages, see Fig. (fig:linear:interaction2); here the variability in each group is displayed as well:
ggplot(ToothGrowth, aes(x = factor(dose), y = len, colour = supp)) +
geom_boxplot(outlier.shape = 4) +
geom_point(data = toothInt, aes(y = val)) +
geom_line(data = toothInt, aes(y = val, group = supp))
We fit a model where dose is regarded as a factor:
tooth2 <- lm(len ~ supp + dosef, data = ToothGrowth)
tidy(tooth2)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.5 0.988 12.6 5.49e-18
## 2 suppVC -3.7 0.988 -3.74 4.29e- 4
## 3 dosefME 9.13 1.21 7.54 4.38e-10
## 4 dosefHI 15.5 1.21 12.8 2.85e-18
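To see exactly which dummy variables lm() constructed from the factors, one can inspect the model matrix (shown here for the first few rows only; a sketch for illustration):

```r
# first rows of the design matrix:
# columns are (Intercept), suppVC, dosefME, dosefHI
head(model.matrix(tooth2), 4)
```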
Refer to parameters generically as \(\beta_0, \dots, \beta_3\). Then the above corresponds to the model \[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i ,
\] where \(x_1\) is \(1\) when supp = VC and \(0\) otherwise, \(x_2\) is \(1\) when dosef = ME and \(0\) otherwise, and \(x_3\) is \(1\) when dosef = HI and \(0\) otherwise. The interpretation is straightforward: think of a dummy variable as a way of switching a parameter on and off:

- dosef = LO and supp = OJ: only \(\beta_0\) is switched on, so \(\beta_0\) is the estimated mean.
- dosef = LO and supp = VC: \(\beta_0\) and \(\beta_1\) are switched on, so the estimated mean is \(\beta_0 + \beta_1\).
- dosef = ME and supp = OJ: \(\beta_0\) and \(\beta_2\) are switched on, so the estimated mean is \(\beta_0 + \beta_2\).
- dosef = ME and supp = VC: \(\beta_0\), \(\beta_1\) and \(\beta_2\) are switched on, so the estimated mean is \(\beta_0 + \beta_1 + \beta_2\).

Walker (1962) studied the mating songs of male tree crickets. Each wingstroke by a cricket produces a pulse of song, and females may use the number of pulses per second to identify males of the correct species. Walker (1962) wanted to know whether the chirps of the crickets *Oecanthus exclamationis* and *Oecanthus niveus* had different pulse rates. He measured the pulse rate of the crickets (variable pps) at a variety of temperatures (temp):
crick <- read.table("data/cricket.txt", header = TRUE)
summary(crick)
## species temp pps
## ex :14 Min. :17.2 Min. : 44.30
## niv:16 1st Qu.:20.8 1st Qu.: 59.17
## Median :24.0 Median : 76.15
## Mean :23.6 Mean : 72.49
## 3rd Qu.:26.2 3rd Qu.: 84.45
## Max. :30.4 Max. :101.70
head(crick, 4)
## species temp pps
## 1 ex 20.8 67.9
## 2 ex 20.8 65.1
## 3 ex 24.0 77.3
## 4 ex 24.0 78.7
ggplot(crick, aes(temp, pps, color = species)) + geom_point()
The main purpose here is to compare pps (more precisely, the mean of pps) for the two species when we take into account that the measurements were taken at different temperatures. Consider this summary of the data:
crick %>%
group_by(species) %>%
summarise(m_pps = mean(pps), m_temp = mean(temp))
## # A tibble: 2 x 3
## species m_pps m_temp
## <fct> <dbl> <dbl>
## 1 ex 85.6 25.8
## 2 niv 61.0 21.7
crm2 <- lm(pps ~ species + temp, data = crick)
tidy(crm2)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.85 2.76 -2.85 8.36e- 3
## 2 speciesniv -9.90 0.786 -12.6 8.19e-13
## 3 temp 3.63 0.106 34.4 7.91e-24
Refer to the parameters generically as \(\beta_0, \dots, \beta_2\). Then the above corresponds to the model \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i , \]
where \(x_1\) is \(1\) if species = niv and zero otherwise, and \(x_2\) is temp.
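The fitted model therefore describes two parallel lines with common slope \(\beta_2\) and intercepts \(\beta_0\) (for ex) and \(\beta_0 + \beta_1\) (for niv). A sketch of how to read these off the coefficients:

```r
cf <- coef(crm2)
# line for ex:  pps = cf["(Intercept)"] + cf["temp"] * temp
# line for niv: pps = (cf["(Intercept)"] + cf["speciesniv"]) + cf["temp"] * temp
intercept_ex  <- cf["(Intercept)"]
intercept_niv <- cf["(Intercept)"] + cf["speciesniv"]
slope         <- cf["temp"]
```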
ggplot(crick, aes(temp, pps, color = species)) +
geom_point() +
geom_line(aes(temp, predict(crm2)))
Theory says that the pressure and temperature of an ideal gas are proportional and that there is a temperature, called absolute zero, at which the pressure is zero. Absolute zero is typically reported as \(-273.15\) degrees Celsius.
Paired observations are:
t <- c(21.01, 28.1, 38.0, 49, 60, 69, 82.8) # degrees Celsius
p <- c(100.3, 103.2, 102, 110, 114, 116, 120) # kilopascal (kPa)
qplot(t, p)
Fit a linear regression model to data where \(p\) is the response and \(t\) is the explanatory variable.
Estimate the temperature \(t_0\) at which the pressure is zero. Is the estimate close to \(-273.15\) degrees Celsius?
Find the standard error of \(t_0\); find a confidence interval for \(t_0\). Are the data consistent with \(-273.15\) degrees Celsius?
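One possible approach to the last two questions (a sketch, not the only solution): writing the fit as \(p = \beta_0 + \beta_1 t\), the pressure is zero at \(t_0 = -\hat\beta_0/\hat\beta_1\), and a standard error for this ratio can be approximated with the delta method, here combined with a normal-approximation 95% interval:

```r
gas <- lm(p ~ t)
b <- coef(gas)
V <- vcov(gas)
t0 <- -b[1] / b[2]                 # temperature where fitted pressure is zero
# delta method: gradient of t0 = -b0/b1 with respect to (b0, b1)
g <- c(-1 / b[2], b[1] / b[2]^2)
se_t0 <- sqrt(t(g) %*% V %*% g)
c(estimate = t0, lower = t0 - 1.96 * se_t0, upper = t0 + 1.96 * se_t0)
```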