This dataset contains the weight and two size measurements of 20 potatoes. Weight is in grams; sizes are in millimetres. The two sizes are length, the longest length across a potato, and width, the shortest length across a potato.
potatoes <- read.table("./data/potatoes.txt", header = TRUE)
head(potatoes, 4)
## weight length width
## 1 22 38 29
## 2 41 46 34
## 3 24 40 31
## 4 16 33 28
We are interested in modelling how weight changes when length and width change. We say that weight is the response variable while length and width are explanatory variables. A commonly used name for this setting is multiple regression. If only one variable is included as explanatory, the setting is called linear regression (or simple linear regression).
Initially, we plot weight against length and width to get an overview of the data, see Fig. (fig:linear:potato1). The relationship between weight and length seems to follow an approximately straight line. The relationship between weight and width, on the other hand, appears to have a curvature, which can perhaps be captured by a parabola. In both plots, however, it is clear that the relationships are not exact: points scatter around the straight line and the parabola.
pota <- lm(weight ~ length + width + I(width^2), data = potatoes)
tidy(pota)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.84 7.26 1.22 0.241
## 2 length 0.831 0.135 6.16 0.0000137
## 3 width -2.62 0.570 -4.59 0.000299
## 4 I(width^2) 0.0681 0.0121 5.62 0.0000386
Refer to the parameters generically as \(\beta_0, \dots, \beta_3\). Then the above corresponds to the model \[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i ,
\] where \(x_1\) is length, \(x_2\) is width and \(x_3\) is width squared.
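As a quick sanity check (a sketch; the length and width values below are made up for illustration), the fitted model can be used to predict the weight of a new potato. Note that predict() constructs the width-squared term automatically:

```r
# predict weight (grams) for a hypothetical potato with
# length 40 mm and width 30 mm (illustrative values only)
new_potato <- data.frame(length = 40, width = 30)
predict(pota, newdata = new_potato)
```

Approximately the same number can be obtained by hand from the rounded estimates as \(8.84 + 0.831 \cdot 40 - 2.62 \cdot 30 + 0.0681 \cdot 30^2\).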
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid, a form of vitamin C (coded as VC). In the following we shall also regard dose as a factor with three levels:
ToothGrowth <- ToothGrowth %>%
mutate(dosef = factor(dose, levels = c(0.5, 1, 2),
labels = c("LO", "ME", "HI")))
len | supp | dose | dosef |
---|---|---|---|
4.2 | VC | 0.5 | LO |
11.5 | VC | 0.5 | LO |
7.3 | VC | 0.5 | LO |
5.8 | VC | 0.5 | LO |
6.4 | VC | 0.5 | LO |
10.0 | VC | 0.5 | LO |
11.2 | VC | 0.5 | LO |
11.2 | VC | 0.5 | LO |
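To verify that the recoding worked as intended, a quick cross-tabulation of dosef against dose can be used (a sanity check, not part of the analysis itself):

```r
# each dose level should map to exactly one dosef level
table(ToothGrowth$dosef, ToothGrowth$dose)
```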
ggplot(ToothGrowth, aes(dose, len)) + geom_point() + facet_grid(~ supp)
The average length in each of the \(2 \times 3 = 6\) groups is:
toothInt <- ToothGrowth %>% group_by(dose, supp) %>% summarise(val = mean(len))
toothInt
## # A tibble: 6 x 3
## # Groups: dose [3]
## dose supp val
## <dbl> <fct> <dbl>
## 1 0.5 OJ 13.2
## 2 0.5 VC 7.98
## 3 1 OJ 22.7
## 4 1 VC 16.8
## 5 2 OJ 26.1
## 6 2 VC 26.1
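The same averages can also be arranged as a two-way table, which makes it easy to compare OJ and VC at each dose (an alternative view of toothInt):

```r
# mean tooth length by dose (rows) and delivery method (columns)
with(ToothGrowth, tapply(len, list(dose, supp), mean))
```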
An interaction plot is a graphical display of the group averages, see Fig. (fig:linear:interaction2); here the variability in each group is displayed as well:
ggplot(ToothGrowth, aes(x = factor(dose), y = len, colour = supp)) +
geom_boxplot(outlier.shape = 4) +
geom_point(data = toothInt, aes(y = val)) +
geom_line(data = toothInt, aes(y = val, group = supp))
We fit a model where dose is regarded as a factor:
tooth2 <- lm(len ~ supp + dosef, data = ToothGrowth)
tidy(tooth2)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.5 0.988 12.6 5.49e-18
## 2 suppVC -3.7 0.988 -3.74 4.29e- 4
## 3 dosefME 9.13 1.21 7.54 4.38e-10
## 4 dosefHI 15.5 1.21 12.8 2.85e-18
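To see exactly which dummy variables lm() constructed from the factors, one can inspect the model matrix (shown here for the first few rows only; a sketch for illustration):

```r
# first rows of the design matrix:
# columns are (Intercept), suppVC, dosefME, dosefHI
head(model.matrix(tooth2), 4)
```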
Refer to parameters generically as \(\beta_0, \dots, \beta_3\). Then the above corresponds to the model \[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i ,
\] where \(x_1\) is \(1\) when supp = VC and \(0\) otherwise, \(x_2\) is \(1\) when dosef = ME and \(0\) otherwise, and \(x_3\) is \(1\) when dosef = HI and \(0\) otherwise. The interpretation is straightforward: think of a dummy variable as a way of switching a parameter on and off:

- dosef = LO and supp = OJ: only \(\beta_0\) is switched on, so \(\beta_0\) is the estimated mean.
- dosef = LO and supp = VC: \(\beta_0\) and \(\beta_1\) are switched on, so the estimated mean is \(\beta_0 + \beta_1\).
- dosef = ME and supp = OJ: \(\beta_0\) and \(\beta_2\) are switched on, so the estimated mean is \(\beta_0 + \beta_2\).
- dosef = ME and supp = VC: \(\beta_0\), \(\beta_1\) and \(\beta_2\) are switched on, so the estimated mean is \(\beta_0 + \beta_1 + \beta_2\).

Walker (1962) studied the mating songs of male tree crickets. Each wingstroke by a cricket produces a pulse of song, and females may use the number of pulses per second to identify males of the correct species. Walker (1962) wanted to know whether the chirps of the crickets *Oecanthus exclamationis* and *Oecanthus niveus* had different pulse rates. He measured the pulse rate of the crickets (variable pps) at a variety of temperatures (temp):
crick <- read.table("data/cricket.txt", header = TRUE)
summary(crick)
## species temp pps
## ex :14 Min. :17.2 Min. : 44.30
## niv:16 1st Qu.:20.8 1st Qu.: 59.17
## Median :24.0 Median : 76.15
## Mean :23.6 Mean : 72.49
## 3rd Qu.:26.2 3rd Qu.: 84.45
## Max. :30.4 Max. :101.70
head(crick, 4)
## species temp pps
## 1 ex 20.8 67.9
## 2 ex 20.8 65.1
## 3 ex 24.0 77.3
## 4 ex 24.0 78.7
ggplot(crick, aes(temp, pps, color = species)) + geom_point()
The main purpose here is to compare pps (more precisely, the mean of pps) for the two species when we take into account that the measurements were taken at different temperatures. Consider this summary of the data:
crick %>%
group_by(species) %>%
summarise(m_pps = mean(pps), m_temp = mean(temp))
## # A tibble: 2 x 3
## species m_pps m_temp
## <fct> <dbl> <dbl>
## 1 ex 85.6 25.8
## 2 niv 61.0 21.7
crm2 <- lm(pps ~ species + temp, data = crick)
tidy(crm2)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.85 2.76 -2.85 8.36e- 3
## 2 speciesniv -9.90 0.786 -12.6 8.19e-13
## 3 temp 3.63 0.106 34.4 7.91e-24
Refer to the parameters generically as \(\beta_0, \dots, \beta_2\). Then the above corresponds to the model \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i , \]
where \(x_1\) is \(1\) if species = niv and zero otherwise, and \(x_2\) is temp.
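The fitted model therefore describes two parallel lines with common slope \(\beta_2\) and intercepts \(\beta_0\) (for ex) and \(\beta_0 + \beta_1\) (for niv). A sketch of how to read these off the coefficients:

```r
cf <- coef(crm2)
# line for ex:  pps = cf["(Intercept)"] + cf["temp"] * temp
# line for niv: pps = (cf["(Intercept)"] + cf["speciesniv"]) + cf["temp"] * temp
intercept_ex  <- cf["(Intercept)"]
intercept_niv <- cf["(Intercept)"] + cf["speciesniv"]
slope         <- cf["temp"]
```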
ggplot(crick, aes(temp, pps, color = species)) +
geom_point() +
geom_line(aes(temp, predict(crm2)))
Theory says that the pressure and temperature of an ideal gas are proportional and that there is a temperature, called absolute zero, at which the pressure is zero. Absolute zero is typically reported as \(-273.15\) degrees Celsius.
Paired observations are:
t <- c(21.01, 28.1, 38.0, 49, 60, 69, 82.8) # degrees Celsius
p <- c(100.3, 103.2, 102, 110, 114, 116, 120) # kilopascal (kPa)
qplot(t, p)
Fit a linear regression model to data where \(p\) is the response and \(t\) is the explanatory variable.
Estimate the temperature \(t_0\) at which the pressure is zero. Is the estimate close to \(-273.15\) degrees Celsius?
Find the standard error of \(t_0\); find a confidence interval for \(t_0\). Are the data consistent with \(-273.15\) degrees Celsius?
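One possible approach to the last two questions (a sketch, not the only solution): writing the fit as \(p = \beta_0 + \beta_1 t\), the pressure is zero at \(t_0 = -\hat\beta_0/\hat\beta_1\), and a standard error for this ratio can be approximated with the delta method, here combined with a normal-approximation 95% interval:

```r
gas <- lm(p ~ t)
b <- coef(gas)
V <- vcov(gas)
t0 <- -b[1] / b[2]                 # temperature where fitted pressure is zero
# delta method: gradient of t0 = -b0/b1 with respect to (b0, b1)
g <- c(-1 / b[2], b[1] / b[2]^2)
se_t0 <- sqrt(t(g) %*% V %*% g)
c(estimate = t0, lower = t0 - 1.96 * se_t0, upper = t0 + 1.96 * se_t0)
```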