The standard simple linear regression model \[ y_i = \alpha + \beta x_i + e_i, \quad i = 1, 2, \ldots, n \] assumes that the errors, the \(e_i\)’s, are i.i.d. as \(e_i \sim N(0, \sigma^2)\).
Here we discuss what happens if this assumption is violated, namely when the \(e_i\)'s are correlated. This can happen in many situations, e.g. in time series data.
Assume that the errors follow an autoregression \[ e_i = \phi e_{i-1} + u_i, \quad i = 2, 3, \ldots, n \] where \(|\phi| < 1\), \(e_1 = u_1\), and the \(u_i\)'s are i.i.d. \(N(0, \sigma^2)\) for \(i = 1, 2, \ldots, n\).
Note that we here assume a time ordering via the subscript \(i\); the \(e_i\)'s above form an AR(1) process.
Take-home message: If we analyse such data with an ordinary linear regression model, the parameter estimates are still unbiased (they are ok, although we could do better). However, the standard errors of the estimates can be very misleading, which can lead to overly confident conclusions.
Thus, instead of regressing \(y_i\) on \(x_i\), we regress \(y_i - \phi y_{i-1}\) on \(x_i - \phi x_{i-1}\) for \(i = 2, 3, \ldots, n\), where we for the moment assume that \(\phi\) is known. Note that we lose one observation.
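To see why this works, subtract \(\phi\) times the model equation for \(y_{i-1}\) from the one for \(y_i\): \[ y_i - \phi y_{i-1} = \alpha(1 - \phi) + \beta(x_i - \phi x_{i-1}) + \underbrace{e_i - \phi e_{i-1}}_{=\, u_i}, \] so the transformed regression has i.i.d. \(N(0, \sigma^2)\) errors.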
Note also that the intercept in the transformed model is \(\alpha(1-\phi)\), i.e. it has a factor of \(1-\phi\), which must be taken into account when recovering \(\alpha\).
If we do not know \(\phi\), we can estimate it by first regressing \(y_i\) on \(x_i\) to obtain residuals \(r_1, r_2, \ldots, r_n\), and then estimating \(\phi\) by the correlation between \((r_i, r_{i-1})\) for \(i = 2, 3, \ldots, n\). With this estimate we can go through the steps above, so we end up with a two-stage procedure. (There are alternatives to such a procedure, but we shall not go into detail here.)
Assume that \(x_i = i\).
library(ggplot2)

n <- 100     # number of observations
alpha <- 5   # intercept
beta <- 1    # slope
sigma <- 2   # sd of the innovations u_i
phi <- 0.8   # AR(1) coefficient
x <- 1:n

set.seed(1)
u <- rnorm(n, mean = 0, sd = sigma)

# Build the AR(1) errors recursively: e_i = phi * e_{i-1} + u_i
e <- numeric(n)
e[1] <- u[1]
for (i in 2:n) {
  e[i] <- phi * e[i - 1] + u[i]
}

y <- alpha + beta * x + e
qplot(x, y) + geom_line()
# Naive fit that ignores the autocorrelation in the errors
m1 <- lm(y ~ x)
| term        | estimate | std.error |
|---|---|---|
| (Intercept) | 5.9      | 0.4880    |
| x           | 1.0      | 0.0084    |
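As an aside, coefficient tables of this form can be produced with, e.g., `broom::tidy()`; this is an assumption about the workflow, not shown in the original, and `summary(m1)` contains the same information:

```r
# Hypothetical way to obtain the table above (requires the broom package)
broom::tidy(m1)[, c("term", "estimate", "std.error")]
```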
Estimate \(\phi\) and fit a model using the estimate (`m2`):
# Estimate phi as the lag-1 correlation of the residuals from m1
r <- residuals(m1)
phi_est <- cor(tail(r, n - 1), head(r, n - 1))
phi_est
## [1] 0.6757454

# Transform response and covariate; one observation is lost
z <- tail(y, n - 1) - phi_est * head(y, n - 1)
x_new <- tail(x, n - 1) - phi_est * head(x, n - 1)
m2 <- lm(z ~ x_new)
| term        | estimate | std.error |
|---|---|---|
| (Intercept) | 2        | 0.379     |
| x_new       | 1        | 0.019     |
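Recall that the intercept of the transformed model estimates \(\alpha(1-\phi)\), so the original intercept can be recovered as \(\hat{\alpha} = 2/(1 - 0.676) \approx 6.2\).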
Assume instead that \(\phi\) is known and fit a model using the true value \(\phi = 0.8\) (`m3`):
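The code for `m3` is not shown in the original; a minimal sketch, reusing the variable names from above but transforming with the true \(\phi\):

```r
# Same transformation as for m2, but with the known phi = 0.8
# (overwrites z and x_new from above)
z <- tail(y, n - 1) - phi * head(y, n - 1)
x_new <- tail(x, n - 1) - phi * head(x, n - 1)
m3 <- lm(z ~ x_new)
```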
| term        | estimate | std.error |
|---|---|---|
| (Intercept) | 1.33     | 0.394     |
| x_new       | 0.99     | 0.032     |
We can do this many times (\(M = 100\)) and inspect the estimate of \(\phi\) and the coverage of the respective confidence intervals.
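The simulation code is not shown in the original; below is a minimal sketch of one way to do it, checking whether the confidence interval for the slope \(\beta\) covers the true value. The helper `sim_once` and the use of `confint()` (with its default 95% level) are my assumptions, not the original code; it reuses `n`, `alpha`, `beta`, `sigma`, `phi` and `x` from above. The tables below show the coverages obtained.

```r
# A sketch of the coverage simulation (assumed, not the original code)
sim_once <- function() {
  u <- rnorm(n, mean = 0, sd = sigma)
  e <- numeric(n)
  e[1] <- u[1]
  for (i in 2:n) e[i] <- phi * e[i - 1] + u[i]
  y <- alpha + beta * x + e

  m1 <- lm(y ~ x)  # ignores the autocorrelation

  # Two-stage procedure with estimated phi
  r <- residuals(m1)
  phi_est <- cor(tail(r, n - 1), head(r, n - 1))
  z2 <- tail(y, n - 1) - phi_est * head(y, n - 1)
  x2 <- tail(x, n - 1) - phi_est * head(x, n - 1)
  m2 <- lm(z2 ~ x2)

  # Transformation with the known phi
  z3 <- tail(y, n - 1) - phi * head(y, n - 1)
  x3 <- tail(x, n - 1) - phi * head(x, n - 1)
  m3 <- lm(z3 ~ x3)

  # Does the 95% CI (confint's default) for the slope contain beta?
  covers <- function(m) {
    ci <- confint(m)[2, ]
    ci[1] <= beta && beta <= ci[2]
  }
  c(m1 = covers(m1), m2 = covers(m2), m3 = covers(m3))
}

M <- 100
res <- replicate(M, sim_once())  # 3 x M matrix of TRUE/FALSE
rowMeans(res)                    # empirical coverage per method
```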
| method         | coverage |
|---|---|
| m1 (i.i.d.)    | 0.41     |
| m2 (est. phi)  | 0.73     |
| m3 (known phi) | 0.88     |

| method         | coverage |
|---|---|
| m1 (i.i.d.)    | 0.44     |
| m2 (est. phi)  | 0.85     |
| m3 (known phi) | 0.93     |
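Assuming the usual nominal level of 95%, these coverages confirm the take-home message: ignoring the autocorrelation (`m1`) gives badly miscalibrated confidence intervals, whereas the transformed regressions (`m2` and, especially, `m3` with the known \(\phi\)) come much closer to the nominal level.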