1 Introduction

The standard simple linear regression model \[ y_i = \alpha + \beta x_i + e_i, \quad i = 1, 2, \ldots, n \] assumes that the errors, the \(e_i\)’s, are i.i.d. as \(e_i \sim N(0, \sigma^2)\).

Here we discuss what happens when this assumption is violated, specifically when the \(e_i\)’s are correlated. This occurs in many situations, e.g. in time series data.

2 Correlation between errors

Assume that the errors follow a first-order autoregression \[ e_i = \phi e_{i-1} + u_i, \quad i = 2, 3, \ldots, n \] where \(|\phi| < 1\), \(e_1 = u_1\), and the \(u_i\)’s are i.i.d. \(N(0, \sigma^2)\) for \(i = 1, 2, \ldots, n\).

Note that the \(i\) subscript now carries an ordering in time, so the \(e_i\)’s form an AR(1) process.
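For orientation, in the stationary limit of this process (which the recursion above approaches as \(i\) grows, given \(|\phi| < 1\)) we have \[ \operatorname{Var}(e_i) = \frac{\sigma^2}{1 - \phi^2}, \qquad \operatorname{Corr}(e_i, e_{i-k}) = \phi^k , \] so for \(\phi\) close to 1 neighbouring errors are strongly correlated even though the \(u_i\)’s are independent.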

Take-home message: If we analyse such data with the ordinary linear regression model, the parameter estimates remain unbiased (they are ok, although not efficient, so we could do better). However, the standard errors of the estimates can be very misleading, which can lead to overly confident conclusions.

3 Handling the correlations

Suppose \(\phi\) is known. To make inference in this situation, consider the difference \(z_i = y_i - \phi y_{i-1}\) for \(i = 2, 3, \ldots, n\) and note that \[\begin{align} z_i &= y_i - \phi y_{i-1} \\ &= (\alpha + \beta x_i + e_i) - \phi (\alpha + \beta x_{i-1} + e_{i-1}) \\ &= (1-\phi) \alpha + (x_i - \phi x_{i-1}) \beta + e_i - \phi e_{i-1} \\ &= (1-\phi) \alpha + (x_i - \phi x_{i-1}) \beta + u_i . \end{align}\]

Thus, instead of regressing \(y_i\) on \(x_i\), we regress \(z_i = y_i - \phi y_{i-1}\) on \(x_i - \phi x_{i-1}\), assuming that \(\phi\) is known. Note that we lose one observation in doing so.

Note also that the intercept in this regression is \((1-\phi)\alpha\), i.e. it carries a factor of \(1-\phi\); to recover an estimate of \(\alpha\) we must divide the estimated intercept by \(1-\phi\).

If we do not know \(\phi\), we can estimate it by first regressing \(y_i\) on \(x_i\) to obtain residuals \(r_1, r_2, \ldots, r_n\), and then estimating \(\phi\) by the correlation between \((r_i, r_{i-1})\) for \(i = 2, 3, \ldots, n\). With this estimate we can go through the steps above, so we end up with a two-stage procedure. (There are other alternatives to such a procedure, e.g. the generalised least squares sketch below, but we shall not go into detail here.)
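As an example of such an alternative, the model can be fitted directly by generalised least squares with an AR(1) correlation structure, e.g. via nlme::gls. A minimal sketch, assuming the data are collected in a (hypothetical) data frame dat with columns y, x and a time index t:

library(nlme)

# GLS with AR(1)-correlated errors; phi, alpha and beta are estimated jointly.
# 'dat' is a hypothetical data frame with columns y, x and a time index t.
fit_gls <- gls(y ~ x, data = dat, correlation = corAR1(form = ~ t))
summary(fit_gls)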

4 Simulation

Assume that \(x_i = i\).

library(ggplot2)

n <- 100
alpha <- 5
beta <- 1
sigma <- 2
phi <- 0.8

x <- 1:n
set.seed(1)
u <- rnorm(n, mean = 0, sd = sigma)

# build the AR(1) errors: e[1] = u[1] and e[i] = phi*e[i-1] + u[i]
e <- numeric(n)
e[1] <- u[1]
for (i in 2:n) {
  e[i] <- phi * e[i - 1] + u[i]
}

y <- alpha + beta*x + e
qplot(x, y) + geom_line()

m1 <- lm(y ~ x)
Coefficient summary of m1 via broom:
term         estimate  std.error
(Intercept)       5.9     0.4880
x                 1.0     0.0084
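Tables like the one above can be produced with broom; a small sketch (only the term, estimate and std.error columns are shown here, rounded):

library(broom)
tidy(m1)   # returns term, estimate, std.error, statistic and p.value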

Estimate \(\phi\) and fit a model using the estimate (m2):

r <- residuals(m1)
# estimate phi as the lag-1 correlation of the residuals
phi_est <- cor(tail(r, n - 1), head(r, n - 1))
phi_est
## [1] 0.6757454
# transformed response and covariate, using the estimated phi
z <- tail(y, n - 1) - phi_est*head(y, n - 1)
x_new <- tail(x, n - 1) - phi_est*head(x, n - 1)
m2 <- lm(z ~ x_new)
Coefficient summary of m2 via broom. Note that the (Intercept) estimates \((1-\phi)\alpha\), so it theoretically differs from the (Intercept) of m1 by a factor of \(1-\phi\).
term         estimate  std.error
(Intercept)         2      0.379
x_new               1      0.019
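Since the intercept of m2 estimates \((1-\phi)\alpha\), an estimate of \(\alpha\) itself can be backed out by dividing by \(1 - \hat\phi\); a small sketch:

# recover alpha from m2's intercept (recall the 1 - phi factor)
alpha_hat <- unname(coef(m2)[1]) / (1 - phi_est)
alpha_hat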

Assume \(\phi\) is known and fit a model using the true value of \(\phi\) (m3):
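The code fitting m3 is not shown above; a minimal sketch, reusing the transformation with the true \(\phi = 0.8\) (variable names are assumptions and overwrite the versions based on phi_est):

# same transformation as for m2, but with the true phi instead of phi_est
z <- tail(y, n - 1) - phi*head(y, n - 1)
x_new <- tail(x, n - 1) - phi*head(x, n - 1)
m3 <- lm(z ~ x_new)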

Coefficient summary of m3 via broom. Note that the (Intercept) again estimates \((1-\phi)\alpha\), so it theoretically differs from the (Intercept) of m1 by a factor of \(1-\phi\).
term         estimate  std.error
(Intercept)      1.33      0.394
x_new            0.99      0.032

We can do this many times (\(M = 100\)) and inspect the estimate of \(\phi\) and the coverage of the respective confidence intervals.
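The simulation code is not shown; below is a minimal sketch of how the slope coverage could be computed, reusing n, alpha, beta, sigma, phi and x from above (M, the default 95% level and the helper slope_covers are assumptions; the intercept is handled analogously, remembering the \(1-\phi\) factor for m2 and m3):

slope_covers <- function(model, truth) {
  ci <- confint(model)[2, ]   # default 95% confidence interval for the slope
  ci[1] <= truth && truth <= ci[2]
}

M <- 100
cover <- matrix(NA, nrow = M, ncol = 3,
                dimnames = list(NULL, c("m1", "m2", "m3")))

for (m in seq_len(M)) {
  # simulate a new data set with AR(1) errors
  u <- rnorm(n, mean = 0, sd = sigma)
  e <- numeric(n)
  e[1] <- u[1]
  for (i in 2:n) e[i] <- phi * e[i - 1] + u[i]
  y_sim <- alpha + beta * x + e

  # m1: ordinary regression, ignoring the correlation
  m1_sim <- lm(y_sim ~ x)

  # m2: two-stage procedure with estimated phi
  r_sim <- residuals(m1_sim)
  phi_hat <- cor(tail(r_sim, n - 1), head(r_sim, n - 1))
  m2_sim <- lm(I(tail(y_sim, n - 1) - phi_hat * head(y_sim, n - 1)) ~
                 I(tail(x, n - 1) - phi_hat * head(x, n - 1)))

  # m3: same transformation with the true phi
  m3_sim <- lm(I(tail(y_sim, n - 1) - phi * head(y_sim, n - 1)) ~
                 I(tail(x, n - 1) - phi * head(x, n - 1)))

  cover[m, ] <- c(slope_covers(m1_sim, beta),
                  slope_covers(m2_sim, beta),
                  slope_covers(m3_sim, beta))
}

colMeans(cover)   # empirical coverage of the slope confidence intervals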

4.1 Intercept (\(\alpha\))

method          coverage
m1 (i.i.d.)         0.41
m2 (est. phi)       0.73
m3 (known phi)      0.88

4.2 Slope (\(\beta\))

method          coverage
m1 (i.i.d.)         0.44
m2 (est. phi)       0.85
m3 (known phi)      0.93