Linear regression and correlation

The ASTA team

The regression problem

We want to predict the volume of a tree from its girth (diameter). The trees data set contains these measurements for 31 trees:

import pandas as pd

trees = pd.read_csv("https://asta.math.aau.dk/datasets?file=trees.txt", sep='\t')
trees.head()
##    Girth  Height  Volume
## 0    8.3      70    10.3
## 1    8.6      65    10.3
## 2    8.8      63    10.2
## 3   10.5      72    16.4
## 4   10.7      81    18.8

Initial graphics

import seaborn as sns
import matplotlib.pyplot as plt

p = sns.pairplot(trees)

Simple linear regression

p = sns.lmplot(x='Girth', y='Volume', data=trees, ci=None)

Model for linear regression

Least squares
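As a minimal sketch of the least-squares formulas, using a small made-up data set (not the trees data): the slope is b = s_xy / s_xx and the intercept is a = ybar - b * xbar.

```python
import numpy as np

# Made-up illustration data (not the trees data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates: slope b = s_xy / s_xx, intercept a = ybar - b * xbar.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)  # a ≈ 0.05, b ≈ 1.99
```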

The prediction equation and residuals
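A sketch with made-up data (not the trees data): the prediction equation gives the fitted values, and the residuals are the observed values minus the predicted ones.

```python
import numpy as np

# Made-up illustration data; a and b are the least-squares estimates.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x   # prediction equation evaluated at the observed x values
e = y - y_hat       # residuals: observed minus predicted
print(e.sum())      # residuals from a least-squares fit sum to (numerically) zero
```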

Estimation of conditional standard deviation
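A numerical sketch of the estimate s = sqrt(SSE / (n - 2)) of the conditional standard deviation, again with made-up data (not the trees data):

```python
import numpy as np

# Made-up illustration data (not the trees data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - (a + b * x)  # residuals

# s estimates the conditional standard deviation; we divide by n - 2
# because two parameters (a and b) were estimated from the data.
n = len(x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))
print(s)
```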

Example in Python

import numpy as np
import statsmodels.formula.api as smf

model = smf.ols('Volume ~ Girth', data=trees).fit()
model.summary(slim=True)  # text output
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Volume   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.933
No. Observations:                  31   F-statistic:                     419.4
Covariance Type:            nonrobust   Prob (F-statistic):           8.64e-19
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -36.9435      3.365    -10.978      0.000     -43.826     -30.061
Girth          5.0659      0.247     20.478      0.000       4.560       5.572
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[model.resid.min(), model.resid.max(), model.resid.median()]  # smallest, largest and median residual
## [np.float64(-8.065359510665438), np.float64(9.586816813996933), np.float64(0.15196668471900665)]
np.sqrt(model.mse_resid)  # residual standard error: estimate of the conditional standard deviation
## np.float64(4.2519875217291965)

Test for independence

Example

coef_table = model.summary2().tables[1] # data output
coef_table
##                Coef.  Std.Err.          t         P>|t|     [0.025     0.975]
## Intercept -36.943459  3.365145 -10.978267  7.621449e-12 -43.825953 -30.060965
## Girth       5.065856  0.247377  20.478288  8.644334e-19   4.559914   5.571799
np.sqrt(model.mse_resid)
## np.float64(4.2519875217291965)
model.df_resid
## np.float64(29.0)
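The t statistic in the table can be checked by hand: it is the estimate divided by its standard error (values copied from the coefficient table above).

```python
# Values copied from the coefficient table above.
b_hat = 5.065856   # slope estimate for Girth
se_b = 0.247377    # its standard error
t_obs = b_hat / se_b
print(t_obs)       # ≈ 20.478, matching the t column
```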

Confidence interval for slope

model.conf_int(alpha=0.05)  # 95% confidence intervals for the coefficients
##                    0          1
## Intercept -43.825953 -30.060965
## Girth       4.559914   5.571799
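The interval for the slope can be reproduced by hand as estimate ± t-quantile × standard error. This sketch assumes scipy is available for the t quantile; the numbers are copied from the coefficient table above.

```python
from scipy import stats

# Values copied from the coefficient table above; df_resid = 31 - 2 = 29.
b_hat = 5.065856
se_b = 0.247377
t_crit = stats.t.ppf(0.975, df=29)  # 97.5% quantile of the t distribution, 29 df
lower, upper = b_hat - t_crit * se_b, b_hat + t_crit * se_b
print(lower, upper)                 # ≈ (4.5599, 5.5718), matching conf_int
```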

Correlation

trees.corr()
##            Girth   Height    Volume
## Girth   1.000000  0.51928  0.967119
## Height  0.519280  1.00000  0.598250
## Volume  0.967119  0.59825  1.000000
trees[['Height', 'Girth', 'Volume']].corr()
##          Height     Girth    Volume
## Height  1.00000  0.519280  0.598250
## Girth   0.51928  1.000000  0.967119
## Volume  0.59825  0.967119  1.000000

which produces the same correlations as above, only with the rows and columns reordered.

trees['Height'].corr(trees['Volume'])
## np.float64(0.5982496519917821)
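The correlation can also be computed from first principles, as the covariance standardized by the two standard deviations; a sketch with made-up data (not the trees data):

```python
import numpy as np

# Made-up illustration data (not the trees data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r = s_xy / sqrt(s_xx * s_yy)
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(r)  # agrees with np.corrcoef(x, y)[0, 1]
```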

R-squared: Reduction in prediction error

Graphical illustration of sums of squares
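The sums of squares can also be sketched numerically: the total sum of squares TSS splits into the residual sum of squares SSE plus an explained part, and r² = (TSS − SSE) / TSS is the proportional reduction in prediction error. Made-up data again, not the trees data:

```python
import numpy as np

# Made-up illustration data (not the trees data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

TSS = np.sum((y - y.mean()) ** 2)  # total variation around the mean
SSE = np.sum((y - y_hat) ** 2)     # variation left after the regression
r2 = (TSS - SSE) / TSS             # proportional reduction in prediction error
print(r2)                          # equals the squared correlation
```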

\(r^2\): Reduction in prediction error

model.summary(slim=True)
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Volume   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.933
No. Observations:                  31   F-statistic:                     419.4
Covariance Type:            nonrobust   Prob (F-statistic):           8.64e-19
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -36.9435      3.365    -10.978      0.000     -43.826     -30.061
Girth          5.0659      0.247     20.478      0.000       4.560       5.572
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.