Multiple linear regression

The ASTA team

Multiple regression model

Multiple regression model

Example

import pandas as pd

FL = pd.read_csv("https://asta.math.aau.dk/datasets?file=fl-crime.txt", sep='\t')
FL.head(n = 3)
##    Crime  Education  Urbanisation
## 0    104       82.7          73.2
## 1     20       64.1          21.5
## 2     64       74.7          85.0
import seaborn as sns
import matplotlib.pyplot as plt

p = sns.pairplot(FL)

Correlations

FL.corr()
##                  Crime  Education  Urbanisation
## Crime         1.000000   0.466912      0.677368
## Education     0.466912   1.000000      0.790719
## Urbanisation  0.677368   0.790719      1.000000
import pingouin as pg

pg.corr(FL['Crime'], FL['Education'], method='pearson')
##           n         r         CI95%     p-val     BF10     power
## pearson  67  0.466912  [0.26, 0.64]  0.000068  357.897  0.982794
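The `CI95%` column can be checked by hand. The following is a minimal sketch, assuming the interval is based on the Fisher z-transformation with a normal quantile (an assumption about the method, not something stated in the output); `r` and `n` are copied from the table above.

```python
import math
from statistics import NormalDist

r, n = 0.466912, 67               # copied from the pg.corr output above

z = math.atanh(r)                 # Fisher z-transform of the correlation
se = 1 / math.sqrt(n - 3)         # approximate standard error of z
q = NormalDist().inv_cdf(0.975)   # normal quantile for a two-sided 95% interval

# Transform the interval for z back to the correlation scale
ci_low = math.tanh(z - q * se)
ci_high = math.tanh(z + q * se)
print(round(ci_low, 2), round(ci_high, 2))
```

Rounded to two decimals, this agrees with the reported interval [0.26, 0.64].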

Several predictors

Example

import numpy as np
import statsmodels.formula.api as smf

model = smf.ols('Crime ~ Education + Urbanisation', data=FL).fit()
model.summary(slim = True)
OLS Regression Results
Dep. Variable:     Crime        R-squared:            0.471
Model:             OLS          Adj. R-squared:       0.455
No. Observations:  67           F-statistic:          28.54
Covariance Type:   nonrobust    Prob (F-statistic):   1.38e-09

                  coef   std err        t    P>|t|    [0.025    0.975]
Intercept      59.1181    28.365    2.084    0.041     2.452   115.784
Education      -0.5834     0.472   -1.235    0.221    -1.527     0.360
Urbanisation    0.6825     0.123    5.539    0.000     0.436     0.929


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
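To see how the fitted equation is used, the sketch below computes a prediction by hand from the rounded coefficients in the summary. The helper `predict_crime` is a hypothetical name introduced only for illustration; the observation is the first row shown by `FL.head()`.

```python
# Coefficients copied from the summary above (rounded to four decimals)
intercept, b_education, b_urbanisation = 59.1181, -0.5834, 0.6825

def predict_crime(education, urbanisation):
    """Fitted crime rate for the given predictor values (illustrative helper)."""
    return intercept + b_education * education + b_urbanisation * urbanisation

# First row of the data: Crime = 104, Education = 82.7, Urbanisation = 73.2
fitted = predict_crime(82.7, 73.2)
residual = 104 - fitted           # observed minus fitted
print(f"fitted = {fitted:.2f}, residual = {residual:.2f}")
```

In practice `model.fittedvalues` and `model.resid` give the same quantities without rounding error.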

Simpson's paradox
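The sign change for Education (positive correlation with Crime on its own, negative coefficient once Urbanisation is in the model) can be illustrated with a partial correlation. This is a sketch using only the pairwise correlations reported by `FL.corr()`; the variable names are chosen here for illustration.

```python
import math

# Pairwise correlations copied from FL.corr() above
r_ce = 0.466912   # Crime vs Education
r_cu = 0.677368   # Crime vs Urbanisation
r_eu = 0.790719   # Education vs Urbanisation
n = 67

# Partial correlation of Crime and Education, controlling for Urbanisation
partial_r = (r_ce - r_cu * r_eu) / math.sqrt((1 - r_cu**2) * (1 - r_eu**2))

# Corresponding t statistic with n - 3 degrees of freedom
df = n - 3
t_stat = partial_r * math.sqrt(df / (1 - partial_r**2))
print(f"partial r = {partial_r:.3f}, t = {t_stat:.3f}")
```

The partial correlation is about -0.15, and the t statistic reproduces the t-value of about -1.23 reported for Education in the two-predictor model.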

The general model

Regression model

Interpretation of parameters

Estimation

Estimation of model

Multiple R-squared

Multiple \(R^2\)

FL['CrimePred'] = model.fittedvalues
r_val = model.rsquared**0.5
sns.regplot(x='Crime', y='CrimePred', data=FL, ci=None)

# Labels and title
plt.title(f"Correlation between predicted and observed y (r = {r_val:.2f})")
plt.xlabel("Crime")
plt.ylabel(r'$\hat{y}$')  # raw string, so the backslash in the math needs no escaping

Example

model.summary(slim = True)
OLS Regression Results
Dep. Variable:     Crime        R-squared:            0.471
Model:             OLS          Adj. R-squared:       0.455
No. Observations:  67           F-statistic:          28.54
Covariance Type:   nonrobust    Prob (F-statistic):   1.38e-09

                  coef   std err        t    P>|t|    [0.025    0.975]
Intercept      59.1181    28.365    2.084    0.041     2.452   115.784
Education      -0.5834     0.472   -1.235    0.221    -1.527     0.360
Urbanisation    0.6825     0.123    5.539    0.000     0.436     0.929


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
np.sqrt(model.mse_resid)
## np.float64(20.81558240979863)
model.df_resid
## np.float64(64.0)
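The residual degrees of freedom are what the t-values and confidence intervals in the summary are based on. As a sketch, the intervals can be rebuilt from the rounded `coef` and `std err` columns, so small discrepancies in the last digit are expected.

```python
from scipy.stats import t

df_resid = 64                       # model.df_resid from above
t_crit = t.ppf(0.975, df_resid)     # critical value for a two-sided 95% interval

# (coef, std err) copied from the summary above
estimates = {"Education": (-0.5834, 0.472), "Urbanisation": (0.6825, 0.123)}

intervals = {}
for name, (coef, se) in estimates.items():
    t_value = coef / se             # test statistic for H0: coefficient = 0
    intervals[name] = (coef - t_crit * se, coef + t_crit * se)
    low, high = intervals[name]
    print(f"{name}: t = {t_value:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With unrounded inputs, `model.conf_int()` returns the same intervals directly.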

Example

model2 = smf.ols('Crime ~ Urbanisation', data=FL).fit()
model2.summary(slim = True)
OLS Regression Results
Dep. Variable:     Crime        R-squared:            0.459
Model:             OLS          Adj. R-squared:       0.451
No. Observations:  67           F-statistic:          55.11
Covariance Type:   nonrobust    Prob (F-statistic):   3.08e-10

                  coef   std err        t    P>|t|    [0.025    0.975]
Intercept      24.5412     4.539    5.406    0.000    15.476    33.607
Urbanisation    0.5622     0.076    7.424    0.000     0.411     0.713


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
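Dropping Education lowers R-squared only from 0.471 to 0.459. The adjusted R-squared, which penalises the number of predictors, can be recomputed from the reported values as a check. The function `adjusted_r2` is a hypothetical helper written for this sketch, and the inputs are the rounded figures from the two summaries.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 67
adj_two = adjusted_r2(0.471, n, k=2)   # Crime ~ Education + Urbanisation
adj_one = adjusted_r2(0.459, n, k=1)   # Crime ~ Urbanisation
print(f"two predictors: {adj_two:.3f}, one predictor: {adj_one:.3f}")
```

Both values agree with the `Adj. R-squared` entries up to rounding of the inputs; the unrounded values are available as `model.rsquared_adj` and `model2.rsquared_adj`.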

F-test for effect of predictors

F-test

Example

from scipy.stats import f

1 - f.cdf(28.54, 2, 64)
## np.float64(1.3786117802894182e-09)
model.summary(slim = True)
OLS Regression Results
Dep. Variable:     Crime        R-squared:            0.471
Model:             OLS          Adj. R-squared:       0.455
No. Observations:  67           F-statistic:          28.54
Covariance Type:   nonrobust    Prob (F-statistic):   1.38e-09

                  coef   std err        t    P>|t|    [0.025    0.975]
Intercept      59.1181    28.365    2.084    0.041     2.452   115.784
Education      -0.5834     0.472   -1.235    0.221    -1.527     0.360
Urbanisation    0.6825     0.123    5.539    0.000     0.436     0.929


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
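The F-statistic and its degrees of freedom (2 and 64) follow from R-squared, the number of predictors, and the sample size. A minimal sketch using the rounded R-squared from the summary, so the result differs slightly from the reported 28.54:

```python
r2, n, k = 0.471, 67, 2          # R-squared, observations, predictors (from the summary)

df1 = k                          # numerator degrees of freedom
df2 = n - k - 1                  # denominator (residual) degrees of freedom

# Explained variation per predictor relative to unexplained variation per residual df
f_stat = (r2 / df1) / ((1 - r2) / df2)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

The unrounded statistic and p-value are stored on the fitted model as `model.fvalue` and `model.f_pvalue`.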

Test for interaction

Interaction between effects of predictors

model3 = smf.ols('Crime ~ Education * Urbanisation', data=FL).fit()
model3.summary(slim = True)
OLS Regression Results
Dep. Variable:     Crime        R-squared:            0.479
Model:             OLS          Adj. R-squared:       0.454
No. Observations:  67           F-statistic:          19.32
Covariance Type:   nonrobust    Prob (F-statistic):   5.37e-09

                            coef   std err        t    P>|t|    [0.025    0.975]
Intercept                19.3175    49.959    0.387    0.700   -80.517   119.152
Education                 0.0340     0.794    0.043    0.966    -1.552     1.620
Urbanisation              1.5143     0.868    1.744    0.086    -0.220     3.249
Education:Urbanisation   -0.0120     0.012   -0.968    0.337    -0.037     0.013


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.97e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
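With an interaction term, the slope for Urbanisation is no longer a single number but depends on the level of Education. The sketch below evaluates that slope from the rounded estimates; `urbanisation_slope` is a hypothetical helper, and the Education levels 60, 70 and 80 are arbitrary illustration values.

```python
# Estimates copied from the interaction model summary (rounded)
b_urbanisation = 1.5143
b_interaction = -0.0120

def urbanisation_slope(education):
    """Estimated change in Crime per unit of Urbanisation at a given Education level."""
    return b_urbanisation + b_interaction * education

for education in (60, 70, 80):
    print(f"Education = {education}: slope = {urbanisation_slope(education):.4f}")
```

The estimated slope decreases as Education increases, but the interaction term has a p-value of 0.337, so the data give no clear evidence that the slope actually varies.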