Chapter3-Section6-Linear-Regression-Lab

Linear Regression Lab (3.6)

This is a Python-based walkthrough of the lab in Section 3.6 of ISL. You should absolutely read the text, and consult this document for how to do the same things in Python.

There are several major topics in this lab:

  • Linear regression: fitting, prediction, confidence intervals.
  • Quality of fit: fitted values, residuals, graphical examination.
  • Multiple regression, including $R^2$ and VIF.
  • Regression with cross-terms and powers.
  • Minor but important idea: ANOVA to decide which set of predictors to use.
  • Regression with categorical / qualitative variables.

The biggest difference between the use of R and Python is that R gives you a quick graphical summary of every regression with just one command. I have not written a built-in equivalent for Python, but it can be done: see the appendix at the end for a first attempt.

In [2]:
# !pip install seaborn==0.9.0
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

import warnings
warnings.simplefilter('ignore',FutureWarning)

3.6.1 Load Libraries and Data

In [69]:
import sklearn.datasets
boston_all = sklearn.datasets.load_boston()
boston = pd.DataFrame(boston_all.data, 
                      columns=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
                               'ptratio', 'black', 'lstat'])
boston_independent = boston.columns
boston['medv'] = boston_all.target
boston_dependent = ['medv']
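
A side note, not part of the original lab: load_boston has been removed from recent scikit-learn releases. If it is unavailable, one assumed alternative is to read the MASS version of the data from the Rdatasets mirror this document uses later for Carseats (I am assuming the URL pattern and that the lowercase column names match):

In [ ]:
# Hedged alternative if sklearn.datasets.load_boston is unavailable.
# boston = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/Boston.csv',
#                      index_col=0)
# boston_independent = boston.columns.drop('medv')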

3.6.2a Simple Linear Regression

In [4]:
boston.columns
Out[4]:
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat', 'medv'],
      dtype='object')

The boston_all.DESCR field contains the description of columns.

In [5]:
#print(boston_all.DESCR) # commented out so I do not see all of the fields every time I run
In [6]:
lm_fit = smf.ols('medv ~ lstat', data=boston).fit()

In the book they write about "attaching" a data set. As far as I know there is no parallel in Python.

In [7]:
lm_fit.summary()
Out[7]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.544
Model: OLS Adj. R-squared: 0.543
Method: Least Squares F-statistic: 601.6
Date: Fri, 05 Oct 2018 Prob (F-statistic): 5.08e-88
Time: 20:01:07 Log-Likelihood: -1641.5
No. Observations: 506 AIC: 3287.
Df Residuals: 504 BIC: 3295.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 34.5538 0.563 61.415 0.000 33.448 35.659
lstat -0.9500 0.039 -24.528 0.000 -1.026 -0.874
Omnibus: 137.043 Durbin-Watson: 0.892
Prob(Omnibus): 0.000 Jarque-Bera (JB): 291.373
Skew: 1.453 Prob(JB): 5.36e-64
Kurtosis: 5.319 Cond. No. 29.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Notice that the $P>|t|$ column is not shown in scientific notation, so a p-value like $10^{-16}$ renders as $0.000$.
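
You can check by pulling the exact p-values off the fit object:

In [ ]:
# Exact (unrounded) p-values for the coefficients.
lm_fit.pvalues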

The list of attributes and methods on a fit object (from dir) is really long, so I'm not printing it out. The ones whose names start with underscores are "internal".

In [8]:
#dir(lm_fit)
In [9]:
lm_fit.params
Out[9]:
Intercept    34.553841
lstat        -0.950049
dtype: float64

Confidence intervals for the fitted coefficients are available with the conf_int method. By default this gives 95% confidence intervals (2.5% on each side, symmetric).

In [10]:
lm_fit.conf_int()
Out[10]:
0 1
Intercept 33.448457 35.659225
lstat -1.026148 -0.873951
In [11]:
x_for_predict = np.array([5,10,15])
x_df = pd.DataFrame(x_for_predict, columns=["lstat"]);
lm_fit.predict(x_df)
Out[11]:
0    29.803594
1    25.053347
2    20.303101
dtype: float64

If you want all of the prediction information, get a prediction object and look at its summary frame. mean_ci is the confidence interval for the mean response at each x; obs_ci is the (wider) prediction interval for a single new observation.

In [12]:
predict_df = lm_fit.get_prediction(x_df)
predict_df.summary_frame()
Out[12]:
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 29.803594 0.405247 29.007412 30.599776 17.565675 42.041513
1 25.053347 0.294814 24.474132 25.632563 12.827626 37.279068
2 20.303101 0.290893 19.731588 20.874613 8.077742 32.528459

Notes on statistics vocabulary

A population parameter is the "true value" of an underlying trait of the population. Example: the population's average height is a population parameter. You give an estimate of the average height along with a confidence interval.

The term 95% confidence interval means that if the process were repeated with randomly sampled data, 95% of the resulting intervals would contain the population parameter.

It is tempting to also read this as "there is a 95% chance that this particular interval contains the true value," but strictly speaking the 95% describes the procedure (how often intervals built this way capture the parameter), not any single realized interval.

Warning: the confidence interval gives no information about the spread of the data itself. With enough samples the confidence interval shrinks no matter what the underlying distribution looks like.

The prediction interval (obs_ci in the table above) is the interval that is 95% likely to contain a new observation at that x, accounting for both the uncertainty in the fitted line and the error variance around it.

There are decent explanations of all these terms online, including the tolerance interval, where you are 95% confident that 99% of the population lies within the interval.

Seaborn's linear model plot (lmplot) does the regression fitting for you.

In [13]:
sns.lmplot(x='lstat', y='medv', data=boston);
In [14]:
# Joint plot is a little fancier, shows the distributions of each axis as well.
# sns.jointplot(x='lstat', y='medv', kind='reg', data=boston);

The lab explains how to draw additional lines of any slope on a graph. If that's what you are doing, you should probably work directly with Matplotlib instead of Seaborn; a sketch follows. If you want different shaped markers, use the scatterplot option markers.
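
A minimal sketch of the Matplotlib route (my own, not from the lab): scatter the data and overlay lines with chosen intercepts and slopes. The first (intercept, slope) pair matches the coefficients fitted above; the second is made up for illustration.

In [ ]:
# Scatter plot plus arbitrary straight lines on top (intercept + slope * x).
fig, ax = plt.subplots()
ax.scatter(boston['lstat'], boston['medv'], s=10, marker='+')
xs = np.linspace(boston['lstat'].min(), boston['lstat'].max(), 2)
for intercept, slope in [(34.55, -0.95), (30.0, -0.5)]:
    ax.plot(xs, intercept + slope * xs, label="{} + {} * lstat".format(intercept, slope))
ax.set_xlabel('lstat'); ax.set_ylabel('medv'); ax.legend();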

3.6.2b Statistical Summary Plots

These plots are probably copy-and-paste jobs. I will try to show the simplest way to get each (from statsmodels) as well as linking the more "beautiful" ones I found. There is a good reference for making R-like plots in Python; it shows where to find all of the data you need. (Random note: Seaborn could use another level of abstraction so you could provide the results from your own regression to the residual-plotting function.)

I. Residuals and predicted (or fitted) values

The model predicts a y value for each x value in the data set; that predicted value is called the fitted value, written $\hat{y}$.

The difference between a point's y value and the y-value predicted for that point's given x is called the residual. (Residual is $y - \hat{y}$.)

Both of these are very commonly used. You should know how to get them out of the linear model (fittedvalues and resid).

Option 1: Full Seaborn

Seaborn will just do the regression and everything for you.

In [28]:
sns.residplot(x='lstat', y='medv', data=boston);

Option 2: Linear Model

Use fitted values and residuals from the linear model (created by ols at the very start of this document).

In [29]:
x_predict = lm_fit.fittedvalues
y_residuals = lm_fit.resid
In [30]:
fig, ax = plt.subplots()
sns.scatterplot(x=x_predict, y=y_residuals, ax=ax);
ax.set_ylabel("residuals");
ax.set_xlabel("fitted values");
ax.set_title("Residuals vs fitted values");

II. Q-Q plot (Quantile-Quantile)

The idea behind a Q-Q plot is to plot the quantiles of the residuals against the quantiles of a normal distribution. If the residuals are (approximately) normally distributed, the points fall close to a straight line.

In other words, the Q-Q plot shows how close the residuals are to a normal distribution. The plot below shows a poor fit for the given model: percent lower-status families is not linearly related to median home value, and the residuals are clearly non-normal.

Best Option: StatsModels ProbPlot

Just use the StatsModels qqplot function.

In [437]:
sm.qqplot(y_residuals,line='q');

Scale-Location Plot

Just look this one up if you ever need to scan for heteroskedasticity... or copy the code / plot from the appendix at the end.

Leverage Plot

What is leverage?

The working assumption is that the errors at every observation are independent and have the same variance. However, once you fit a linear regression through the data, the variance of the residual at each point is not the same.

Maybe we can do the math and maybe not.

Fact for now: $\hat{y} = Hy$, where $H$ is the "hat matrix" that maps the observed responses to the fitted values of the linear regression. The diagonal entries $h_{ii}$ (the leverages) tell how the variance of the residual at observation $i$ relates to the variance of the true errors: $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$, where $e_i$ is the residual at observation $i$.

Studentized residuals compensate for this differing variance by dividing each residual by $\hat{\sigma}\sqrt{1-h_{ii}}$.
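
A quick check of these two facts (reusing the imports at the top and the simple lm_fit): the hand-computed hat-matrix diagonal matches statsmodels' leverage, and the internally studentized residuals are just the residuals divided by $\hat{\sigma}\sqrt{1-h_{ii}}$.

In [ ]:
# Verify the leverage and studentization formulas against statsmodels' influence object.
X = sm.add_constant(boston['lstat'])                     # design matrix with intercept
H = X.values @ np.linalg.inv(X.values.T @ X.values) @ X.values.T
leverage = np.diag(H)                                    # h_ii

infl = lm_fit.get_influence()
print(np.allclose(leverage, infl.hat_matrix_diag))       # should print True

sigma_hat = np.sqrt(lm_fit.mse_resid)
student = lm_fit.resid / (sigma_hat * np.sqrt(1 - leverage))
print(np.allclose(student, infl.resid_studentized_internal))   # should print True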

Influence plot = studentized residuals vs leverage. The influence plot shows whether the residuals behave as expected and how much each point influences the resulting regression: points with both high leverage and a large studentized residual pull the fit the most.

In [26]:
fig, ax = plt.subplots(figsize=(20,20))
fig = sm.graphics.influence_plot(lm_fit, ax=ax)
plt.show()

If you need to track down the high-influence or high-leverage points by row number, read the reference blog linked at the start; it takes a bit of digging to get the information you need.

3.6.3 Multiple Linear Regression

In [33]:
lm_multiple = smf.ols('medv ~ lstat + age', data=boston).fit()
lm_multiple.summary()
Out[33]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.551
Model: OLS Adj. R-squared: 0.549
Method: Least Squares F-statistic: 309.0
Date: Fri, 05 Oct 2018 Prob (F-statistic): 2.98e-88
Time: 21:02:10 Log-Likelihood: -1637.5
No. Observations: 506 AIC: 3281.
Df Residuals: 503 BIC: 3294.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 33.2228 0.731 45.458 0.000 31.787 34.659
lstat -1.0321 0.048 -21.416 0.000 -1.127 -0.937
age 0.0345 0.012 2.826 0.005 0.011 0.059
Omnibus: 124.288 Durbin-Watson: 0.945
Prob(Omnibus): 0.000 Jarque-Bera (JB): 244.026
Skew: 1.362 Prob(JB): 1.02e-53
Kurtosis: 5.038 Cond. No. 201.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

When you want to use all of the predictors in your model, I like two choices:

  1. Separate the independent variables ($X$) from the dependent variable ($y$), and use the array-based sm.OLS interface.
  2. Make a formula by joining all of the variables you are interested in together into a string.

Info: To eliminate a few of the variables, append subtractions to the formula, like "-dis-rad". The subtractions override the additions in the formula. Both choices are sketched below.
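
For reference, option 1 looks roughly like this (a sketch, not from the lab; note that the array interface sm.OLS does not add an intercept for you), with the subtraction trick from the note above shown on the formula side:

In [ ]:
# Option 1 (sketch): array interface. Add the constant column yourself.
X = boston[boston_independent]
y = boston['medv']
print(sm.OLS(y, sm.add_constant(X)).fit().params.head())

# Formula side: '-dis-rad' removes those two predictors from the model.
smf.ols('medv ~ ' + '+'.join(boston_independent) + '-dis-rad', data=boston).fit().params.index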

In [70]:
multiple_formula_rhs = "+".join(boston_independent)
multiple_formula_rhs
Out[70]:
'crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat'
In [71]:
lm_multiple = smf.ols('medv ~ ' + multiple_formula_rhs, data=boston).fit()
lm_multiple.summary()
Out[71]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Fri, 05 Oct 2018 Prob (F-statistic): 6.95e-135
Time: 22:51:25 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 36.4911 5.104 7.149 0.000 26.462 46.520
crim -0.1072 0.033 -3.276 0.001 -0.171 -0.043
zn 0.0464 0.014 3.380 0.001 0.019 0.073
indus 0.0209 0.061 0.339 0.735 -0.100 0.142
chas 2.6886 0.862 3.120 0.002 0.996 4.381
nox -17.7958 3.821 -4.658 0.000 -25.302 -10.289
rm 3.8048 0.418 9.102 0.000 2.983 4.626
age 0.0008 0.013 0.057 0.955 -0.025 0.027
dis -1.4758 0.199 -7.398 0.000 -1.868 -1.084
rad 0.3057 0.066 4.608 0.000 0.175 0.436
tax -0.0123 0.004 -3.278 0.001 -0.020 -0.005
ptratio -0.9535 0.131 -7.287 0.000 -1.211 -0.696
black 0.0094 0.003 3.500 0.001 0.004 0.015
lstat -0.5255 0.051 -10.366 0.000 -0.625 -0.426
Omnibus: 178.029 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 782.015
Skew: 1.521 Prob(JB): 1.54e-170
Kurtosis: 8.276 Cond. No. 1.51e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [56]:
lm_multiple.rsquared
Out[56]:
0.7406077428649428

The book mentions a sequence of model-selection criteria that can be used to choose a good number of explanatory variables.

  • AIC
  • BIC
  • Adjusted $R^2$
  • $C_p$: Not included in the OLS results.

We will learn more about them in Chapter 6.

In [60]:
print(lm_multiple.aic)
print(lm_multiple.bic)
print(lm_multiple.rsquared_adj)
3025.6767200074596
3084.848233377484
0.7337538824121872

Mean squared error of the residuals (mse_resid) = residual sum of squares divided by the "residual degrees of freedom" ($n - p - 1$). Note that this is not the MSE from the text (where you always divide by the number of observations).

In [59]:
lm_multiple.mse_resid
Out[59]:
22.52088675640218
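
As a quick check, mse_resid is just ssr divided by df_resid. And although $C_p$ is not reported, it can be computed by hand from the formula in the text (Chapter 6); taking $\hat{\sigma}^2$ from this full model's mse_resid is my own choice here, not code from the book.

In [ ]:
# mse_resid = residual sum of squares / residual degrees of freedom.
print(lm_multiple.ssr / lm_multiple.df_resid)          # matches mse_resid above

# Mallows' C_p as defined in ISL Ch. 6: C_p = (RSS + 2*d*sigma_hat^2) / n.
n, d = lm_multiple.nobs, lm_multiple.df_model
(lm_multiple.ssr + 2 * d * lm_multiple.mse_resid) / n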

The text mentions $VIF$, the variance inflation factor.

In [77]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

vif(boston[boston_independent].values,1)
Out[77]:
2.8438903527570782

DANGER: Mistake mistake mistake!! Keep reading.

In [122]:
# WRONG WAY!! DO NOT COPY!!
vif_data = [vif(boston[boston_independent].dropna().values, idx) for idx in range(len(boston_independent))]
vif_df = pd.DataFrame(vif_data, index = boston_independent)
vif_df = vif_df.transpose()
# I did not have any luck getting the values to come out as named columns directly
vif_df
Out[122]:
crim zn indus chas nox rm age dis rad tax ptratio black lstat
0 2.074626 2.84389 14.484283 1.152891 73.902212 77.934969 21.386774 14.699368 15.154742 61.226929 85.027314 20.066007 11.088865

If you compare, these numbers do not agree with the ones reported by the R package; the values for indus and nox are especially far off. The reason is explained next.

Solution: The "design matrix" needs to include a constant term. This would have been done automatically if we were using a design matrix generator like the statsmodels.formula API (which uses the patsy package internally). Lots of gotchas!

TODO: Revise this section so you generate the formulas through the API.
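
A sketch of what that revision might look like: let patsy (which the formula API uses internally) build the design matrix with dmatrices, so the Intercept column is included automatically, and compute one VIF per column.

In [ ]:
# Sketch: patsy builds the design matrix (with its Intercept column) from the formula.
import patsy
y_mat, X_mat = patsy.dmatrices('medv ~ ' + multiple_formula_rhs, data=boston,
                               return_type='dataframe')
vifs = pd.Series([vif(X_mat.values, i) for i in range(X_mat.shape[1])],
                 index=X_mat.columns)
vifs.drop('Intercept')   # the intercept's own VIF is not meaningful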

In [142]:
bconstant = boston.copy()
bconstant['constant'] = 1
bc_indices = boston_independent.append(pd.Index(['constant']))
vif_data = [vif(bconstant[bc_indices].dropna().values, idx) for idx in range(len(bc_indices))]
vif_df = pd.DataFrame([vif_data], columns = bc_indices)
# Notice the list-of-lists input, so the data is understood as one row (1 x N) rather than one column
vif_df
Out[142]:
crim zn indus chas nox rm age dis rad tax ptratio black lstat constant
0 1.773321 2.298641 3.991194 1.073943 4.395064 1.934161 3.10086 3.956551 7.480539 9.008472 1.79922 1.345832 2.938127 585.42521

R has an "update" function that lets you modify the formula but keep the same other inputs to a regression. That does not seem like an important feature, and it does not make sense the way StatsModels works, so just run the regression again!

3.6.4 Interaction Terms

In [143]:
smf.ols('medv ~ lstat * age', data=boston).fit().summary()
Out[143]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.556
Model: OLS Adj. R-squared: 0.553
Method: Least Squares F-statistic: 209.3
Date: Sat, 06 Oct 2018 Prob (F-statistic): 4.86e-88
Time: 08:37:21 Log-Likelihood: -1635.0
No. Observations: 506 AIC: 3278.
Df Residuals: 502 BIC: 3295.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 36.0885 1.470 24.553 0.000 33.201 38.976
lstat -1.3921 0.167 -8.313 0.000 -1.721 -1.063
age -0.0007 0.020 -0.036 0.971 -0.040 0.038
lstat:age 0.0042 0.002 2.244 0.025 0.001 0.008
Omnibus: 135.601 Durbin-Watson: 0.965
Prob(Omnibus): 0.000 Jarque-Bera (JB): 296.955
Skew: 1.417 Prob(JB): 3.29e-65
Kurtosis: 5.461 Cond. No. 6.88e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.88e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
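
Note that, as in R, lstat * age expands to the main effects plus their interaction, while lstat:age alone gives just the cross-term. A quick check (my own sketch, not part of the lab):

In [ ]:
# '*' is shorthand for 'lstat + age + lstat:age'; these two fits are the same model.
full = smf.ols('medv ~ lstat * age', data=boston).fit()
expanded = smf.ols('medv ~ lstat + age + lstat:age', data=boston).fit()
(full.params - expanded.params).abs().max()   # essentially zero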

3.6.5 Non-linear transformations of the predictors

In StatsModels, you get the second power of lstat in a regression by writing I(lstat**2).

R-compatible behavior in StatsModels/patsy is unexpected here and hence annoying. Patsy says it supports ** as an exponentiation operator, but it's complicated. In practice you can (and need to!) write almost exactly the same kind of formula as in R, including the I(), so I(var**2). Except that in R you would write I(var^2) ... ugh.

  • Do use I(var**power) for powers.
  • Do not use ^; in Python that is the bitwise XOR operator, not exponentiation.
  • You do not automatically get lower powers! (See below for polynomials.)
In [169]:
lm_fit2 = smf.ols('medv ~ 1 + lstat + I(lstat**2)', data=boston).fit()
lm_fit2.summary()
Out[169]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.641
Model: OLS Adj. R-squared: 0.639
Method: Least Squares F-statistic: 448.5
Date: Sat, 06 Oct 2018 Prob (F-statistic): 1.56e-112
Time: 09:00:54 Log-Likelihood: -1581.3
No. Observations: 506 AIC: 3169.
Df Residuals: 503 BIC: 3181.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 42.8620 0.872 49.149 0.000 41.149 44.575
lstat -2.3328 0.124 -18.843 0.000 -2.576 -2.090
I(lstat ** 2) 0.0435 0.004 11.628 0.000 0.036 0.051
Omnibus: 107.006 Durbin-Watson: 0.921
Prob(Omnibus): 0.000 Jarque-Bera (JB): 228.388
Skew: 1.128 Prob(JB): 2.55e-50
Kurtosis: 5.397 Cond. No. 1.13e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Analysis of variance performs a statistical test to quantify how much better the second model is than the first one. (It also works with more than two models.) You could spend quite a bit of time reading about how this works if you wanted to.

Analysis of variance technical note: give the models in order of increasing complexity; the comparison is not symmetric. The warnings you see below are fine, and where you see NaN in the table, R just leaves blanks.

In [176]:
lm_fit = smf.ols('medv ~ lstat', data=boston).fit()
sm.stats.anova_lm(lm_fit, lm_fit2)
/usr/local/miniconda3/envs/bigconda/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/usr/local/miniconda3/envs/bigconda/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/usr/local/miniconda3/envs/bigconda/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
Out[176]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 504.0 19472.381418 0.0 NaN NaN NaN
1 503.0 15347.243158 1.0 4125.13826 135.199822 7.630116e-28

Read the book - a p-value on the order of $10^{-28}$ means it is virtually certain that the quadratic model is better than the linear one.

There is no poly function in StatsModels / patsy, and you cannot cheat by exponentiating (1 + var), so write one yourself.

In [206]:
def patsy_poly(var,n):
    """Return a string: var+I(var**2)+I(var**3)+...+I(var**n)."""
    if n == 0:
        return "1"  # degree 0: intercept-only model
    elif n==1:
        return var
    else:
        return  patsy_poly(var,n-1) + "+I({}**{})".format(var,n)
In [196]:
#import patsy
#patsy.dmatrix('~' + patsy_poly('lstat',3), boston)
In [200]:
smf.ols("medv ~ " + patsy_poly('lstat',5), data=boston).fit().summary()
Out[200]:
OLS Regression Results
Dep. Variable: medv R-squared: 0.682
Model: OLS Adj. R-squared: 0.679
Method: Least Squares F-statistic: 214.2
Date: Sat, 06 Oct 2018 Prob (F-statistic): 8.73e-122
Time: 09:22:13 Log-Likelihood: -1550.6
No. Observations: 506 AIC: 3113.
Df Residuals: 500 BIC: 3139.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 67.6997 3.604 18.783 0.000 60.618 74.781
lstat -11.9911 1.526 -7.859 0.000 -14.989 -8.994
I(lstat ** 2) 1.2728 0.223 5.703 0.000 0.834 1.711
I(lstat ** 3) -0.0683 0.014 -4.747 0.000 -0.097 -0.040
I(lstat ** 4) 0.0017 0.000 4.143 0.000 0.001 0.003
I(lstat ** 5) -1.632e-05 4.42e-06 -3.692 0.000 -2.5e-05 -7.63e-06
Omnibus: 144.085 Durbin-Watson: 0.987
Prob(Omnibus): 0.000 Jarque-Bera (JB): 494.545
Skew: 1.292 Prob(JB): 4.08e-108
Kurtosis: 7.096 Cond. No. 1.37e+08


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

Let's use anova to analyze a sequence of models where you keep adding powers of lstat: polynomials from degree 0 (intercept only) up to degree 7.

In [207]:
manymodels = \
[ smf.ols("medv ~ " + patsy_poly('lstat',k), data=boston).fit()
  for k in range(8) ]
sm.stats.anova_lm(*manymodels)
/usr/local/miniconda3/envs/bigconda/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/usr/local/miniconda3/envs/bigconda/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/usr/local/miniconda3/envs/bigconda/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
Out[207]:
df_resid ssr df_diff ss_diff F Pr(>F)
0 505.0 42716.295415 0.0 NaN NaN NaN
1 504.0 19472.381418 1.0 23243.913997 854.221354 1.427644e-110
2 503.0 15347.243158 1.0 4125.138260 151.600165 1.236308e-30
3 502.0 14615.481262 1.0 731.761897 26.892486 3.124916e-07
4 501.0 13967.690615 1.0 647.790646 23.806516 1.433576e-06
5 500.0 13597.035027 1.0 370.655588 13.621713 2.481525e-04
6 499.0 13554.671158 1.0 42.363869 1.556886 2.127070e-01
7 498.0 13550.901204 1.0 3.769954 0.138547 7.098881e-01

Notice that model 6 (the degree-6 polynomial) is not a significant improvement over the degree-5 polynomial: the p-value is about 0.21, meaning an improvement of this size would arise by chance about 21% of the time.

3.6.6 Qualitative Predictors

In [214]:
carseats = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/ISLR/Carseats.csv');
carseats.drop(columns='Unnamed: 0', inplace=True);
In [215]:
carseats.head()
Out[215]:
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
1 11.22 111 48 16 260 83 Good 65 10 Yes Yes
2 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
3 7.40 117 100 4 466 97 Medium 55 14 Yes Yes
4 4.15 141 64 3 340 128 Bad 38 13 Yes No
In [218]:
carseats_independent = set(carseats.columns) - set(['Sales'])
carseats_dot = '+'.join(carseats_independent)
In [220]:
smf.ols( 'Sales ~ ' + carseats_dot + " + Income:Advertising + Price:Age", data=carseats).fit().summary()
Out[220]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.876
Model: OLS Adj. R-squared: 0.872
Method: Least Squares F-statistic: 210.0
Date: Sat, 06 Oct 2018 Prob (F-statistic): 6.14e-166
Time: 09:39:31 Log-Likelihood: -564.67
No. Observations: 400 AIC: 1157.
Df Residuals: 386 BIC: 1213.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.5756 1.009 6.519 0.000 4.592 8.559
Urban[T.Yes] 0.1402 0.112 1.247 0.213 -0.081 0.361
US[T.Yes] -0.1576 0.149 -1.058 0.291 -0.450 0.135
ShelveLoc[T.Good] 4.8487 0.153 31.724 0.000 4.548 5.149
ShelveLoc[T.Medium] 1.9533 0.126 15.531 0.000 1.706 2.201
CompPrice 0.0929 0.004 22.567 0.000 0.085 0.101
Income 0.0109 0.003 4.183 0.000 0.006 0.016
Education -0.0209 0.020 -1.063 0.288 -0.059 0.018
Advertising 0.0702 0.023 3.107 0.002 0.026 0.115
Population 0.0002 0.000 0.433 0.665 -0.001 0.001
Price -0.1008 0.007 -13.549 0.000 -0.115 -0.086
Age -0.0579 0.016 -3.633 0.000 -0.089 -0.027
Income:Advertising 0.0008 0.000 2.698 0.007 0.000 0.001
Price:Age 0.0001 0.000 0.801 0.424 -0.000 0.000
Omnibus: 1.281 Durbin-Watson: 2.047
Prob(Omnibus): 0.527 Jarque-Bera (JB): 1.147
Skew: 0.129 Prob(JB): 0.564
Kurtosis: 3.050 Cond. No. 1.31e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.31e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Categorical variables take only a few values. Patsy encodes each one as a set of 0/1 dummy variables, one fewer than the number of levels. (See the book.)

What R calls contrasts show up as variables like "ShelveLoc[T.Good]". Read this one as "the treatment-coded dummy that equals 1 when ShelveLoc is Good"; the level that does not appear (Bad) is the baseline.
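
A quick way to inspect the coding (a sketch using patsy directly):

In [ ]:
# Sketch: see how patsy dummy-codes ShelveLoc: one column per non-baseline level.
import patsy
patsy.dmatrix('ShelveLoc', data=carseats).design_info.column_names
# expect ['Intercept', 'ShelveLoc[T.Good]', 'ShelveLoc[T.Medium]']; 'Bad' is the baseline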

3.6.7 Writing Functions

This guide is written on the assumption that you have some basic knowledge of programming in Python. That means you already know how to write functions.

Notes

Hat Matrix

It would make sense to me to keep the hat matrix after any linear regression.

  • the leverage of each point ($h_{ii}$)
  • "externally studentized" residuals, meaning the standard deviation is estimated with the point in question left out (Wikipedia notes); a short example follows this list
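
A minimal sketch of where these live in statsmodels (reusing lm_fit from above; attribute names as I understand them):

In [ ]:
# The influence object exposes leverage and both flavors of studentized residuals.
infl = lm_fit.get_influence()
print(infl.hat_matrix_diag[:3])               # leverage h_ii
print(infl.resid_studentized_internal[:3])    # scaled by the overall sigma estimate
print(infl.resid_studentized_external[:3])    # sigma re-estimated leaving each point out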

Data sources

"Datasets distributed with R" is a collection of data placed on github by a U of M professor (the same Rdatasets site used above for the Carseats data). We will get our data from there from now on!

Appendix: R-style regression summary plots

Need to produce four plots:

  1. Residuals vs Fitted values with lowess
  2. Q-Q plot: standardized residuals vs normal quantiles, with a best-fit line
  3. Sqrt(abs(standardized residuals)) vs fitted values
  4. Standardized residuals vs leverage, with lowess and Cook's distance

This should mostly be a copy of Emre Can's GSOC project to make R graphs in Python.

In [423]:
def plot_add_label(ax, xlabel=None, ylabel=None, title=None):
    if ax is None: return ax
    if xlabel: ax.set_xlabel(xlabel);
    if ylabel: ax.set_ylabel(ylabel);
    if title: ax.set_title(title);
    return ax

def plot_add_lowess(x, y, ax, color='red'):
    if ax is None: raise RuntimeError('Lowess needs axes to work with!')
    smooth_data = sm.nonparametric.lowess(endog=y, exog=x)
    fig = sns.lineplot(x=smooth_data[:,0], y=smooth_data[:,1], color=color, ax=ax)
    return fig

def plot_1_resid_vs_fitted(lm_fit, 
                           ax = None, 
                           xlabel="fitted values",
                           ylabel="residuals",
                           title="Residuals vs fitted values"):
    if ax is None: ax = plt.gca()
    x = lm_fit.fittedvalues
    y = lm_fit.resid
    fig = sns.scatterplot(x=x, y=y, ax=ax);
    xrange = ax.get_xlim()
    zeros = np.zeros(2)
    sns.lineplot(x=xrange, y=zeros, color='gray', style=zeros, dashes=[[2,4]], legend=False, ax=ax);
    plot_add_lowess(x,y,ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
In [424]:
plot_1_resid_vs_fitted(lm_fit);
In [425]:
def plot_2_residuals_vs_normal (lm_fit, 
                                ax = None, 
                                xlabel="Theoretical Quantiles",
                                ylabel="Standardized Residuals",
                                title="Normal Q-Q"):
    y_residuals = lm_fit.get_influence().resid_studentized_internal
    if ax is None: ax = plt.gca()
    fig = sm.qqplot(y_residuals, line='q', ax=ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
In [426]:
plot_2_residuals_vs_normal(lm_fit);
In [427]:
def plot_3_std_resid_vs_fitted(lm_fit,
                               ax = None, 
                               xlabel="fitted values",
                               ylabel=r"$\sqrt{|\mathrm{Standardized\ residuals}|}$",
                               title="Scale-Location"):
    if ax is None: ax = plt.gca()
    x = lm_fit.fittedvalues
    # The scale-location plot uses sqrt(|standardized residuals|) on the y axis.
    y = np.sqrt(np.abs(lm_fit.get_influence().resid_studentized_internal))
    fig = sns.scatterplot(x=x, y=y, ax=ax)
    plot_add_lowess(x, y, ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
In [428]:
plot_3_std_resid_vs_fitted(lm_fit);
In [429]:
def plot_4_std_resid_vs_leverage(lm_fit,
                                 ax = None, 
                                 xlabel="Leverage",
                                 ylabel="Standardized residuals",
                                 title="Residuals vs Leverage"):
    if ax is None: ax = plt.gca()
    x = lm_fit.get_influence().hat_matrix_diag  # leverage
    y = lm_fit.get_influence().resid_studentized_internal # studentized residuals
    fig = sns.scatterplot(x=x, y=y, ax=ax)
    plot_add_lowess(x, y, ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
In [430]:
plot_4_std_resid_vs_leverage(lm_fit);
In [435]:
def plot_lm_summary(lm_fit,
                    title="Linear Model Summary (R-style)",
                    figsize=(15,10)):
    """Produce an R-style summary of a fitted linear model.
    The linear model must be the result of a StatsModel OLS."""
    fig, ax = plt.subplots(nrows=2,ncols=2, figsize=figsize)
    plot_1_resid_vs_fitted(lm_fit, ax=ax[0,0]);
    plot_2_residuals_vs_normal(lm_fit, ax=ax[0,1]);
    plot_3_std_resid_vs_fitted(lm_fit, ax=ax[1,0]);
    plot_4_std_resid_vs_leverage(lm_fit, ax=ax[1,1]);
    if title: fig.suptitle(title, fontsize=18)
    return fig
In [436]:
plot_lm_summary(lm_fit);