This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This is a Python-based walkthrough of the lab in Section 3.6 of ISL. You should absolutely read the text, and consult this document for how to do the same things in Python.
There are several major topics in this lab.
The biggest difference between using R and using Python is that R gives you a quick graphical summary of every regression with a single command. I have not yet written a similar command for Python, though it could be done (should I?).
# !pip install seaborn==0.9.0
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore',FutureWarning)
import sklearn.datasets
boston_all = sklearn.datasets.load_boston()
boston = pd.DataFrame(boston_all.data,
columns=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
'ptratio', 'black', 'lstat'])
boston_independent = boston.columns
boston['medv'] = boston_all.target
boston_dependent = ['medv']
boston.columns
The boston_all.DESCR field contains the description of the columns.
#print(boston_all.DESCR) # commented out so I do not see all of the fields every time I run
lm_fit = smf.ols('medv ~ lstat', data=boston).fit()
In the book they write about "attaching" a data set. As far as I know there is no parallel in Python.
lm_fit.summary()
Notice that the $P>|t|$ column is not in scientific notation, so something like $10^{-16}$ renders as $0.000$. (The exact p-values are available separately; see below.)
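To verify, pull the exact p-values off the fit; pvalues is a standard attribute of the statsmodels results object.
# Exact p-values, at full precision (not rounded like the summary table)
lm_fit.pvalues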
The directory of all possible functions for a fit object is really huge, so I'm not printing it out. The ones with underscores at the start are "internal".
#dir(lm_fit)
lm_fit.params
Confidence intervals for the coefficient estimates are available with the conf_int
method. By default it gives 95% confidence intervals (2.5% in each tail, symmetric).
lm_fit.conf_int()
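If you want a different confidence level, conf_int takes an alpha argument (alpha is the total tail probability, so 0.05 is the default 95% interval). For example, 99% intervals:
# 99% confidence intervals for the coefficients
lm_fit.conf_int(alpha=0.01)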
x_for_predict = np.array([5,10,15])
x_df = pd.DataFrame(x_for_predict, columns=["lstat"]);
lm_fit.predict(x_df)
If you want all of the prediction information, get a prediction data frame. The mean_ci columns give the confidence interval for the mean response; the obs_ci columns give the prediction interval for a new observation. (Please let me know if this is wrong!)
predict_df = lm_fit.get_prediction(x_df)
predict_df.summary_frame()
A population parameter is the "true value" of an underlying trait of the population. Example: average height is a population parameter. You give an estimate of the average height along with a confidence interval.
The term 95% confidence interval means that if the process were repeated with randomly sampled data, 95% of the resulting intervals would contain the population parameter.
It is tempting to also read this as "the given interval is 95% likely to contain the true value of the population parameter," but strictly speaking the 95% describes the repeated-sampling procedure, not any single computed interval.
Warning: the confidence interval gives no information about the spread of the data itself. With enough samples the confidence interval shrinks no matter what the underlying distribution looks like.
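To make the repeated-sampling interpretation concrete, here is a small simulation (my own sketch, not part of the ISL lab; the population mean 10 and standard deviation 3 are arbitrary): draw many samples from a known population, compute a 95% t-interval for the mean each time, and count how often the interval covers the true mean.
# Coverage simulation: roughly 95% of the intervals should contain the true mean
np.random.seed(0)
true_mean, n_reps, n = 10.0, 2000, 50
hits = 0
for _ in range(n_reps):
    sample = np.random.normal(loc=true_mean, scale=3.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = scipy.stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=se)
    hits += (lo <= true_mean <= hi)
hits / n_reps  # should be close to 0.95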
The prediction interval (the obs_ci columns in the table above) gives the interval that is 95% likely to contain a new observation, based on the estimated errors of the linear regression you fit.
There are decent explanations of these intervals elsewhere, including the tolerance interval, where you are 95% confident that 99% of the population lies within the interval.
Seaborn's linear model plot (lmplot) does the regression fitting for you.
sns.lmplot(x='lstat', y='medv', data=boston);
# Joint plot is a little fancier, shows the distributions of each axis as well.
# sns.jointplot(x='lstat', y='medv', kind='reg', data=boston);
The lab explains how to draw additional lines of any slope on a graph. If that is what you are doing, you should probably work directly with Matplotlib instead of Seaborn (see the sketch below). If you want different shaped markers, use the markers option of scatterplot.
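Here is a minimal Matplotlib sketch of that idea (my own example, not from the lab): scatter the data, draw the fitted line from lm_fit, and add an arbitrary extra line; the slope and intercept of the extra line are made up.
# Scatter plus fitted line plus an arbitrary comparison line, in plain Matplotlib
fig, ax = plt.subplots()
ax.scatter(boston['lstat'], boston['medv'], s=10)
xs = np.linspace(boston['lstat'].min(), boston['lstat'].max(), 100)
b0, b1 = lm_fit.params['Intercept'], lm_fit.params['lstat']
ax.plot(xs, b0 + b1 * xs, color='red', label='OLS fit')
ax.plot(xs, 40 - 1.5 * xs, color='green', linestyle='--', label='arbitrary line')
ax.legend();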
These plots are probably copy-and-paste jobs. I will try to show the simplest way to get each one (from statsmodels), as well as linking the more "beautiful" ones I found. There is a good reference for making R-like plots in Python; it shows where to find all of the data you need. (Random note: Seaborn could use another level of abstraction so you could hand the results of your own regression to the residual-plotting function.)
The model predicts a y value for each x value. The predicted y value is called the fitted value. There is one predicted y value for each x value in the data set. (Fitted value is $\hat{y}$.)
The difference between a point's y value and the y-value predicted for that point's given x is called the residual. (Residual is $y - \hat{y}$.)
Both of these are very commonly used. You should know how to get them out of the linear model (fittedvalues and resid).
Seaborn will just do the regression and everything for you.
sns.residplot(x='lstat', y='medv', data=boston);
Use the fitted values and residuals from the linear model (created by ols at the very start of this document).
x_predict = lm_fit.fittedvalues
y_residuals = lm_fit.resid
fig, ax = plt.subplots()
sns.scatterplot(x=x_predict, y=y_residuals, ax=ax)
ax.set_ylabel("residuals")
ax.set_xlabel("fitted values")
ax.set_title("Residuals vs fitted values");
The idea behind a Q-Q plot is to plot the quantiles of the residuals against the quantiles of a normal distribution. If the errors are normally distributed and the model fits well, you see a straight line. In other words, the Q-Q plot shows how close the residuals are to a normal distribution (strictly, studentized residuals follow a Student t distribution, which is close to normal for large samples). The plot below shows a poor fit for the model: percent of lower-status families is not linearly related to the median value of the homes.
Just use the StatsModels qqplot function.
sm.qqplot(y_residuals,line='q');
Just look this one up if you ever need to scan for "heteroskedasticity"... or copy the code / plot from the end. (A formal test is sketched below.)
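If you want a numeric check rather than eyeballing a plot, statsmodels has a Breusch-Pagan test. This is my own addition, not part of the ISL lab; a small p-value suggests heteroskedasticity.
# Breusch-Pagan test of the residuals against the model's design matrix
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(lm_fit.resid, lm_fit.model.exog)
(lm_pvalue, f_pvalue)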
What is leverage?
The assumption behind the model is that the errors at every observation are independent and have the same variance. However, once you fit a linear regression through the data, the residuals at the different observations do not all have the same variance, even if the true errors do.
Maybe we can do the math and maybe not.
Fact for now: $\hat{y} = Hy$, where $H$ is the "hat matrix" that maps the observed responses to the fitted values of the linear regression. The diagonal entries $h_{ii}$ (the leverages) tell how the variance of the residual at observation $i$ relates to the variance of the true errors: Var(residual at observation $i$) = (true error variance) $\times (1-h_{ii})$.
Studentizing the residuals compensates for the differing variances at each point by dividing each residual by $\hat{\sigma}\sqrt{1-h_{ii}}$.
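As a sanity check on these formulas (my own sketch, not from the lab), build the hat matrix for the medv ~ lstat fit by hand and compare its diagonal, and the resulting studentized residuals, with what statsmodels reports.
# Hand-built hat matrix H = X (X'X)^{-1} X' vs statsmodels' influence measures
X = sm.add_constant(boston['lstat']).values      # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix: y_hat = H y
h_ii = np.diag(H)                                # leverages
influence = lm_fit.get_influence()
print(np.allclose(h_ii, influence.hat_matrix_diag))
sigma_hat = np.sqrt(lm_fit.mse_resid)
studentized = lm_fit.resid / (sigma_hat * np.sqrt(1 - h_ii))
print(np.allclose(studentized, influence.resid_studentized_internal))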
Influence plot = studentized residuals vs leverage. The influence plot shows whether the residuals are behaving as expected and how much each point is influencing the resulting regression.
fig, ax = plt.subplots(figsize=(20,20))
fig = sm.graphics.influence_plot(lm_fit, ax=ax)
plt.show()
If you need to track down the high-influence or high-leverage points by row number, read the resource blog mentioned at the start; it takes a bit of technical work to get the information you need.
lm_multiple = smf.ols('medv ~ lstat + age', data=boston).fit()
lm_multiple.summary()
When you want to use all of the predictors in your model, I like two choices: build the right-hand side of the formula by joining the column names (done below), or pass the design matrices directly to the sm.OLS method. Info: To eliminate a few of the variables, append subtractions to the formula, like "-dis-rad". The subtractions will override the additions in the formula (see the example after the full-model fit below).
multiple_formula_rhs = "+".join(boston_independent)
multiple_formula_rhs
lm_multiple = smf.ols('medv ~ ' + multiple_formula_rhs, data=boston).fit()
lm_multiple.summary()
lm_multiple.rsquared
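Here is the subtraction trick in action (a sketch; dropping age and indus is just an example choice):
# Full model minus two predictors; the "- age - indus" terms override the "+" terms
lm_minus = smf.ols('medv ~ ' + multiple_formula_rhs + ' - age - indus',
                   data=boston).fit()
lm_minus.params.index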
The book mentions a sequence of error estimators that can be used to select a good number of explanatory variables to use.
We will learn more about them in Chapter 6.
print(lm_multiple.aic)
print(lm_multiple.bic)
print(lm_multiple.rsquared_adj)
Mean squared error of the residuals = residual sum of squares divided by the residual degrees of freedom. Note that this is not the MSE from the text (where you always divide by the number of observations).
lm_multiple.mse_resid
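A quick check of that definition (my own sketch):
# mse_resid should equal the residual sum of squares over the residual degrees of freedom
np.isclose(lm_multiple.mse_resid,
           (lm_multiple.resid**2).sum() / lm_multiple.df_resid)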
The text mentions $VIF$, the variance inflation factor.
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
vif(boston[boston_independent].values,1)
DANGER: Mistake mistake mistake!! Keep reading.
# WRONG WAY!! DO NOT COPY!!
vif_data = [vif(boston[boston_independent].dropna().values, idx) for idx in range(len(boston_independent))]
vif_df = pd.DataFrame(vif_data, index = boston_independent)
vif_df = vif_df.transpose()
# I did not have any luck getting these into columns directly
vif_df
If you look, you will see the numbers do not agree with the ones from the R package... I am assuming the method used is slightly different. The values for indus and nox are really different... better check on this.
Solution: The "design matrix" needs to include a constant term. This would have been done automatically if we were using a design-matrix generator like the statsmodels.formula API (which uses the patsy package internally). Lots of gotchas!
TODO: Revise this section so you generate the formulas through the API.
bconstant = boston.copy()
bconstant['constant'] = 1
bc_indices = boston_independent.append(pd.Index(['constant']))
vif_data = [vif(bconstant[bc_indices].dropna().values, idx) for idx in range(len(bc_indices))]
vif_df = pd.DataFrame([vif_data], columns = bc_indices)
# Notice the list-of-lists input, so the data is understood as a single row of a 1xN frame
vif_df
R has an "update" function that lets you modify the formula but keep the other inputs to a regression the same. That does not seem like an important feature, and it does not fit the way StatsModels works, so just run the regression again!
smf.ols('medv ~ lstat * age', data=boston).fit().summary()
In StatsModels, you get the second power of lstat in a regression by writing I(lstat**2).
The R-compatible behavior in StatsModels/patsy is unexpected and hence annoying. Patsy says it supports ** as the exponentiation operator, but it's complicated. Really you can (need to!) write almost exactly the same kind of formula as in R, including the I(), so I(var**2). Except that in R you write I(var^2)... ugh.
Summary: write I(var**power) for powers; do not use ^, which in Python is the bitwise XOR operator.
lm_fit2 = smf.ols('medv ~ 1 + lstat + I(lstat**2)', data=boston).fit()
lm_fit2.summary()
Analysis of variance performs a statistical test to quantify how much better the second model is than the first one (and it works with longer sequences of models too). You could spend quite a bit of time reading about how this works if you wanted to.
Analysis of variance technical note: give the models in order of increasing complexity. Not symmetric! The warnings you see are fine - where you see NaN in the table, R just leaves blanks.
lm_fit = smf.ols('medv ~ lstat', data=boston).fit()
sm.stats.anova_lm(lm_fit, lm_fit2)
Read the book - a probability of $10^{-28}$ means it is virtually certain that the quadratic model is better than the linear one.
There is no poly function in StatsModels / patsy, and you cannot cheat by exponentiating (1 + var), so write one yourself.
def patsy_poly(var, n):
    """Return a string: var+I(var**2)+I(var**3)+...+I(var**n)."""
    if n == 0:
        return "1"  # degree 0: intercept-only formula
    elif n == 1:
        return var
    else:
        return patsy_poly(var, n-1) + "+I({}**{})".format(var, n)
#import patsy
#patsy.dmatrix('~' + patsy_poly('lstat',3), boston)
smf.ols("medv ~ " + patsy_poly('lstat',5), data=boston).fit().summary()
Let's use anova to analyze a sequence of models where we keep adding powers of lstat: polynomials from degree 0 (intercept only) up to degree 7.
manymodels = \
[ smf.ols("medv ~ " + patsy_poly('lstat',k), data=boston).fit()
for k in range(8) ]
sm.stats.anova_lm(*manymodels)
Notice that model 6 (the degree-6 polynomial) is not a significant improvement over the degree-5 polynomial: the p-value of roughly 0.20 means there is about a 20% chance of seeing this much improvement purely by chance.
carseats = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/ISLR/Carseats.csv');
carseats.drop(columns='Unnamed: 0', inplace=True);
carseats.head()
carseats_independent = set(carseats.columns) - set(['Sales'])
carseats_dot = '+'.join(carseats_independent)
smf.ols( 'Sales ~ ' + carseats_dot + " + Income:Advertising + Price:Age", data=carseats).fit().summary()
Categorical variables take only a few values. They are encoded as dummy (0/1) indicator variables. (See book.)
The contrasts from R show up as variables like "ShelveLoc[T.Good]". I read this one as "1 (True) when ShelveLoc is Good". A quick way to see the encoding is sketched below.
This guide is written on the assumption that you have some basic knowledge of programming in Python. That means you already know how to write functions.
It would make sense to me to keep the hat matrix after any linear regression.
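In that spirit, here is a hypothetical helper (my own sketch, assuming a statsmodels OLS fit as input; the name hat_matrix is made up) that recovers the hat matrix from a fitted model:
def hat_matrix(lm_fit):
    """Return the hat matrix H = X (X'X)^{-1} X' of a fitted OLS model (my sketch)."""
    X = lm_fit.model.exog
    return X @ np.linalg.inv(X.T @ X) @ X.T
hat_matrix(lm_fit).shape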
Rdatasets ("Datasets distributed with R") is a collection of data sets placed on GitHub by a U of M professor. We will get our data from there from now on!
Need to produce four plots: residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage.
This should mostly be a copy of Emre Can's GSOC project to make R graphs in Python.
def plot_add_label(ax, xlabel=None, ylabel=None, title=None):
    if ax is None: return ax
    if xlabel: ax.set_xlabel(xlabel)
    if ylabel: ax.set_ylabel(ylabel)
    if title: ax.set_title(title)
    return ax
def plot_add_lowess(x, y, ax, color='red'):
    if ax is None: raise RuntimeError('Lowess needs axes to work with!')
    smooth_data = sm.nonparametric.lowess(endog=y, exog=x)
    fig = sns.lineplot(x=smooth_data[:,0], y=smooth_data[:,1], color=color, ax=ax)
    return fig
def plot_1_resid_vs_fitted(lm_fit,
                           ax=None,
                           xlabel="fitted values",
                           ylabel="residuals",
                           title="Residuals vs fitted values"):
    if ax is None: ax = plt.gca()
    x = lm_fit.fittedvalues
    y = lm_fit.resid
    fig = sns.scatterplot(x=x, y=y, ax=ax)
    # dashed gray reference line at zero residual, spanning the x range
    xrange = ax.get_xlim()
    zeros = np.zeros(2)
    sns.lineplot(x=xrange, y=zeros, color='gray', style=zeros, dashes=[[2,4]],
                 legend=False, ax=ax)
    plot_add_lowess(x, y, ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
plot_1_resid_vs_fitted(lm_fit);
def plot_2_residuals_vs_normal(lm_fit,
                               ax=None,
                               xlabel="Theoretical Quantiles",
                               ylabel="Standardized Residuals",
                               title="Normal Q-Q"):
    y_residuals = lm_fit.get_influence().resid_studentized_internal
    if ax is None: ax = plt.gca()
    fig = sm.qqplot(y_residuals, line='q', ax=ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
plot_2_residuals_vs_normal(lm_fit);
def plot_3_std_resid_vs_fitted(lm_fit,
                               ax=None,
                               xlabel="fitted values",
                               ylabel=r"$\sqrt{|\mathrm{Standardized\ residuals}|}$",
                               title="Scale-Location"):
    if ax is None: ax = plt.gca()
    x = lm_fit.fittedvalues
    # plot sqrt(|standardized residuals|) to match the y label and R's Scale-Location plot
    y = np.sqrt(np.abs(lm_fit.get_influence().resid_studentized_internal))
    fig = sns.scatterplot(x=x, y=y, ax=ax)
    plot_add_lowess(x, y, ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
plot_3_std_resid_vs_fitted(lm_fit);
def plot_4_std_resid_vs_leverage(lm_fit,
                                 ax=None,
                                 xlabel="Leverage",
                                 ylabel="Standardized residuals",
                                 title="Residuals vs Leverage"):
    if ax is None: ax = plt.gca()
    x = lm_fit.get_influence().hat_matrix_diag               # leverage
    y = lm_fit.get_influence().resid_studentized_internal    # studentized residuals
    fig = sns.scatterplot(x=x, y=y, ax=ax)
    plot_add_lowess(x, y, ax)
    plot_add_label(ax, xlabel=xlabel, ylabel=ylabel, title=title)
    return fig
plot_4_std_resid_vs_leverage(lm_fit);
def plot_lm_summary(lm_fit,
                    title="Linear Model Summary (R-style)",
                    figsize=(15,10)):
    """Produce an R-style summary of a fitted linear model.
    The linear model must be the result of a statsmodels OLS fit."""
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=figsize)
    plot_1_resid_vs_fitted(lm_fit, ax=ax[0,0])
    plot_2_residuals_vs_normal(lm_fit, ax=ax[0,1])
    plot_3_std_resid_vs_fitted(lm_fit, ax=ax[1,0])
    plot_4_std_resid_vs_leverage(lm_fit, ax=ax[1,1])
    if title: fig.suptitle(title, fontsize=18)
    return fig
plot_lm_summary(lm_fit);