Chapter3-Section2-MultipleRegression

Chapter 3 Section 2

In [1]:
# !pip install seaborn==0.9.0
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set()
import warnings
warnings.simplefilter('ignore',FutureWarning)
In [2]:
advertising = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
advertising.drop(columns='Unnamed: 0', inplace=True)
In [3]:
df = advertising.copy()
df_dependent = 'sales'
df_independent = ['TV','radio','newspaper']
df.columns
Out[3]:
Index(['TV', 'radio', 'newspaper', 'sales'], dtype='object')
In [4]:
lm_tv = smf.ols('sales ~ TV', data=df).fit()
lm_radio = smf.ols('sales ~ radio', data=df).fit()
lm_newspaper = smf.ols('sales ~ newspaper', data=df).fit()
lm_all = smf.ols('sales ~ TV + radio + newspaper', data=df).fit()

Reading Guide

  • Why do we see newspaper having a significant effect on sales in a single regression if it actually does not? Explain carefully.
  • What do ice cream sales have to do with newspaper advertising (in the book)?

F-statistic

  • (Section 3.2.2) What is the purpose of the F-statistic? How about its definition?
  • What value of the F-statistic indicates no relationship?
  • Why not just use individual p-values for predictors?
  • How does the F-statistic improve the individual p-value situation? (What extra thing does it do?)
  • In what situation is the F-statistic useless?
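The questions about individual p-values can be explored with a small simulation. The sketch below is not part of the original notebook: it uses synthetic data (the sizes `n` and `p` and the seed are arbitrary choices) in which the response is unrelated to every predictor. At the 0.05 level we expect about 5% of the individual t-tests to look significant purely by chance, while the overall F-test considers all coefficients at once.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): y is pure noise,
# so no predictor has a real relationship with the response.
rng = np.random.default_rng(42)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(n), X])
XtX_inv = np.linalg.inv(X1.T @ X1)
beta = XtX_inv @ X1.T @ y
resid = y - X1 @ beta
df_resid = n - p - 1
sigma2 = resid @ resid / df_resid

# Individual two-sided t-tests, one per coefficient.
se = np.sqrt(sigma2 * np.diag(XtX_inv))
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df_resid)
n_significant = int((p_values[1:] < 0.05).sum())  # skip the intercept

# Overall F-test of H0: all slope coefficients are zero.
tss = ((y - y.mean()) ** 2).sum()
rss = resid @ resid
f_value = ((tss - rss) / p) / (rss / df_resid)
f_p_value = stats.f.sf(f_value, p, df_resid)

print("predictors with p < 0.05:", n_significant, "of", p)
print("overall F p-value:", f_p_value)
```

Rerunning with different seeds shows how often a handful of "significant" predictors appear even though none is real.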

What is important?

  • List different methods of picking important variables in your notes.
  • Why is it not feasible to just try all of the variable combinations and pick the one that performs best? (Actually, we will try this later. It is "possible", just not feasible for many variables. Why not?)

Model Fit

  • Can $R^2$ be used to select important variables (by just adding more variables until $R^2$ is maximized)?
  • Examine Figure 3.5. The errors have patterns that are discussed in the book. Can you find them? Describe one.
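As a hint for the $R^2$ question, here is a minimal sketch on synthetic data (not the advertising data; the helper `r_squared` and all sizes are made up for illustration). It adds pure-noise columns to a regression one at a time and records $R^2$ after each addition.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X; an intercept column is added here."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    return 1.0 - rss / tss

# Synthetic data (an assumption): one real predictor plus noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(size=n)

r2_values = [r_squared(X, y)]
for _ in range(5):
    # Append a column that has nothing to do with y.
    X = np.column_stack([X, rng.normal(size=n)])
    r2_values.append(r_squared(X, y))

print(r2_values)
```

Because each model contains the previous one, the residual sum of squares cannot go up, so $R^2$ never goes down, even for useless variables.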

Predictions

  • What is the prediction interval used for?
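To make the prediction interval concrete, the sketch below compares it with the confidence interval for the mean response at one point. It uses synthetic one-predictor data (all numbers, including the point `x = 5`, are assumptions for illustration) and the standard OLS formulas rather than any StatsModels call.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption): a noisy straight line.
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# OLS fit with an intercept.
X1 = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X1.T @ X1)
beta = XtX_inv @ X1.T @ y
resid = y - X1 @ beta
df_resid = n - 2
sigma2 = resid @ resid / df_resid      # estimate of the error variance

# New point x = 5 (the leading 1.0 is the intercept term).
x0 = np.array([1.0, 5.0])
y0_hat = x0 @ beta
leverage = x0 @ XtX_inv @ x0
t_crit = stats.t.ppf(0.975, df_resid)  # 95% two-sided

# Confidence interval: uncertainty in the average response at x0.
ci_half = t_crit * np.sqrt(sigma2 * leverage)
# Prediction interval: also includes the noise of one new observation.
pi_half = t_crit * np.sqrt(sigma2 * (1.0 + leverage))

print("fit:", y0_hat, "CI half-width:", ci_half, "PI half-width:", pi_half)
```

The extra `1.0 +` inside the square root is the irreducible error of a single new observation, which is why the prediction interval is always wider.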

NOTES

  • Make sure to do the simple regressions at the start.
  • Compare to multiple regression.

DISCUSS

  • A simpler formula for the F-statistic, thinking the way StatsModels does.
  • Distributions. What is the F-distribution? How do we do math with one in SciPy?
  • Why is the F-statistic important? Consider a model with lots of predictors and their individual p-values.
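For the SciPy question, one possible starting point is `scipy.stats.f`. The sketch below assumes the advertising fit has `n = 200` observations and `p = 3` predictors (the row count is not shown in this notebook, so treat it as an assumption) and uses the rounded F value reported by `lm_all.fvalue`.

```python
from scipy import stats

n, p = 200, 3      # assumed: observations and predictors in the full model
f_value = 570.27   # rounded value of lm_all.fvalue

dfn = p            # numerator degrees of freedom
dfd = n - p - 1    # denominator degrees of freedom

# Survival function: P(F >= f_value) when there is no relationship at all.
p_value = stats.f.sf(f_value, dfn, dfd)

# The F value that would be exceeded only 5% of the time under the null.
critical_5pct = stats.f.ppf(0.95, dfn, dfd)

print("p-value:", p_value)
print("5% critical value:", critical_5pct)
```

Using `sf` instead of `1 - cdf` keeps precision when the p-value is extremely small.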

TODO

  • Graphics - 2d regression plane plot with errors? or just discuss 1d situation?
In [ ]:
lm_newspaper.summary()
In [ ]:
lm_all.summary()
In [9]:
df[df_independent].corr()
Out[9]:
                 TV     radio  newspaper
TV         1.000000  0.054809   0.056648
radio      0.054809  1.000000   0.354104
newspaper  0.056648  0.354104   1.000000

An easier way to think about the F value: (mean explained sum of squares) / (mean residual sum of squares), where each sum of squares is divided by its degrees of freedom.

In [11]:
lm_all.fvalue
Out[11]:
570.2707036590942
In [12]:
lm_all.mse_model / lm_all.mse_resid
Out[12]:
570.2707036590942
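The ratio above agrees with the book's formula, $F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$, because $TSS - RSS$ is the explained sum of squares when the model has an intercept. Here is a minimal check on synthetic data (not the advertising data; the coefficients and sizes are made up), using only NumPy.

```python
import numpy as np

# Synthetic data (an assumption): three predictors, one with no effect.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.5, 0.0]) + rng.normal(size=n)

# OLS fit with an intercept.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ beta

tss = ((y - y.mean()) ** 2).sum()       # total sum of squares
rss = ((y - fitted) ** 2).sum()         # residual sum of squares
ess = ((fitted - y.mean()) ** 2).sum()  # explained sum of squares

# Book formula (ISL Section 3.2.2).
f_book = ((tss - rss) / p) / (rss / (n - p - 1))
# Ratio of mean squares, the same idea as mse_model / mse_resid.
f_mean_squares = (ess / p) / (rss / (n - p - 1))

print(f_book, f_mean_squares)
```

The two values match up to floating-point error, which is the identity the two notebook cells above illustrate.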