3.1-Worksheet-A-Solutions

3.1 Regression Worksheet A

  1. Analyse the Geyser data set: Old Faithful eruption and "reloading" times.
  2. Learn to draw random samples from a few different distributions.
  3. Create a noisy data set for a regression. With this setup, you can not only run the analysis but also compare the result with the known correct answer.
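
Goals 2 and 3 can be sketched together as follows. This is a minimal illustration, not the worksheet's official solution: the seed, the sample size, the choice of distributions and the "true" coefficients `true_b0` and `true_b1` are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility
n = 200                         # assumed sample size

# Random samples from a few different distributions.
normal_draws = rng.normal(loc=0.0, scale=1.0, size=n)
uniform_draws = rng.uniform(low=0.0, high=10.0, size=n)
exponential_draws = rng.exponential(scale=2.0, size=n)

# A noisy data set with a known answer: y = true_b0 + true_b1 * x + noise.
true_b0, true_b1 = 1.0, 2.0     # assumed "correct" coefficients
x = uniform_draws
y = true_b0 + true_b1 * x + rng.normal(scale=1.0, size=n)

# Least-squares line; polyfit returns the highest-degree coefficient first.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(f"slope: true {true_b1}, estimated {b1_hat:.3f}")
print(f"intercept: true {true_b0}, estimated {b0_hat:.3f}")
```

Because the data were generated from known coefficients, the estimates can be checked directly against the values that produced them.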

Optional: try Matplotlib's low-level plot command to add points or lines to an existing graph.
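
A possible way to approach the optional task is sketched below. It assumes made-up data and an arbitrary seed; the only Matplotlib calls used are `subplots`, `scatter`, `plot` and `legend`. Calling `plt.plot` instead of `ax.plot` would draw on the current axes in the same way.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)            # assumed seed
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)   # made-up noisy line

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")            # the existing graph

# Add a line: plot takes the x values, the y values and a format string.
xs = np.array([0.0, 10.0])
ax.plot(xs, 1.0 + 2.0 * xs, "r-", label="true line")

# Add a single highlighted point with a marker-only format string.
ax.plot([5.0], [11.0], "ko", markersize=10, label="extra point")
ax.legend()
```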

Preliminaries

In [2]:
# !pip install seaborn==0.9.0
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set()
import warnings
warnings.simplefilter('ignore',FutureWarning)
In [3]:
geyser = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/geyser.csv')
geyser.drop(columns=['Unnamed: 0'], inplace=True)

Tools to know

  • Seaborn: scatterplot, lmplot (Linear Model plot), residplot (Residual Plot)
  • Linear models: StatsModels.
    • Making: smf.ols, fit.
    • Results: params, pvalues, summary(), get_prediction(), conf_int(). Use the alpha argument to change the confidence level.
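
The StatsModels tools listed above can be tried out on a small example before turning to the geyser data. The sketch below uses invented data (`toy`, with columns `x` and `y`) and an arbitrary seed; `summary_frame` is a method of the object returned by `get_prediction()` and is not mentioned in the list above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)                          # assumed seed
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(size=100)  # made-up data

fit = smf.ols("y ~ x", data=toy).fit()   # making: ols, then fit

print(fit.params)                 # estimated Intercept and slope
print(fit.pvalues)                # p-value for each parameter
ci95 = fit.conf_int()             # default alpha=0.05, i.e. 95% intervals
ci99 = fit.conf_int(alpha=0.01)   # smaller alpha, wider 99% intervals

# Predictions at new x values, with intervals at the chosen alpha.
new = pd.DataFrame({"x": [2.0, 5.0, 8.0]})
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

A parameter is significant at the 5% level when its entry in `pvalues` is below 0.05, which is equivalent to its 95% confidence interval excluding zero.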

A. Geyser analysis

  1. Make a scatterplot of the data showing time until eruption vs how long the eruption lasts.
  2. Fit a linear model to the data. Which parameters are significant at the 5% level?
  3. Plot the residuals.
  4. Does it appear that the errors all have the same variance? Explain.
  5. When you look at the plots, something should seem unusual. What is it? Devise one way to explore the question and carry it out.
  6. It can also help to think about what additional data, if you had it, would let you investigate the question. Can you think of any?
In [4]:
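# Fit duration as a linear function of waiting time by ordinary least squares.
lm = smf.ols('duration ~ waiting', data=geyser).fit()
In [5]:
# Intercept and slope, in the order StatsModels reports them.
(b0,b1) = lm.params
In [6]:
# Fitted durations from the estimated line.
yfitted = b0 + b1 * geyser.waiting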
In [7]:
# Residuals: observed duration minus fitted duration.
resid = geyser.duration - yfitted
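
The fitted values and residuals computed by hand above are also stored on the results object as `fittedvalues` and `resid`. The sketch below checks that the two routes agree. It uses a made-up stand-in for the geyser data (`toy`, with invented numbers) so that it runs without a download.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)                        # assumed seed
toy = pd.DataFrame({"waiting": rng.uniform(45, 105, size=80)})
# Invented relationship, not the real geyser fit.
toy["duration"] = 7.0 - 0.05 * toy["waiting"] + rng.normal(scale=0.5, size=80)

lm_toy = smf.ols("duration ~ waiting", data=toy).fit()

# By hand, as in the cells above.
b0, b1 = lm_toy.params
yfitted = b0 + b1 * toy.waiting
resid = toy.duration - yfitted

# Straight from the results object.
print(np.allclose(yfitted, lm_toy.fittedvalues))
print(np.allclose(resid, lm_toy.resid))
```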
In [8]:
sns.scatterplot(x=geyser.waiting, y=resid);
In [9]:
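# Flag eruptions lasting less than 3 minutes.
short = geyser.duration < 3
In [10]:
geyser['short'] = short
In [11]:
sns.scatterplot(data=geyser,x='waiting',y='duration');
# Count the eruptions whose duration is recorded as exactly 4 minutes.
sum(geyser['duration'] == 4)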
Out[11]:
53
In [12]:
sns.residplot(y='duration', x='waiting', data=geyser);
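
A residual plot gives a visual answer to question 4. One possible numerical complement, not part of the original worksheet, is the Breusch-Pagan test from StatsModels, whose null hypothesis is that the errors have constant variance. The sketch below runs it on made-up data in which the noise deliberately grows with `x`; the data, seed and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)                     # assumed seed
n = 500
toy = pd.DataFrame({"x": rng.uniform(1, 10, size=n)})
# Noise standard deviation proportional to x, so the variance is not constant.
noise = rng.normal(scale=0.5 * toy["x"].to_numpy(), size=n)
toy["y"] = 1.0 + 2.0 * toy["x"] + noise

fit = smf.ols("y ~ x", data=toy).fit()

# Arguments: the residuals and the design matrix (including the constant).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3g}")
```

A small p-value is evidence against equal error variance. The test says nothing about why the variance changes, so it complements the plot rather than replacing it.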
In [13]:
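# Separate regression lines for short and long eruptions.
sns.lmplot(y='duration', x='waiting', hue='short', data=geyser);
In [14]:
sns.scatterplot(y='duration', x='waiting', hue='short', data=geyser);
In [15]:
# Residual plot for the short eruptions only.
sns.residplot(y='duration', x='waiting', data=geyser[short]);
In [16]:
# Residual plot for the remaining (long) eruptions.
sns.residplot(y='duration', x='waiting', data=geyser[~short]);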
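
The residual plots for the two subsets treat short and long eruptions as separate groups. One way to follow that up, offered here as a sketch rather than as the worksheet's intended answer, is to fit the same linear model within each group and compare slopes and residual spread. The data below are an invented two-cluster stand-in for the geyser data, and `toy` and `fits` are hypothetical names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)                       # assumed seed
n = 150
is_short = rng.random(n) < 0.4
# Invented clusters: short eruptions near 2 minutes, long ones near 4.5.
toy = pd.DataFrame({
    "waiting": np.where(is_short, rng.normal(85, 8, n), rng.normal(60, 10, n)),
    "duration": np.where(is_short, rng.normal(2.0, 0.3, n), rng.normal(4.5, 0.4, n)),
})
toy["short"] = toy["duration"] < 3

# Fit duration ~ waiting separately within each group.
fits = {
    flag: smf.ols("duration ~ waiting", data=group).fit()
    for flag, group in toy.groupby("short")
}
for flag, fit in fits.items():
    print(f"short={flag}: slope={fit.params['waiting']:.4f}, "
          f"p={fit.pvalues['waiting']:.3f}, resid sd={fit.resid.std():.3f}")
```

With the real data, replacing `toy` by `geyser` would give the per-group slopes and their p-values for comparison with the single pooled fit.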