4. Classification Lab


Chapter 4 Lab - Classification

In [1]:
# !pip install seaborn==0.9.0
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels
import patsy
import math
sns.set()
import warnings
warnings.simplefilter('ignore',FutureWarning)
# 4.6.3
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# 4.6.4
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# 4.6.5
from sklearn.neighbors import KNeighborsClassifier
# 4.6.6
import sklearn.preprocessing
# sklearn better than statsmodels?
from sklearn.linear_model import LogisticRegression
In [189]:
# debug printing appears immediately
import sys
def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
In [2]:
smarket_1 = pd.read_csv('Smarket.csv', na_values='?');
#smarket_1 = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/ISLR/Smarket.csv', na_values = '?')
In [3]:
smarket = smarket_1.drop(columns=['Unnamed: 0'])
In [4]:
smarket.corr()
Out[4]:
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
Year 1.000000 0.029700 0.030596 0.033195 0.035689 0.029788 0.539006 0.030095
Lag1 0.029700 1.000000 -0.026294 -0.010803 -0.002986 -0.005675 0.040910 -0.026155
Lag2 0.030596 -0.026294 1.000000 -0.025897 -0.010854 -0.003558 -0.043383 -0.010250
Lag3 0.033195 -0.010803 -0.025897 1.000000 -0.024051 -0.018808 -0.041824 -0.002448
Lag4 0.035689 -0.002986 -0.010854 -0.024051 1.000000 -0.027084 -0.048414 -0.006900
Lag5 0.029788 -0.005675 -0.003558 -0.018808 -0.027084 1.000000 -0.022002 -0.034860
Volume 0.539006 0.040910 -0.043383 -0.041824 -0.048414 -0.022002 1.000000 0.014592
Today 0.030095 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860 0.014592 1.000000
In [5]:
sns.scatterplot(x=smarket.index, y='Volume', data=smarket);

4.6.2 Logistic Regression

There are some snags using StatsModels to do logistic regression.

The biggest issue is that the formula processor from Patsy does not handle the response (y values) directly. For a string response, Patsy generates two complementary indicator columns that sum to one, and the Logit model wants a single response column, so you have to pick one of them. At that point you might as well convert the original Yes/No answers to 1/0 yourself... so I do that. A minimal sketch of the issue follows.
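
To see it concretely, here is a minimal sketch (y_both, X_demo, and y_up are throwaway names, and the exact column names Patsy produces may differ slightly):

y_both, X_demo = patsy.dmatrices("Direction ~ Lag1 + Lag2", data=smarket)
y_both.design_info.column_names   # something like ['Direction[Down]', 'Direction[Up]']
y_up = y_both[:, 1]               # keep only the "Up" indicator as the 0/1 response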

LogitResults are the output of the fitting process. Some important fields of LogitResults are listed here:

  • params: Coefficient estimates.
  • pvalues: p-values for the coefficients.
  • fittedvalues: Fitted values of the linear predictor (not probabilities).
  • bse: Coefficient standard errors.
  • aic: Akaike information criterion, a measure of model fit.

There are also methods to get items like residual deviance. Check the documentation.

In [6]:
smarket["DirectionUp"] = np.array(
    [1 if up else 0 for up in smarket.Direction == "Up"])
In [7]:
dm = sm.Logit.from_formula("DirectionUp ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume", data=smarket).fit()
Optimization terminated successfully.
         Current function value: 0.691034
         Iterations 4
In [8]:
dm.params
Out[8]:
Intercept   -0.126000
Lag1        -0.073074
Lag2        -0.042301
Lag3         0.011085
Lag4         0.009359
Lag5         0.010313
Volume       0.135441
dtype: float64
In [9]:
dm.pvalues
Out[9]:
Intercept    0.600700
Lag1         0.145232
Lag2         0.398352
Lag3         0.824334
Lag4         0.851445
Lag5         0.834998
Volume       0.392404
dtype: float64
In [10]:
dm.fittedvalues.head()  # there are 1250 of these; head() keeps the output short
Out[10]:
0    0.028338
1   -0.074162
2   -0.075480
3    0.060908
4    0.043131
dtype: float64
In [11]:
dm.aic
Out[11]:
1741.5840942032346
In [12]:
dm.llf
Out[12]:
-863.7920471016173
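
llf is the maximized log-likelihood. For a logistic regression the residual deviance mentioned earlier is just -2*llf, and AIC adds a penalty of 2 per estimated parameter, so the AIC shown above can be reproduced by hand (a quick check):

-2 * dm.llf                        # residual deviance, about 1727.58
-2 * dm.llf + 2 * len(dm.params)   # AIC = deviance + 2 * (number of parameters) = 1741.58...
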
In [13]:
dm.bse
Out[13]:
Intercept    0.240737
Lag1         0.050168
Lag2         0.050086
Lag3         0.049939
Lag4         0.049974
Lag5         0.049512
Volume       0.158361
dtype: float64

Predictions for the existing data set come from the predict() method. To get predictions for another dataset, make another data frame.

In [14]:
predictions_1 = dm.predict();
len(predictions_1)
Out[14]:
1250
In [15]:
smarket.head()
Out[15]:
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction DirectionUp
0 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up 1
1 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up 1
2 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down 0
3 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up 1
4 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up 1

Make a new data frame to get the probability of "up" for a given set of predictors.

In [16]:
new_prediction_data = pd.DataFrame({'Lag1': 0.8, 'Lag2': 0.5, 'Lag3': -0.5, 'Lag4': -0.2, 'Lag5': 3, 'Volume': 7 }, 
                                  index = [0])
predictions_2 = dm.predict(new_prediction_data)
predictions_2
Out[16]:
0    0.682655
dtype: float64

We still need to make the confusion matrix and check predictions against the actual outcomes. To build it, the predicted probabilities have to be turned into 0-or-1 decisions; rounding does that for a 0.5 cutoff. With a different cutoff you would compare against the threshold instead of rounding.

In [17]:
up_predicted = pd.Series( dm.predict().round(), name = "Predicted")
up_actual = pd.Series(smarket.DirectionUp, name = "Actual")
confusion_matrix = pd.crosstab(up_actual, up_predicted)
confusion_matrix
Out[17]:
Predicted 0.0 1.0
Actual
0 145 457
1 141 507
In [18]:
correct_count = confusion_matrix[0][0] + confusion_matrix[1][1]
total = len(smarket)
print("supposed accuracy (but bad): ", correct_count / total)
supposed accuracy (but bad):  0.5216

Assessing a model on the same data you used to train it is suspect. To keep things honest, we hold some data out: it is never used for training, only to check the quality of the model. The code below selects a random 10% of the rows as the held-out test set.

In [19]:
test_data = np.random.random(1250) < 0.10
#test_data = smarket.Year >= 2005 # book does this
training_data = ~test_data
In [20]:
sum(test_data)
Out[20]:
130
In [21]:
dm_parts = sm.Logit.from_formula("DirectionUp ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume", data=smarket[training_data]).fit()
correct = pd.Series(smarket.DirectionUp[test_data], name = "Actual")
predicted = pd.Series(dm_parts.predict(smarket[test_data]).round(), name="Predicted")
confusion_matrix = pd.crosstab(correct,predicted)
confusion_matrix
Optimization terminated successfully.
         Current function value: 0.690159
         Iterations 4
Out[21]:
Predicted 0.0 1.0
Actual
0 14 51
1 12 53
In [22]:
accuracy = np.mean(correct == predicted)
print("accuracy (done right): ", accuracy)
accuracy (done right):  0.5153846153846153
In [23]:
error_rate = np.mean(correct != predicted)
error_rate
Out[23]:
0.4846153846153846

Just predicting "up" gives only a 52% accuracy. That is called predicting the "baseline rate". Your prediction method should do better than that to be useful.

In [24]:
sum(smarket.DirectionUp == 1) / len(smarket)
Out[24]:
0.5184

The book suggests trying again with just Lag1 and Lag2, but that does not seem to do any better for me.

In [25]:
dm_parts = sm.Logit.from_formula("DirectionUp ~ Lag1+Lag2", data=smarket[training_data]).fit()
correct = pd.Series(smarket.DirectionUp[test_data], name = "Actual")
predicted = pd.Series(dm_parts.predict(smarket[test_data]).round(), name="Predicted")
accuracy = np.mean(correct == predicted)
accuracy
Optimization terminated successfully.
         Current function value: 0.690749
         Iterations 4
Out[25]:
0.5

As usual, the predict method returns the probability of the response being 1 - in this case, DirectionUp. Both of the values below are close to the middle of the probability range, but 0.48 would be predicted "Down" and 0.50 would be "Up".

In [26]:
more_demo_data = pd.DataFrame({'Lag1': [1.2, 1.1], 'Lag2': [1.5, -0.8]})
dm_parts.predict(more_demo_data)
Out[26]:
0    0.479911
1    0.501970
dtype: float64
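
To turn those probabilities into labels by hand, compare against the 0.5 cutoff (a one-line sketch):

["Up" if p > 0.5 else "Down" for p in dm_parts.predict(more_demo_data)]   # ['Down', 'Up']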

4.6.3 Linear Discriminant Analysis

The biggest thing I notice about using scikit-learn's LDA is that it does not have the nice interface of Pandas DataFrames and Patsy formulas built in; otherwise no actual deficiencies noted. R prints richer summary information for LDA, but do you really use it?

Note: Do not include a constant term in the design matrix for Linear Discriminant Analysis (nor for Quadratic Discriminant Analysis in the next section)!
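
The "+0" in the formulas below is what drops the intercept. A quick check of the column names (sketch):

patsy.dmatrix("Lag1+Lag2", data=smarket).design_info.column_names     # ['Intercept', 'Lag1', 'Lag2']
patsy.dmatrix("Lag1+Lag2+0", data=smarket).design_info.column_names   # ['Lag1', 'Lag2']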

In [27]:
test_data = np.random.random(1250) < 0.10
train_data = ~ test_data
lda = LinearDiscriminantAnalysis()
X = patsy.dmatrix("Lag1+Lag2+0", data=smarket[train_data])
y = smarket.DirectionUp[train_data]
lda_fit = lda.fit(X,y)

The warning about collinear variables means that some predictor columns are (nearly) linear combinations of other columns, which makes the pooled covariance estimate unstable.

Assess the quality of your model by seeing what fraction of your predictions are correct on the test data. The cells below show 46% accuracy.

In [28]:
X_test = patsy.dmatrix("Lag1+Lag2+0", data=smarket[test_data])
ypredict = lda_fit.predict(X_test)
ycorrect = smarket.DirectionUp[test_data]
np.mean(ypredict == ycorrect)
Out[28]:
0.46218487394957986

The score method returns the accuracy without you having to do any of the work by hand. scikit-learn classifiers generally provide score as part of the common classifier interface.

In [29]:
lda_fit.score(X_test, ycorrect)
Out[29]:
0.46218487394957986

Predict the probabilities for each class (one column per class). The book calls these "posterior probabilities": the probability of each class given the observed predictors, computed by Bayes' rule from the class priors and the fitted class densities.

In [30]:
lda_test_predictions = lda_fit.predict_proba(X_test)
lda_test_predictions[1:5]
Out[30]:
array([[0.48110417, 0.51889583],
       [0.45028127, 0.54971873],
       [0.44258594, 0.55741406],
       [0.52719747, 0.47280253]])

decision_function gives a single number per observation: positive means one class (here, 1 = "Up"), negative means the other.

In [31]:
lda.decision_function(X_test)[1:5]
Out[31]:
array([ 0.07561932,  0.1995343 ,  0.23067368, -0.10889735])
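
A quick check (sketch) that the sign agrees with the predicted class:

(lda.decision_function(X_test) > 0).astype(int)[1:5]   # matches lda_fit.predict(X_test)[1:5]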

The book suggests changing the probability threshold. How many observations have more than a 55% posterior probability of class 1 (the second column)?

In [32]:
np.sum(lda_test_predictions[:,1] > 0.55)
Out[32]:
20

4.6.4 Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) is used the same way as LDA; the difference is that its decision boundaries are quadratic rather than linear, so they can be more flexible. Read the book! The code is essentially the same. Note the nice high score, almost 60% right on the test data when you split training and test data by year. Unfortunately, when you randomly choose about 10% of the data for test cases, the amazing prediction power goes away and you are stuck with a lousy ~45% correct.

In [33]:
#test_data = np.random.random(1250) < 0.10
test_data = smarket.Year >= 2005
train_data = ~ test_data
qda = QuadraticDiscriminantAnalysis()
X = patsy.dmatrix("Lag1+Lag2+0", data=smarket[train_data])
y = smarket.DirectionUp[train_data]
qda_fit = qda.fit(X,y)

X_test = patsy.dmatrix("Lag1+Lag2+0", data=smarket[test_data])
ypredict = qda_fit.predict(X_test)
ycorrect = smarket.DirectionUp[test_data]
np.mean(ypredict == ycorrect)
Out[33]:
0.5992063492063492

4.6.5 K-nearest neighbors

K-nearest neighbors is used in much the same way as LDA and QDA.

There is also a related classifier that uses all of the points within a given radius to classify a point; it can work better when the points are not evenly distributed in the sample space. A sketch of that variant follows.
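
The radius-based variant is RadiusNeighborsClassifier. Here is a sketch using the same design matrices; radius=1.0 is an arbitrary choice, and outlier_label handles test points with no training neighbors inside the radius:

from sklearn.neighbors import RadiusNeighborsClassifier
rnn = RadiusNeighborsClassifier(radius=1.0, outlier_label=0)
rnn_fit = rnn.fit(X, y)
np.mean(rnn_fit.predict(X_test) == ycorrect)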

In [34]:
neighbors = 3
knn = KNeighborsClassifier(neighbors)
knn_fit = knn.fit(X, y)
knn_predict = knn.predict(X_test)
In [35]:
np.mean(knn_predict == ycorrect)
Out[35]:
0.5317460317460317

Investigate the accuracy of KNN as the number of neighbors increases.

In [36]:
X_train = patsy.dmatrix("Lag1+Lag2+0", data=smarket[train_data])
y_train = smarket.DirectionUp[train_data]
X_test = patsy.dmatrix("Lag1+Lag2+0", data=smarket[test_data])
y_test = smarket.DirectionUp[test_data]

def knn_accuracy(neighbors):
    knn = KNeighborsClassifier(neighbors)
    knn_fit = knn.fit(X_train, y_train)
    knn_predict = knn_fit.predict(X_test)
    return np.mean(knn_predict == y_test)

ns = np.arange(1,25)
accuracies = [knn_accuracy(n) for n in ns]
ax = sns.lineplot(x=ns, y=accuracies);
ax.set_ylim([0,0.6]);
ax.set_xlabel('neighbors');
ax.set_ylabel('accuracy');

4.6.6 Example: Caravan insurance

Important point #1: standardize your data before you do KNN - or before most methods, really; distance-based classifiers are otherwise dominated by whichever variables happen to be on the largest scale. Use sklearn.preprocessing.scale.

Caravan dataset documentation. Column 86, 'Purchase', indicates whether the person purchased a caravan insurance policy ('Yes' or 'No').

In [37]:
#df_caravan_1 = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/ISLR/Caravan.csv')
df_caravan_1 = pd.read_csv('Caravan.csv')
In [38]:
df_caravan = df_caravan_1.drop(columns=['Unnamed: 0'])
In [39]:
predictors = set(df_caravan.columns) - set(['Purchase'])
purchase = df_caravan.Purchase

How many data points are there? How many purchases vs not? What is the rate of sales?

In [40]:
purchase.count()
Out[40]:
5822
In [41]:
purchase.value_counts()
Out[41]:
No     5474
Yes     348
Name: Purchase, dtype: int64
In [42]:
purchase.value_counts() / purchase.count()
Out[42]:
No     0.940227
Yes    0.059773
Name: Purchase, dtype: float64

scikit-learn has a preprocessing module that will rescale your data so each column has mean zero and standard deviation one: sklearn.preprocessing.scale.
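
A quick sanity check (sketch; X_check is a throwaway name): after scale(), every column should have mean ~0 and standard deviation ~1.

X_check = sklearn.preprocessing.scale(df_caravan[list(predictors)])
X_check.mean(axis=0).round(6)[:5], X_check.std(axis=0).round(6)[:5]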

In [43]:
test = df_caravan.index < 1000
train = ~ test
X_scaled = sklearn.preprocessing.scale(df_caravan[list(predictors)])
X = X_scaled[train]
y = purchase[train]
knn = KNeighborsClassifier(1)
knn_fit = knn.fit(X,y)
In [44]:
y_correct = purchase[test]
X_predict = X_scaled[test]
y_predict = knn_fit.predict(X_predict)
accuracy = np.mean(y_correct == y_predict)
errors = 1-accuracy
accuracy
Out[44]:
0.882
In [45]:
errors
Out[45]:
0.118

How many of the people predicted to buy insurance actually buy it? We want to know whether we should concentrate on the people the model says will buy. By this calculation, about 11-12% of those predictions are correct.

In [46]:
buy = y_predict == 'Yes'
correct_predictions = y_correct == y_predict 
np.mean(correct_predictions[buy])
Out[46]:
0.11688311688311688

We wrap this calculation in a function so we can see how changing the number of neighbors affects the fraction of predicted buyers who actually buy.

In [47]:
test = df_caravan.index < 1000
train = ~ test
X_scaled = sklearn.preprocessing.scale(df_caravan[list(predictors)])

X = X_scaled[train]
y = purchase[train]
y_correct = purchase[test]
X_predict = X_scaled[test]

def knn_success_rate(neighbors):
    knn = KNeighborsClassifier(neighbors)
    knn_fit = knn.fit(X,y)
    y_predict = knn_fit.predict(X_predict)
    correct = y_correct == y_predict
    buy = y_predict == 'Yes'
    return np.mean(correct[buy])

We see that as the number of neighbors goes up, the success rate among predicted buyers goes up. The book does not carry this analysis far enough to get to the suspicious part: a 100% rate with k=9 neighbors. As k increases, the pool of "Yes" predictions shrinks to almost zero.
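
To see the shrinking pool directly, count the YES predictions for each k (a sketch; knn_yes_count is a hypothetical helper using the X, y, and X_predict defined above):

def knn_yes_count(neighbors):
    knn = KNeighborsClassifier(neighbors)
    return np.sum(knn.fit(X, y).predict(X_predict) == 'Yes')

[knn_yes_count(k) for k in np.arange(1,10)]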

In [48]:
neighbors = np.arange(1,10)
brate = [knn_success_rate(k) for k in neighbors]
In [49]:
pframe = pd.DataFrame({'neighbors': neighbors, 'rate': brate})
#odds = [ n%2 == 1 for n in neighbors ] # are odds better than evens?
sns.lineplot(x='neighbors',y='rate',data=pframe);
In [62]:
def score(correct):
    # guard against np.mean of an empty selection (no 'Yes' predictions), which gives nan
    if len(correct) == 0: return 0
    return np.mean(correct)

Next we fit a "saturated" logistic model (all predictors) and look at the success rate among predicted buyers. With a cutoff of 50% the model is terrible, but if you lower the cutoff to around 25% it gets better.

I had trouble with this. Get the raw probabilities with predict_proba. If you train with string answers, the predictions come back as those strings, so make your life easier and train with numbers for the response. predict_proba returns one column for each possible class, so even though we are modeling a binary response we get two columns (for 0 and 1).

In [101]:
test = np.zeros(len(df_caravan), dtype=bool)
test[0:1000] = True                               # the book's split: first 1000 rows as the test set
test = np.random.random(len(df_caravan)) < 0.10   # overridden here: use a random ~10% test set instead
train = ~ test
X_scaled = sklearn.preprocessing.scale(df_caravan[list(predictors)])

noyes = np.vectorize(lambda s: 1 if s=='Yes' else 0)
    
X = X_scaled[train]
y = noyes(purchase[train])
y_correct = noyes(purchase[test])
X_predict = X_scaled[test]

model_fit = LogisticRegression().fit(X,y)
y_predict_p = model_fit.predict_proba(X_predict)
y_predict = (y_predict_p > 0.25)[:,1]  # column 0 = No, 1=Yes
correct = y_correct == y_predict
buy = y_predict == 1
print("Predicted to purchase:",np.sum(buy))
print("Score:", score(correct[buy]))
Predicted to purchase: 19
Score: 0.3157894736842105
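
To confirm which predict_proba column is which (as described above), check classes_ (a quick sketch):

model_fit.classes_                      # array([0, 1]); predict_proba columns follow this order
model_fit.predict_proba(X_predict)[:3]  # one row per test observation, columns for 0 ("No") and 1 ("Yes")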

Note on the text: with the first 1000 rows as the test set, the prediction looks very good: 35 predicted to purchase, with an accuracy score of over 30%. With a random 10% of the data set held out instead, the predictions are much worse: ~18 predicted to purchase and a ~17% score.

In [131]:
noyes = np.vectorize(lambda s: 1 if s=='Yes' else 0)
X_scaled = sklearn.preprocessing.scale(df_caravan[list(predictors)])

def caravan_predict(test, threshold = 0.25):
    """Given test vector of length df_caravan, return (score, number of yes). 
    The threshold parameter controls the probability at which a yes is returned."""
    train = ~ test

    X = X_scaled[train]
    y = noyes(purchase[train])
    y_correct = noyes(purchase[test])
    X_predict = X_scaled[test]

    model_fit = LogisticRegression().fit(X,y)
    y_predict_p = model_fit.predict_proba(X_predict)
    y_predict = (y_predict_p > threshold)[:,1]  # column 0 = No, 1=Yes
    correct = y_correct == y_predict
    buy = y_predict == 1
    return (score(correct[buy]), np.sum(buy))
In [153]:
results = np.array([caravan_predict(np.random.random(len(df_caravan)) < 0.10) 
                    for _ in range(20)]) # 100 takes a long time...
In [154]:
fig, ax = plt.subplots(1,2,figsize=(16,4))

ax[0].set_title("Accuracy given YES prediction")
sns.boxplot(results[:,0], ax=ax[0]);

ax[1].set_title("Number of YES predictions")
sns.boxplot(results[:,1], ax=ax[1]);

We get about 22% accuracy and roughly 15 predicted BUY, averaged over the 20 runs.

In [157]:
np.mean(results,axis=0)
Out[157]:
array([ 0.22785236, 15.5       ])

We vary the threshold and find that thresholds of roughly 20-30% give about the same accuracy on predicted-YES customers (about 22%). This was compute-intensive.

In [201]:
def threshold_finder_2(thresh):
    test_set = np.random.random(len(df_caravan)) < 0.10
    return caravan_predict(test_set,
                           threshold = thresh)
threshold_finder_2v = np.vectorize(threshold_finder_2)

def threshold_finder_1(thresh, N=50):
    return np.mean([threshold_finder_2(thresh) for _ in range(N)],
                   axis=0)

def threshold_finder(thresholds, N=50):
    return np.array([threshold_finder_1(t, N=N) for t in thresholds])
In [197]:
## This takes a long time to run (maybe 15 minutes on my computer),
## so the precomputed results are reloaded from CSV below instead.
#ts = np.linspace(0.01,0.50,50) 
#vals = threshold_finder(ts)
long_work = pd.read_csv('insurance-predictions.csv')
ts = long_work['threshold']
vals = np.zeros((len(long_work),2))
vals[:,0] = long_work['accuracy']
vals[:,1] = long_work['predictedYES']
In [200]:
fig, axs = plt.subplots(1,2,figsize=(16,4))
axs[0].set_title('accuracy on predicted YES customers')
axs[0].set_xlabel('threshold')
axs[0].set_ylabel('accuracy')
sns.scatterplot(ts, vals[:,0], ax=axs[0]);
#axs[1].set_title('number of predicted YES customers')
axs[1].set_title('total predicted')
axs[1].set_xlabel('threshold')
axs[1].set_ylabel('YES')
sns.scatterplot(ts, vals[:,1], ax=axs[1]);
In [195]:
#long_work = pd.DataFrame({'threshold':ts, 'accuracy': vals[:,0], 'predictedYES': vals[:,1]});
#long_work.to_csv('insurance-predictions.csv')