ISL Lab 2.3 Introduction to Statistical Computing with Python

In [ ]:
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
import warnings
warnings.simplefilter('ignore',FutureWarning)

2.3.1 Basic Commands

In [ ]:
np.arange(6)
In [ ]:
a = np.arange(6)
b = a.reshape((2,3))
b
In [ ]:
np.sqrt(b)
In [ ]:
rnorm = scipy.stats.norm(loc=0,scale=1)  # mean =  loc = 0, standard_deviation = scale = 1
x = rnorm.rvs(size=50)

err = scipy.stats.norm(loc=50, scale=0.1)
y = err.rvs(size=50)
In [ ]:
np.corrcoef(x,y)
In [ ]:
np.random.seed(1303)
rnorm.rvs(size=8)
# Notice - same random numbers all of the time
In [ ]:
np.random.seed(3)
y = rnorm.rvs(size=100)
np.mean(y)
In [ ]:
np.var(y)
In [ ]:
np.sqrt(np.var(y))
In [ ]:
np.std(y)

2.3.2 Graphics

In [ ]:
x = rnorm.rvs(size=100)
y = rnorm.rvs(size=100)
ax = sns.scatterplot(x,y);

Have to dig back into MatPlotLib to set axis labels, so all is not perfect.

In [ ]:
ax = sns.scatterplot(x,y);
ax.set(xlabel="the x-axis",ylabel="the y-axis")
plt.show()

Adding a title is a little more annoying, per Stack Overflow explanation of adding a title to a Seaborn plot. There are more complex explanations that work with multiple subplots.

In [ ]:
ax = sns.scatterplot(x,y);
ax.set_xlabel('independent var')
ax.set_ylabel('dependent var')
ax.set_title('Massive Title')
plt.show();

Saving an image to a file is also pretty straightforward using savefig from PyPlot.

In [ ]:
ax = sns.scatterplot(x,y);
ax.set_title('Save this plot')
plt.savefig('unlabeled-axes.png');
# ugliness to avoid showing figure:
fig = plt.gcf()
plt.close(fig)

np.linspace makes equally spaced steps between the start and end

In [ ]:
x = np.linspace(-np.pi,np.pi,50)

A contour plot needs a 2D array of z values (x,y) -> f(x,y).

The hard part is getting the inputs to the function, or convincing f not to vectorize over x,y in parallel.

In [ ]:
x = np.linspace(-np.pi,np.pi,50)
y = x # for clarity only
xx,yy = np.meshgrid(x,y)
In [ ]:
def fbasic(x,y): return np.cos(y) / (1+x**2)
f = np.vectorize(lambda x,y: np.cos(y) / (1+x**2))
z = f(xx,yy)
plt.contour(z);
In [ ]:
plt.contour(z,45);
In [ ]:
def g1(x,y): return (fbasic(x,y)+fbasic(y,x))/2
g = np.vectorize(g1)
z2 = g(xx,yy)
plt.contour(z2,15);

imshow shows an image, like the R command image. Surely there is a way to get the coordinates input as well as the z, but in practice a regular grid seems most likely.

In [ ]:
randompix = np.random.random((16, 16))
plt.imshow(randompix);

3D Rendering

In [ ]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

surf = ax.plot_surface(xx,yy,z, cmap=cm.coolwarm);
plt.show();

2.3.3 Indexing Data

In [ ]:
a = np.arange(1,17).reshape((4,4)).T   # matches R example
In [ ]:
print(a)
a[1,2]

Beware if following code in the book. R indices start at 1, while Python indices start at 0.

In [ ]:
a[[0,2],[1,3]]
In [ ]:
a[[0,2],:]
In [ ]:
a[:,[1,3]]

If you combine the two in one set of brackets, they are traversed in parallel, getting you a[0,1] and a[2,3].

In [ ]:
a[[0,2],[1,3]]

When you want a sub-array, index twice.

In [ ]:
a[[0,2],:][:,[1,3]]

The ix_ function makes grids out of indices that you give it. Clearer for this!

In [ ]:
a[np.ix_([0,2],[1,3])]

Note: R ranges include the last item, Python ranges do not.

In [ ]:
a[np.ix_(np.arange(0,3),np.arange(1,4))]
In [ ]:
a[[0,1],]
In [ ]:
a[:,[0,1]]
In [ ]:
a[1,]

Dropping columns is not as convenient in Python.

In [ ]:
b = np.delete(a,[0,2],0)
b
In [ ]:
c = np.delete(b,[0,2,3],1)
c
In [ ]:
a.shape

2.3.4 Loading Data

Note: To get the data from a preloaded R dataset, I do write_table(the_data, filename="whatever", sep="\t") in R.

Cool fact: read_table can load straight from a URL.

In [ ]:
#auto = pd.read_table("Auto.data")
auto = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")

Get rid of any rows with missing data. This is not always a good idea.

In [ ]:
auto = auto.dropna()
In [ ]:
auto.shape
In [ ]:
auto.columns

2.3.5 Graphical and Numerical Summaries

In [ ]:
sns.scatterplot(auto['cylinders'], auto['mpg']);
In [ ]:
sns.boxplot(x="cylinders", y="mpg", data=auto);
In [ ]:
sns.stripplot(x="cylinders", y="mpg", data=auto);
In [ ]:
sns.distplot(auto['mpg']);
In [ ]:
sns.distplot(auto['mpg'],bins=15, kde=False, vertical=True);
In [ ]:
sns.pairplot(data=auto);
In [ ]:
sns.pairplot(data=auto[['mpg','displacement','horsepower',
                        'weight','acceleration']]);

I am not aware of a way to interactively identify points on a matplotlib plot that is similar to the R command identify.

In [ ]:
auto.describe()
In [ ]:
auto['name'].value_counts().head()
In [ ]:
auto['mpg'].describe()

Miscellaneous Notes

Categorical data can be constructed using astype('category') in Pandas. Read more about categorical data if you need the information.

In [ ]:
auto['cylinders'] = auto['cylinders'].astype('category')
auto['cylinders'].describe()

Homework Starter

Easy access to ISL datasets if you have internet access.

In [ ]:
college = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")