ISL Lab 2.3

In [38]:
import numpy as np
import pandas as pd
import scipy
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
In [202]:
import warnings
warnings.simplefilter('ignore',FutureWarning)
In [2]:
np.arange(6)
Out[2]:
array([0, 1, 2, 3, 4, 5])
In [3]:
a = np.arange(6)
b = a.reshape((2,3))
b
Out[3]:
array([[0, 1, 2],
       [3, 4, 5]])
In [4]:
np.sqrt(b)
Out[4]:
array([[0.        , 1.        , 1.41421356],
       [1.73205081, 2.        , 2.23606798]])
In [11]:
rnorm = scipy.stats.norm(loc=0,scale=1)  # mean =  loc = 0, standard_deviation = scale = 1
x = rnorm.rvs(size=50)

err = scipy.stats.norm(loc=50, scale=0.1)
y = err.rvs(size=50)
In [20]:
np.corrcoef(x,y)
Out[20]:
array([[1.        , 0.09368519],
       [0.09368519, 1.        ]])
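
In the R lab, y is built as x plus the noise term, which is why the book reports a correlation close to 1. The cell above draws y independently of x, so its correlation is near 0. A minimal sketch of the book's construction follows; the seed and the names rng, x_demo, and y_demo are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)  # illustrative seed, not from the notebook

x_demo = scipy.stats.norm(loc=0, scale=1).rvs(size=50, random_state=rng)
noise = scipy.stats.norm(loc=50, scale=0.1).rvs(size=50, random_state=rng)
y_demo = x_demo + noise  # y depends on x, as in the R lab

corr = np.corrcoef(x_demo, y_demo)[0, 1]
print(corr)  # close to 1, because the noise is small relative to the spread of x
```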
In [24]:
np.random.seed(1303)
rnorm.rvs(size=8)
# Seeding the global generator makes the draws reproducible: the same 8 numbers on every run
Out[24]:
array([-0.03425693,  0.06035959,  0.45511859, -0.36593175, -1.6773304 ,
        0.5910023 ,  0.41090101,  0.46972388])
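
Seeding the global NumPy state affects every later draw in the notebook. As a hedged alternative, rvs also accepts a random_state argument, so each call can carry its own generator. The sketch below assumes a SciPy version that accepts np.random.Generator objects; the name rnorm_demo is illustrative. The values will differ from the np.random.seed(1303) output above because a different generator is used.

```python
import numpy as np
import scipy.stats

rnorm_demo = scipy.stats.norm(loc=0, scale=1)

# Two generators built from the same seed produce the same stream
first = rnorm_demo.rvs(size=8, random_state=np.random.default_rng(1303))
second = rnorm_demo.rvs(size=8, random_state=np.random.default_rng(1303))

print(np.array_equal(first, second))  # True: same seed, same draws
```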
In [25]:
np.random.seed(3)
y = rnorm.rvs(size=100)
np.mean(y)
Out[25]:
-0.10863707440606224
In [26]:
np.var(y)
Out[26]:
1.132081888283007
In [27]:
np.sqrt(np.var(y))
Out[27]:
1.0639933685333791
In [30]:
np.std(y)
Out[30]:
1.0639933685333791
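
One difference from the book is worth checking here: np.var and np.std divide by n by default, while R's var() and sd() divide by n - 1. Passing ddof=1 gives the R convention. A small sketch on made-up numbers (y_demo is illustrative data, not the sample drawn above):

```python
import numpy as np

y_demo = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(y_demo)             # divides by n (NumPy default, ddof=0)
sample_var = np.var(y_demo, ddof=1)  # divides by n - 1, like R's var()

print(pop_var, sample_var)
```

For this array the sum of squared deviations is 32, so the two results are 32/8 and 32/7.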

2.3.2 Graphics

In [44]:
x = rnorm.rvs(size=100)
y = rnorm.rvs(size=100)
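ax = sns.scatterplot(x=x, y=y);  # keyword arguments; recent Seaborn releases reject positional x, y

Seaborn does not label the axes when given plain arrays, so use the Matplotlib Axes object it returns to set them.

In [45]:
ax = sns.scatterplot(x=x, y=y);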
ax.set(xlabel="the x-axis",ylabel="the y-axis")
plt.show()

Adding a title works the same way: call set_title on the Axes that Seaborn returns (see the Stack Overflow discussion of adding a title to a Seaborn plot). Figures with multiple subplots need a more involved approach.

In [53]:
ax = sns.scatterplot(x=x, y=y);
ax.set_xlabel('independent var')
ax.set_ylabel('dependent var')
ax.set_title('Massive Title')
plt.show();

Saving an image to a file is also pretty straightforward using savefig from PyPlot.

In [58]:
ax = sns.scatterplot(x=x, y=y);
ax.set_title('Save this plot')
plt.savefig('unlabeled-axes.png');
# ugliness to avoid showing figure:
fig = plt.gcf()
plt.close(fig)
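
The same save-without-showing pattern can be written with an explicit Figure, which avoids the plt.gcf() lookup. This is a sketch in plain Matplotlib rather than Seaborn; the seed, the demo arrays, and the output file name are assumptions made for the example.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # illustrative seed
x_demo = rng.normal(size=100)
y_demo = rng.normal(size=100)

fig, ax = plt.subplots()                # explicit Figure and Axes
ax.scatter(x_demo, y_demo)
ax.set_title("Save this plot")

# Hypothetical output location, chosen only for this example
out_path = os.path.join(tempfile.gettempdir(), "scatter-demo.png")
fig.savefig(out_path)
plt.close(fig)                          # release the figure without displaying it
```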

np.linspace returns evenly spaced values from start to end, with both endpoints included.

In [ ]:
x = np.linspace(-np.pi,np.pi,50)

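A contour plot needs a 2D array of z values, one for every grid point: (x,y) -> f(x,y).

The hard part is feeding f every (x,y) combination. Calling f(x,y) on the two 1D arrays pairs their elements in parallel and returns only the diagonal; np.meshgrid expands x and y into 2D coordinate grids that cover all combinations.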

In [71]:
x = np.linspace(-np.pi,np.pi,50)
y = x # for clarity only
xx,yy = np.meshgrid(x,y)
In [83]:
def fbasic(x,y): return np.cos(y) / (1+x**2)
f = np.vectorize(fbasic)  # apply the scalar function elementwise over the grids
z = f(xx,yy)
plt.contour(z);
In [80]:
plt.contour(z,45);
In [92]:
def g1(x,y): return (fbasic(x,y)+fbasic(y,x))/2
g = np.vectorize(g1)
z2 = g(xx,yy)
plt.contour(z2,15);
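
Because np.cos and the arithmetic operators already work elementwise on arrays, np.vectorize is optional once the grids exist. The sketch below evaluates the same function directly on the meshgrid output and checks that it matches the np.vectorize route. It also passes the grids to plt.contour so the axes are in data units rather than array indices. The names ending in _demo are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.linspace(-np.pi, np.pi, 50)
xx_demo, yy_demo = np.meshgrid(x_demo, x_demo)

# NumPy ufuncs already operate elementwise on the 2D grids
z_direct = np.cos(yy_demo) / (1 + xx_demo**2)

# Same values as the np.vectorize route
z_vec = np.vectorize(lambda x, y: np.cos(y) / (1 + x**2))(xx_demo, yy_demo)
print(np.allclose(z_direct, z_vec))

# Passing the grids labels the axes from -pi to pi instead of 0 to 49
plt.contour(xx_demo, yy_demo, z_direct, 15);
```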

imshow shows an image, like the R command image. It takes only the z array and assumes a regular grid; its extent argument maps the array onto x and y data coordinates.

In [99]:
randompix = np.random.random((16, 16))
plt.imshow(randompix);
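
A minimal sketch of supplying coordinates to imshow through its extent argument, assuming a regular grid. The function and grid mirror the contour example; the _demo names are illustrative. For irregular grids, plt.pcolormesh, which accepts explicit X and Y arrays, may be a better fit.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.linspace(-np.pi, np.pi, 50)
xx_demo, yy_demo = np.meshgrid(x_demo, x_demo)
z_demo = np.cos(yy_demo) / (1 + xx_demo**2)

fig, ax = plt.subplots()
im = ax.imshow(
    z_demo,
    extent=[x_demo.min(), x_demo.max(), x_demo.min(), x_demo.max()],
    origin="lower",   # put row 0 at the bottom so y increases upward
)
fig.colorbar(im, ax=ax);
```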

3D Rendering

In [109]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

surf = ax.plot_surface(xx,yy,z, cmap=cm.coolwarm);
plt.show();

2.3.3 Indexing Data

In [163]:
a = np.arange(1,17).reshape((4,4)).T   # matches R example
In [166]:
print(a)
a[1,2]
[[ 1  5  9 13]
 [ 2  6 10 14]
 [ 3  7 11 15]
 [ 4  8 12 16]]
Out[166]:
10

Beware if following code in the book. R indices start at 1, while Python indices start at 0.

In [165]:
a[[0,2],[1,3]]
Out[165]:
array([ 5, 15])
In [128]:
a[[0,2],:]
Out[128]:
array([[ 1,  5,  9, 13],
       [ 3,  7, 11, 15]])
In [129]:
a[:,[1,3]]
Out[129]:
array([[ 5, 13],
       [ 6, 14],
       [ 7, 15],
       [ 8, 16]])

If you combine the two in one set of brackets, they are traversed in parallel, getting you a[0,1] and a[2,3].

In [127]:
a[[0,2],[1,3]]
Out[127]:
array([ 5, 15])

When you want a sub-array, index twice.

In [131]:
a[[0,2],:][:,[1,3]]
Out[131]:
array([[ 5, 13],
       [ 7, 15]])

The ix_ function makes grids out of indices that you give it. Clearer for this!

In [168]:
a[np.ix_([0,2],[1,3])]
Out[168]:
array([[ 5, 13],
       [ 7, 15]])
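
To see why np.ix_ produces a grid, it helps to look at what it returns: one column-shaped and one row-shaped index array, which NumPy then broadcasts against each other. A short sketch, rebuilding the same 4x4 array under the illustrative name a_demo:

```python
import numpy as np

a_demo = np.arange(1, 17).reshape((4, 4)).T

rows, cols = np.ix_([0, 2], [1, 3])
print(rows.shape, cols.shape)   # (2, 1) (1, 2)

# Broadcasting (2, 1) against (1, 2) yields every row/column pairing
sub = a_demo[rows, cols]
print(sub)
```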

Note: R ranges include the last item, Python ranges do not.

In [170]:
a[np.ix_(np.arange(0,3),np.arange(1,4))]
Out[170]:
array([[ 5,  9, 13],
       [ 6, 10, 14],
       [ 7, 11, 15]])
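
For contiguous ranges like this one, plain slices give the same block without np.ix_. The sketch below assumes the same 4x4 array, rebuilt as a_demo. Note that a basic slice returns a view onto the original data, whereas indexing with lists or np.ix_ returns a copy.

```python
import numpy as np

a_demo = np.arange(1, 17).reshape((4, 4)).T

# R: a[1:3, 2:4]      (1-based, end included)
# Python: a[0:3, 1:4] (0-based, end excluded)
block = a_demo[0:3, 1:4]
print(block)

print(np.shares_memory(block, a_demo))  # True: the slice is a view, not a copy
```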
In [172]:
a[[0,1],]
Out[172]:
array([[ 1,  5,  9, 13],
       [ 2,  6, 10, 14]])
In [174]:
a[:,[0,1]]
Out[174]:
array([[1, 5],
       [2, 6],
       [3, 7],
       [4, 8]])
In [136]:
a[1,]
Out[136]:
array([ 2,  6, 10, 14])

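Dropping rows or columns is not as convenient in Python as R's negative indices. np.delete takes the indices to remove and the axis (0 for rows, 1 for columns).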

In [176]:
b = np.delete(a,[0,2],0)
b
Out[176]:
array([[ 2,  6, 10, 14],
       [ 4,  8, 12, 16]])
In [177]:
c = np.delete(b,[0,2,3],1)
c
Out[177]:
array([[6],
       [8]])
In [138]:
a.shape
Out[138]:
(4, 4)
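
An alternative to the np.delete calls above is to build boolean "keep" masks and index with them. This is only a sketch of one possible approach; keep_rows, keep_cols, and c_demo are illustrative names. np.ix_ treats boolean sequences as masks for the corresponding dimension.

```python
import numpy as np

a_demo = np.arange(1, 17).reshape((4, 4)).T

keep_rows = np.ones(a_demo.shape[0], dtype=bool)
keep_rows[[0, 2]] = False          # drop rows 0 and 2

keep_cols = np.ones(a_demo.shape[1], dtype=bool)
keep_cols[[0, 2, 3]] = False       # drop columns 0, 2 and 3

c_demo = a_demo[np.ix_(keep_rows, keep_cols)]
print(c_demo)
```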

2.3.4 Loading Data

Note: To get the data from a preloaded R dataset, I run write.table(the_data, file="whatever", sep="\t") in R.

Cool fact: read_csv and read_table can load straight from a URL.

In [238]:
#auto = pd.read_table("Auto.data")
# Missing horsepower values are coded as "?" in this file, so read them as NaN
auto = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na_values="?")

Get rid of any rows with missing data. This is not always a good idea.

In [240]:
auto = auto.dropna()
In [241]:
auto.shape
Out[241]:
(392, 9)
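
Whether dropna removes anything depends on how the file marks missing entries. If a placeholder such as "?" is used, pandas reads the column as text and sees no missing values unless na_values says otherwise. A self-contained sketch on a made-up three-row extract (the rows and the names csv_text, raw, and clean are hypothetical):

```python
import io
import pandas as pd

# Hypothetical extract in a layout similar to Auto.csv
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,chevrolet chevelle malibu
25.0,4,?,ford pinto
21.0,6,?,ford maverick
"""

raw = pd.read_csv(io.StringIO(csv_text))                   # "?" kept as text
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")  # "?" read as NaN

print(raw.dropna().shape)     # (3, 4): nothing is recognized as missing
print(clean.dropna().shape)   # (1, 4): the two "?" rows are dropped
```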
In [242]:
auto.columns
Out[242]:
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

2.3.5 Graphical and Numerical Summaries

In [194]:
sns.scatterplot(x="cylinders", y="mpg", data=auto);
In [193]:
sns.boxplot(x="cylinders", y="mpg", data=auto);
In [192]:
sns.stripplot(x="cylinders", y="mpg", data=auto);
In [203]:
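sns.histplot(x=auto['mpg'], kde=True, stat='density');  # histplot replaces the deprecated distplot
In [211]:
sns.histplot(y=auto['mpg'], bins=15);  # passing y= draws horizontal bars, like vertical=True did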
In [190]:
sns.pairplot(data=auto);
In [213]:
sns.pairplot(data=auto[['mpg','displacement','horsepower',
                        'weight','acceleration']]);

I am not aware of a way to interactively identify points on a matplotlib plot that is similar to the R command identify.

In [214]:
auto.describe()
Out[214]:
              mpg   cylinders  displacement  horsepower       weight  acceleration        year      origin
count  392.000000  392.000000    392.000000  392.000000   392.000000    392.000000  392.000000  392.000000
mean    23.445918    5.471939    194.411990  104.469388  2977.584184     15.541327   75.979592    1.576531
std      7.805007    1.705783    104.644004   38.491160   849.402560      2.758864    3.683737    0.805518
min      9.000000    3.000000     68.000000   46.000000  1613.000000      8.000000   70.000000    1.000000
25%     17.000000    4.000000    105.000000   75.000000  2225.250000     13.775000   73.000000    1.000000
50%     22.750000    4.000000    151.000000   93.500000  2803.500000     15.500000   76.000000    1.000000
75%     29.000000    8.000000    275.750000  126.000000  3614.750000     17.025000   79.000000    2.000000
max     46.600000    8.000000    455.000000  230.000000  5140.000000     24.800000   82.000000    3.000000
In [217]:
auto['name'].value_counts().head()
Out[217]:
ford pinto            5
amc matador           5
toyota corolla        5
toyota corona         4
chevrolet chevette    4
Name: name, dtype: int64
In [218]:
auto['mpg'].describe()
Out[218]:
count    392.000000
mean      23.445918
std        7.805007
min        9.000000
25%       17.000000
50%       22.750000
75%       29.000000
max       46.600000
Name: mpg, dtype: float64

Miscellaneous Notes

Categorical data can be constructed using astype('category') in Pandas. The Pandas documentation on categorical data covers the details if you need them.

In [221]:
auto['cylinders'] = auto['cylinders'].astype('category')
auto['cylinders'].describe()
Out[221]:
count     392
unique      5
top         4
freq      199
Name: cylinders, dtype: int64
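
A small sketch of what the conversion does, on made-up values rather than the Auto data (cyl and cyl_cat are illustrative names). The distinct values become the categories, and describe switches from numeric summaries to count, unique, top, and freq.

```python
import pandas as pd

cyl = pd.Series([4, 8, 4, 6, 4, 8], name="cylinders")  # made-up values
cyl_cat = cyl.astype("category")

print(cyl_cat.cat.categories.tolist())               # [4, 6, 8]
print(cyl_cat.value_counts().sort_index().tolist())  # [3, 1, 2]
print(cyl_cat.describe())
```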

Homework Starter

Easy access to ISL datasets if you have internet access.

In [226]:
college = pd.read_csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")