Project Three Notebook

For my project, I was interested in looking at the correlation between the college a student attended and the likelihood of them receiving a scholarship. I found the data on Data.gov authored by the New York State Higher Education Services Corporation. The data provides information on the name of the NY college, the sector group to which that college belongs, the amount of recipients of a scholarship, their full-time equivelance, and the total amount of funding that school received from scholarships all dating back to 2009.

Process & Development

In the beginning on the project, it was clear that in order to look at the correlation between college a student attended and the likelihood of them receiving a scholarship, the data related to college would need to be numerical in some way as the program was not function with text. In order to address that, I decided to find the correlation between sector groups (grouping of colleges) and likelihood of a scholarship as the different sectino groups were associated with a number.

I attemped to look into the panda documentation on their website to find a search & replace function to identify all instances of a sector group and replace it with their corresponding number. For example, I waI nted to change all "1-CUNY CC" to just simply "1". With that change, the data would keep the same information but would now be funcation with the programming. Afte unfortunately not being able to get the syntax to work for the function I found in the pandas library, I manually made the change. Later in the project, I realized the the logical tests used in the example to simplify income, can also be used as a find & replace function. I undid the changes I made manual to the csv document and instead used code to select each of the 9 college sector groups and replace them with a singular number. To further clean up the data, I deleted several columns that I did not want to address in the statistical analysis and renamed them for simplicity sake. Finally, I ran the function to ensure the data was in fact number and to remove any entries that werent.

Next, I defind two sets of data: one with the independent variable as Groups and the other with the independent variable as College. The objective was to see which of the two was more accurate in predicting whether or not a student would have a scholarship. Finally, I ran linear regressions on both of them to produce R-Squared and Accuracy values. For the data set labeled "Group_df", the R-Squared and Accuracy values were 0.1185 and 0.0836 respectively. What this means, is that the model accounts for only 11.85% of the variance in the data and is correct in its prediction only 8.36% of the time. In running the second data set labeled "College_df", there were slight but interesting changes in the R-Squared and Accuracy values. For the R-Squared value, the new data set accounted for 13.98% of the variance in the data a small improvement from the previous data set. However, the Accuracy value decreased to 0.805 and therefore the model was correctly predictive only 8.05% of the time.

In [316]:
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
 
import seaborn as sns
from statsmodels.formula.api import ols

from sklearn import linear_model
from sklearn import metrics

from sklearn.linear_model import LogisticRegression
from patsy import dmatrices

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt

import random



# Custom functions

def evaluate(pred, labels_test):
    acc = accuracy_score(pred, labels_test)
    print ("Accuracey: %s"%acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()

    recall = tp / (tp + fp)
    percision = tp / (tp + fn)
    f1 = (2 / ((1/recall)+(1/percision)))

    print ("")
    print ("True Negatives: %s"%tn)
    print ("False Positives: %s"%fp)
    print ("False Negatives: %s"%fn)
    print ("True Positives: %s"%tp)
    print ("Recall: %s"%recall)
    print ("Precision: %s"%percision)
    print ("F1 Score: %s"%f1)

def plot_bound(Z_val,data,col1,col2,binary):
    # Z-val equals "Yes" value. E.g., "Y" or "1". 
    # data equals df
    # col1 and col2 defines which colums to use from data
    # Plot binary decision boundary. 
    # For this, we will assign a color to each
    # point in the mesh [x_min, m_max]x[y_min, y_max].
    
    x_min = float(data.iloc[:,[col1]].min())-float(data.iloc[:,[col1]].min())*0.10 
    x_max = float(data.iloc[:,[col1]].max()+float(data.iloc[:,[col1]].min())*0.10)
    y_min = 0.0; 
    y_max = float(training.iloc[:,[col2]].max())+float(training.iloc[:,[col2]].max())*0.10
    h_x = (x_max-x_min)/100  # step size in the mesh
    h_y = (y_max-y_min)/100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])   
        Z = np.where(Z=="Y",1,0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()

Data Cleaning

Here we load the data we collected and get it all ready to feed to our statistical model(s). That is, we are trying to make a table with one target column and one or more features. Here I'm loading happiness.csv from: https://data.somervillema.gov/Happiness/Somerville-Happiness-Survey-responses-2011-2013-20/w898-3dfm Note: you can find information on the data elements at this link.

In [317]:
# Load and peek at your data. Change the file name as needed. 
raw_data_df = pd.read_csv('scholarships.csv', parse_dates=[0]) 
raw_data_df.head()
Out[317]:
Academic Year TAP College Code Federal School Code TAP College Name TAP Sector Group Scholarship Headcount Scholarship FTE Scholarship Dollars
0 2016-01-01 1410 2687 CUNY BROOKLYN COLLEGE 1-CUNY SR 288 234.33 306098.35
1 2016-01-01 875 2838 SUNY STONY BROOK 3-SUNY SO 1139 1047.00 2278194.14
2 2016-01-01 1402 2694 CUNY KINGSBOROUGH CC 2-CUNY CC 32 22.50 59932.25
3 2016-01-01 5083 2735 HILBERT COLLEGE 5-INDEPENDENT 1 0.50 225.00
4 2016-01-01 140 2705 COLLEGE OF SAINT ROSE 5-INDEPENDENT 68 64.50 108317.00
In [ ]:
 
In [318]:
# for multiple columns
processed_data_df = raw_data_df.drop([
                                            'Federal School Code',
                                            'TAP College Name',
                                            'Academic Year',
                                           ], 1)
processed_data_df.head()
Out[318]:
TAP College Code TAP Sector Group Scholarship Headcount Scholarship FTE Scholarship Dollars
0 1410 1-CUNY SR 288 234.33 306098.35
1 875 3-SUNY SO 1139 1047.00 2278194.14
2 1402 2-CUNY CC 32 22.50 59932.25
3 5083 5-INDEPENDENT 1 0.50 225.00
4 140 5-INDEPENDENT 68 64.50 108317.00
In [319]:
# You can rename columns like so.
processed_data_df = processed_data_df.rename(columns={ 
                                                        'TAP College Code': 'College',
                                                        'TAP Sector Group': 'Group',
                                                        'Scholarship Headcount': 'Headcount',
                                                        'Scholarship Dollars': 'Amount',
                                                        'Scholarship FTE': 'FTE',
                                                     })
processed_data_df.head()
Out[319]:
College Group Headcount FTE Amount
0 1410 1-CUNY SR 288 234.33 306098.35
1 875 3-SUNY SO 1139 1047.00 2278194.14
2 1402 2-CUNY CC 32 22.50 59932.25
3 5083 5-INDEPENDENT 1 0.50 225.00
4 140 5-INDEPENDENT 68 64.50 108317.00
In [320]:
processed_data_df.loc[processed_data_df['Group'] == '1-CUNY SR', 'Group'] = 1 
processed_data_df.loc[processed_data_df['Group'] == '2-CUNY CC', 'Group'] = 2
processed_data_df.loc[processed_data_df['Group'] == '3-SUNY SO', 'Group'] = 3 
processed_data_df.loc[processed_data_df['Group'] == '4-SUNY CC', 'Group'] = 4 
processed_data_df.loc[processed_data_df['Group'] == '5-INDEPENDENT', 'Group'] = 5 
processed_data_df.loc[processed_data_df['Group'] == '6-BUS. DEGREE', 'Group'] = 6 
processed_data_df.loc[processed_data_df['Group'] == '7-BUS. NON-DEG', 'Group'] = 7 
processed_data_df.loc[processed_data_df['Group'] == '8-OTHER', 'Group'] = 8 
processed_data_df.loc[processed_data_df['Group'] == 'VOCATIONAL - VET SCHOOLS ONLY', 'Group'] = 9

processed_data_df
Out[320]:
College Group Headcount FTE Amount
0 1410 1 288 234.33 306098.35
1 875 3 1139 1047.00 2278194.14
2 1402 2 32 22.50 59932.25
3 5083 5 1 0.50 225.00
4 140 5 68 64.50 108317.00
5 2321 6 4 2.00 6995.01
6 2042 4 41 34.67 65402.90
7 1040 5 54 51.25 180206.00
8 2265 5 2 2.00 17020.00
9 3015 3 9 7.00 51952.50
10 295 5 62 56.50 104770.50
11 6480 6 1 0.33 5041.57
12 5600 5 1 1.00 6470.01
13 525 5 252 244.04 261930.51
14 2260 4 13 10.25 31701.50
15 1050 5 34 32.83 27823.33
16 2105 4 35 29.75 36395.20
17 2036 4 8 6.00 8200.00
18 1414 1 168 138.00 250432.50
19 2243 4 8 4.79 5895.00
20 2230 4 13 9.00 19201.10
21 845 5 5 5.00 2500.00
22 505 5 31 26.79 51234.55
23 447 5 13 12.50 53900.00
24 215 5 80 76.50 108900.00
25 130 5 25 23.00 59498.00
26 7830 6 7 6.50 21188.50
27 8501 9 6 3.00 19218.00
28 6010 3 45 35.79 98068.33
29 5775 5 4 1.75 54447.66
... ... ... ... ... ...
2446 1653 8 2 2.00 931.00
2447 5625 3 9 3.92 18675.62
2448 5620 1 6 3.54 77707.18
2449 1650 8 3 2.50 2945.40
2450 5610 3 2 1.50 6574.10
2451 5605 3 4 2.63 11427.44
2452 340 5 4 4.00 2720.00
2453 751 5 536 516.50 715305.79
2454 1620 8 1 1.00 441.00
2455 5600 5 1 1.00 4995.00
2456 5590 3 2 1.50 4041.82
2457 1420 1 5 2.13 6512.10
2458 5580 5 1 0.50 2447.55
2459 5575 5 1 1.00 4895.10
2460 745 5 1 0.50 735.00
2461 1418 1 68 58.25 58721.36
2462 5565 3 5 3.13 74858.36
2463 5555 3 1 0.50 2447.54
2464 1417 1 107 90.33 232112.06
2465 5540 3 1 1.00 4895.10
2466 5533 5 1 0.50 2447.55
2467 10 5 159 147.50 280837.33
2468 11 5 34 31.38 114755.89
2469 35 5 51 46.50 39645.00
2470 120 5 322 305.50 260102.20
2471 168 5 87 86.00 83000.00
2472 330 5 320 311.75 388024.75
2473 735 5 178 171.13 180395.99
2474 1416 1 353 321.79 305531.11
2475 5495 5 1 1.00 4895.10

2476 rows × 5 columns

In [321]:
# To make sure all of your columns are stored as numbers, use the pd.to_numeric method like so.
processed_data_df = processed_data_df.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these like so.
processed_data_df = processed_data_df.dropna()
processed_data_df.head()
Out[321]:
College Group Headcount FTE Amount
0 1410 1 288 234.33 306098.35
1 875 3 1139 1047.00 2278194.14
2 1402 2 32 22.50 59932.25
3 5083 5 1 0.50 225.00
4 140 5 68 64.50 108317.00
In [322]:
Group_df = processed_data_df[[
                               'Group', 
                               'Headcount',
                                'FTE',
                               ]].copy()
Group_df.head()
Out[322]:
Group Headcount FTE
0 1 288 234.33
1 3 1139 1047.00
2 2 32 22.50
3 5 1 0.50
4 5 68 64.50
In [323]:
College_df = processed_data_df[[
                               'College', 
                               'Headcount',
                                'FTE',
                               ]].copy()
College_df.head()
Out[323]:
College Headcount FTE
0 1410 288 234.33
1 875 1139 1047.00
2 1402 32 22.50
3 5083 1 0.50
4 140 68 64.50

Group_df

In [324]:
data = Group_df

holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
In [325]:
sns.lmplot(x="Group", y="Headcount", data=training, x_estimator=np.mean, order=2)
Out[325]:
<seaborn.axisgrid.FacetGrid at 0x269018dccc0>
In [326]:
sns.lmplot(x="Group", y="FTE", data=training, x_estimator=np.mean, order=1)
Out[326]:
<seaborn.axisgrid.FacetGrid at 0x269019c25c0>
In [327]:
model = ols("Group ~ Headcount + FTE", training).fit()
#model = ols("Group ~ Headcount + income + np.power(age, 2) + np.power(income, 2)", training).fit()
model.summary()
Out[327]:
OLS Regression Results
Dep. Variable: Group R-squared: 0.118
Model: OLS Adj. R-squared: 0.118
Method: Least Squares F-statistic: 157.8
Date: Tue, 05 Dec 2017 Prob (F-statistic): 4.87e-65
Time: 01:19:18 Log-Likelihood: -4351.9
No. Observations: 2352 AIC: 8710.
Df Residuals: 2349 BIC: 8727.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.8870 0.036 137.107 0.000 4.817 4.957
Headcount -0.0467 0.004 -13.096 0.000 -0.054 -0.040
FTE 0.0471 0.004 12.317 0.000 0.040 0.055
Omnibus: 200.421 Durbin-Watson: 1.275
Prob(Omnibus): 0.000 Jarque-Bera (JB): 529.286
Skew: 0.476 Prob(JB): 1.17e-115
Kurtosis: 5.120 Cond. No. 241.
In [329]:
# Rerun with SciKitLearn because it's easy to check accuracy
features_train = training.drop("Group", axis=1).as_matrix(columns=None)
labels_train = training["Group"].as_matrix(columns=None)

features_test = holdout.drop("Group", axis=1).as_matrix(columns=None)
labels_test = holdout["Group"].as_matrix(columns=None)

lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:",lm.score(features_train,labels_train))
print("Accuracy:",accuracy)
R squared: 0.118458728474
Accuracy: 0.0836153708378

College_df

In [330]:
data = College_df

holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
In [331]:
sns.lmplot(x="College", y="Headcount", data=training, x_estimator=np.mean, order=2)
Out[331]:
<seaborn.axisgrid.FacetGrid at 0x26901b17400>
In [332]:
sns.lmplot(x="College", y="FTE", data=training, x_estimator=np.mean, order=2)
Out[332]:
<seaborn.axisgrid.FacetGrid at 0x269033e7048>
In [333]:
model = ols("College ~ Headcount + FTE", training).fit()
model.summary()
Out[333]:
OLS Regression Results
Dep. Variable: College R-squared: 0.140
Model: OLS Adj. R-squared: 0.139
Method: Least Squares F-statistic: 191.0
Date: Tue, 05 Dec 2017 Prob (F-statistic): 1.41e-77
Time: 01:21:01 Log-Likelihood: -21743.
No. Observations: 2352 AIC: 4.349e+04
Df Residuals: 2349 BIC: 4.351e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3692.6448 57.972 63.697 0.000 3578.963 3806.327
Headcount -3.4888 5.747 -0.607 0.544 -14.759 7.781
FTE -3.9510 6.167 -0.641 0.522 -16.044 8.142
Omnibus: 820.051 Durbin-Watson: 1.142
Prob(Omnibus): 0.000 Jarque-Bera (JB): 194.231
Skew: 0.461 Prob(JB): 6.66e-43
Kurtosis: 1.937 Cond. No. 239.
In [334]:
# Rerun with SciKitLearn because it's easy to check accuracy

features_train = training.drop("College", axis=1).as_matrix(columns=None)
labels_train = training["College"].as_matrix(columns=None)

features_test = holdout.drop("College", axis=1).as_matrix(columns=None)
labels_test = holdout["College"].as_matrix(columns=None)

lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:",lm.score(features_train,labels_train))
print("Accuracy:",accuracy)
R squared: 0.139865428885
Accuracy: 0.0804937604921
In [ ]: