For my project, I was interested in the correlation between the college a student attended and the likelihood of that student receiving a scholarship. I found the data on Data.gov, authored by the New York State Higher Education Services Corporation. The data provides the name of each NY college, the sector group to which that college belongs, the number of scholarship recipients, their full-time equivalence (FTE), and the total amount of scholarship funding that school received, all dating back to 2009.
At the beginning of the project, it was clear that in order to examine the correlation between the college a student attended and the likelihood of receiving a scholarship, the college data would need to be numerical in some way, as the program could not work with text. To address that, I decided to find the correlation between sector groups (groupings of colleges) and the likelihood of a scholarship, since each sector group was associated with a number.
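One general way to turn a text column into numbers (an alternative to the label-by-label mapping used later in this notebook) is pandas' categorical codes. A minimal sketch on a toy frame with made-up sector labels:

```python
import pandas as pd

# Toy frame with a text column of sector-group labels (made-up values).
df = pd.DataFrame({'Group': ['1-CUNY SR', '2-CUNY CC', '1-CUNY SR', '8-OTHER']})

# Convert the text labels to integer codes (0, 1, 2, ... in sorted label order).
df['GroupCode'] = df['Group'].astype('category').cat.codes
print(df['GroupCode'].tolist())  # → [0, 1, 0, 2]
```

One caveat with this shortcut: the codes are assigned in sorted order of the labels, so they only line up with the numbers embedded in the sector-group names if the labels happen to sort that way.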
I looked through the pandas documentation on their website for a search & replace function that would identify all instances of a sector group and replace it with its corresponding number. For example, I wanted to change every "1-CUNY CC" to simply "1". With that change, the data would keep the same information but would now work with the programming. After unfortunately not being able to get the syntax to work for the function I found in the pandas library, I made the change manually. Later in the project, I realized that the logical tests used in the example to simplify income can also serve as a find & replace function. I undid the manual changes I had made to the csv document and instead used code to select each of the nine college sector groups and replace it with a single number. To further clean up the data, I deleted several columns that I did not want to address in the statistical analysis and renamed the rest for simplicity's sake. Finally, I ran the function to ensure the data was in fact numeric and to remove any entries that weren't.
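For reference, the search & replace described here can be done in a single call with pandas' `replace`, which accepts a mapping from old values to new ones. A sketch with two of the sector-group labels:

```python
import pandas as pd

# Toy frame with a couple of the sector-group labels.
df = pd.DataFrame({'Group': ['1-CUNY SR', '2-CUNY CC', '2-CUNY CC']})

# Map each sector-group label to its number in one call.
df['Group'] = df['Group'].replace({'1-CUNY SR': 1, '2-CUNY CC': 2})
print(df['Group'].tolist())  # → [1, 2, 2]
```

Passing the full nine-entry mapping would replace all of the sector groups at once, equivalent to the nine `.loc` assignments used below.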
Next, I defined two data sets: one with Group as the independent variable and the other with College as the independent variable. The objective was to see which of the two was more accurate in predicting whether or not a student would have a scholarship. Finally, I ran linear regressions on both to produce R-Squared and Accuracy values. For the data set labeled "Group_df", the R-Squared and Accuracy values were 0.1185 and 0.0836 respectively. This means the model accounts for only 11.85% of the variance in the training data; the "Accuracy" figure, which is actually the R-Squared computed on the holdout set, shows the model explaining only 8.36% of the out-of-sample variance. Running the second data set, labeled "College_df", produced slight but interesting changes in the R-Squared and Accuracy values. The new data set's R-Squared accounted for 13.98% of the variance, a small improvement over the previous data set. However, the holdout value decreased to 0.0805, meaning the model explained only 8.05% of the out-of-sample variance.
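To make those two numbers concrete: in the code below, "R squared" comes from `lm.score(...)` on the training data (in-sample fit), while "Accuracy" is `metrics.r2_score` on the holdout set (out-of-sample fit). A toy illustration on synthetic data (the numbers here are illustrative, not from the scholarship data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: 100 rows, 2 features, a known linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.randn(100)

# Fit on the first 80 rows; hold out the last 20.
model = LinearRegression().fit(X[:80], y[:80])
print("in-sample R^2:", model.score(X[:80], y[:80]))
print("out-of-sample R^2:", r2_score(y[80:], model.predict(X[80:])))
```

When the two values diverge sharply, the model is overfitting; here both are low for the scholarship data, suggesting the features simply carry little predictive signal.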
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
import seaborn as sns
from statsmodels.formula.api import ols
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline
import random
# Custom functions
def evaluate(pred, labels_test):
    acc = accuracy_score(pred, labels_test)
    print("Accuracy: %s" % acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()
    # Recall = tp / (tp + fn); precision = tp / (tp + fp).
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 / ((1 / recall) + (1 / precision))
    print("")
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp)
    print("Recall: %s" % recall)
    print("Precision: %s" % precision)
    print("F1 Score: %s" % f1)
def plot_bound(Z_val, data, col1, col2, binary):
    # Z_val equals "Yes" value. E.g., "Y" or "1".
    # data equals df
    # col1 and col2 define which columns to use from data
    # Plot binary decision boundary.
    # For this, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min = float(data.iloc[:, [col1]].min()) - float(data.iloc[:, [col1]].min()) * 0.10
    x_max = float(data.iloc[:, [col1]].max()) + float(data.iloc[:, [col1]].max()) * 0.10
    y_min = 0.0
    y_max = float(data.iloc[:, [col2]].max()) + float(data.iloc[:, [col2]].max()) * 0.10
    h_x = (x_max - x_min) / 100  # step size in the mesh
    h_y = (y_max - y_min) / 100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = np.where(Z == Z_val, 1, 0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()
Here we load the data we collected and get it all ready to feed to our statistical model(s). That is, we are trying to make a table with one target column and one or more features. Here I'm loading scholarships.csv, the New York State Higher Education Services Corporation scholarship data from Data.gov described above.
# Load and peek at your data. Change the file name as needed.
raw_data_df = pd.read_csv('scholarships.csv', parse_dates=[0])
raw_data_df.head()
# for multiple columns
processed_data_df = raw_data_df.drop([
    'Federal School Code',
    'TAP College Name',
    'Academic Year',
], axis=1)
processed_data_df.head()
# You can rename columns like so.
processed_data_df = processed_data_df.rename(columns={
    'TAP College Code': 'College',
    'TAP Sector Group': 'Group',
    'Scholarship Headcount': 'Headcount',
    'Scholarship Dollars': 'Amount',
    'Scholarship FTE': 'FTE',
})
processed_data_df.head()
processed_data_df.loc[processed_data_df['Group'] == '1-CUNY SR', 'Group'] = 1
processed_data_df.loc[processed_data_df['Group'] == '2-CUNY CC', 'Group'] = 2
processed_data_df.loc[processed_data_df['Group'] == '3-SUNY SO', 'Group'] = 3
processed_data_df.loc[processed_data_df['Group'] == '4-SUNY CC', 'Group'] = 4
processed_data_df.loc[processed_data_df['Group'] == '5-INDEPENDENT', 'Group'] = 5
processed_data_df.loc[processed_data_df['Group'] == '6-BUS. DEGREE', 'Group'] = 6
processed_data_df.loc[processed_data_df['Group'] == '7-BUS. NON-DEG', 'Group'] = 7
processed_data_df.loc[processed_data_df['Group'] == '8-OTHER', 'Group'] = 8
processed_data_df.loc[processed_data_df['Group'] == 'VOCATIONAL - VET SCHOOLS ONLY', 'Group'] = 9
processed_data_df
# To make sure all of your columns are stored as numbers, use the pd.to_numeric method like so.
processed_data_df = processed_data_df.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these like so.
processed_data_df = processed_data_df.dropna()
processed_data_df.head()
Group_df = processed_data_df[[
'Group',
'Headcount',
'FTE',
]].copy()
Group_df.head()
College_df = processed_data_df[[
'College',
'Headcount',
'FTE',
]].copy()
College_df.head()
data = Group_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
sns.lmplot(x="Group", y="Headcount", data=training, x_estimator=np.mean, order=2)
sns.lmplot(x="Group", y="FTE", data=training, x_estimator=np.mean, order=1)
model = ols("Group ~ Headcount + FTE", training).fit()
model.summary()
# Rerun with SciKitLearn because it's easy to check accuracy
features_train = training.drop("Group", axis=1).values
labels_train = training["Group"].values
features_test = holdout.drop("Group", axis=1).values
labels_test = holdout["Group"].values
lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
# Note: this "accuracy" is the R-squared on the holdout set.
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:", lm.score(features_train, labels_train))
print("Accuracy:", accuracy)
data = College_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
sns.lmplot(x="College", y="Headcount", data=training, x_estimator=np.mean, order=2)
sns.lmplot(x="College", y="FTE", data=training, x_estimator=np.mean, order=2)
model = ols("College ~ Headcount + FTE", training).fit()
model.summary()
# Rerun with SciKitLearn because it's easy to check accuracy
features_train = training.drop("College", axis=1).values
labels_train = training["College"].values
features_test = holdout.drop("College", axis=1).values
labels_test = holdout["College"].values
lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
# Note: this "accuracy" is the R-squared on the holdout set.
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:", lm.score(features_train, labels_train))
print("Accuracy:", accuracy)