For my project, I was interested in the correlation between the college a student attended and the likelihood of that student receiving a scholarship. I found the data on Data.gov, authored by the New York State Higher Education Services Corporation. The data provides the name of each NY college, the sector group to which that college belongs, the number of scholarship recipients, their full-time equivalence (FTE), and the total amount of scholarship funding that school received, all dating back to 2009.
At the beginning of the project, it was clear that in order to examine the correlation between the college a student attended and the likelihood of receiving a scholarship, the college data would need to be numerical in some way, as the program could not work with text. To address that, I decided to find the correlation between sector groups (groupings of colleges) and the likelihood of a scholarship, since each sector group was associated with a number.
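One general way to turn a text column into numbers (an alternative to the label-by-label mapping used later in this notebook) is pandas' categorical codes. A minimal sketch on a toy frame with made-up sector labels:

```python
import pandas as pd

# Toy frame with a text column of sector-group labels (made-up values).
df = pd.DataFrame({'Group': ['1-CUNY SR', '2-CUNY CC', '1-CUNY SR', '8-OTHER']})

# Convert the text labels to integer codes (0, 1, 2, ... in sorted label order).
df['GroupCode'] = df['Group'].astype('category').cat.codes
print(df['GroupCode'].tolist())  # → [0, 1, 0, 2]
```

One caveat with this shortcut: the codes are assigned in sorted order of the labels, so they only line up with the numbers embedded in the sector-group names if the labels happen to sort that way.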
I looked through the pandas documentation on their website for a search & replace function that would identify all instances of a sector group and replace it with its corresponding number. For example, I wanted to change every "1-CUNY CC" to simply "1". With that change, the data would keep the same information but would now work with the programming. After unfortunately not being able to get the syntax to work for the function I found in the pandas library, I made the change manually. Later in the project, I realized that the logical tests used in the example to simplify income can also serve as a find & replace function. I undid the manual changes I had made to the csv document and instead used code to select each of the nine college sector groups and replace it with a single number. To further clean up the data, I deleted several columns that I did not want to address in the statistical analysis and renamed the rest for simplicity's sake. Finally, I ran the function to ensure the data was in fact numeric and to remove any entries that weren't.
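For reference, the search & replace described here can be done in a single call with pandas' `replace`, which accepts a mapping from old values to new ones. A sketch with two of the sector-group labels:

```python
import pandas as pd

# Toy frame with a couple of the sector-group labels.
df = pd.DataFrame({'Group': ['1-CUNY SR', '2-CUNY CC', '2-CUNY CC']})

# Map each sector-group label to its number in one call.
df['Group'] = df['Group'].replace({'1-CUNY SR': 1, '2-CUNY CC': 2})
print(df['Group'].tolist())  # → [1, 2, 2]
```

Passing the full nine-entry mapping would replace all of the sector groups at once, equivalent to the nine `.loc` assignments used below.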
Next, I defined two data sets: one with Group as the independent variable and the other with College as the independent variable. The objective was to see which of the two was more accurate in predicting whether or not a student would have a scholarship. Finally, I ran linear regressions on both to produce R-Squared and Accuracy values. For the data set labeled "Group_df", the R-Squared and Accuracy values were 0.1185 and 0.0836 respectively. This means the model accounts for only 11.85% of the variance in the training data; the "Accuracy" figure, which is actually the R-Squared computed on the holdout set, shows the model explaining only 8.36% of the out-of-sample variance. Running the second data set, labeled "College_df", produced slight but interesting changes in the R-Squared and Accuracy values. The new data set's R-Squared accounted for 13.98% of the variance, a small improvement over the previous data set. However, the holdout value decreased to 0.0805, meaning the model explained only 8.05% of the out-of-sample variance.
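To make those two numbers concrete: in the code below, "R squared" comes from `lm.score(...)` on the training data (in-sample fit), while "Accuracy" is `metrics.r2_score` on the holdout set (out-of-sample fit). A toy illustration on synthetic data (the numbers here are illustrative, not from the scholarship data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: 100 rows, 2 features, a known linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.randn(100)

# Fit on the first 80 rows; hold out the last 20.
model = LinearRegression().fit(X[:80], y[:80])
print("in-sample R^2:", model.score(X[:80], y[:80]))
print("out-of-sample R^2:", r2_score(y[80:], model.predict(X[80:])))
```

When the two values diverge sharply, the model is overfitting; here both are low for the scholarship data, suggesting the features simply carry little predictive signal.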
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
import seaborn as sns
from statsmodels.formula.api import ols
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline
import random
# Custom functions
def evaluate(pred, labels_test):
    acc = accuracy_score(pred, labels_test)
    print("Accuracy: %s" % acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()
    # Recall = tp / (tp + fn); precision = tp / (tp + fp).
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 / ((1 / recall) + (1 / precision))
    print("")
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp)
    print("Recall: %s" % recall)
    print("Precision: %s" % precision)
    print("F1 Score: %s" % f1)
def plot_bound(Z_val, data, col1, col2, binary):
    # Z_val equals "Yes" value. E.g., "Y" or "1".
    # data equals df
    # col1 and col2 define which columns to use from data
    # Plot binary decision boundary.
    # For this, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min = float(data.iloc[:, [col1]].min()) - float(data.iloc[:, [col1]].min()) * 0.10
    x_max = float(data.iloc[:, [col1]].max()) + float(data.iloc[:, [col1]].max()) * 0.10
    y_min = 0.0
    y_max = float(data.iloc[:, [col2]].max()) + float(data.iloc[:, [col2]].max()) * 0.10
    h_x = (x_max - x_min) / 100  # step size in the mesh
    h_y = (y_max - y_min) / 100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = np.where(Z == Z_val, 1, 0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()
Here we load the data we collected and get it all ready to feed to our statistical model(s). That is, we are trying to make a table with one target column and one or more features. Here I'm loading scholarships.csv, the New York State Higher Education Services Corporation scholarship data from Data.gov described above.
# Load and peek at your data. Change the file name as needed.
raw_data_df = pd.read_csv('scholarships.csv', parse_dates=[0])
raw_data_df.head()
# for multiple columns
processed_data_df = raw_data_df.drop([
    'Federal School Code',
    'TAP College Name',
    'Academic Year',
], axis=1)
processed_data_df.head()
# You can rename columns like so.
processed_data_df = processed_data_df.rename(columns={
    'TAP College Code': 'College',
    'TAP Sector Group': 'Group',
    'Scholarship Headcount': 'Headcount',
    'Scholarship Dollars': 'Amount',
    'Scholarship FTE': 'FTE',
})
processed_data_df.head()
processed_data_df.loc[processed_data_df['Group'] == '1-CUNY SR', 'Group'] = 1
processed_data_df.loc[processed_data_df['Group'] == '2-CUNY CC', 'Group'] = 2
processed_data_df.loc[processed_data_df['Group'] == '3-SUNY SO', 'Group'] = 3
processed_data_df.loc[processed_data_df['Group'] == '4-SUNY CC', 'Group'] = 4
processed_data_df.loc[processed_data_df['Group'] == '5-INDEPENDENT', 'Group'] = 5
processed_data_df.loc[processed_data_df['Group'] == '6-BUS. DEGREE', 'Group'] = 6
processed_data_df.loc[processed_data_df['Group'] == '7-BUS. NON-DEG', 'Group'] = 7
processed_data_df.loc[processed_data_df['Group'] == '8-OTHER', 'Group'] = 8
processed_data_df.loc[processed_data_df['Group'] == 'VOCATIONAL - VET SCHOOLS ONLY', 'Group'] = 9
processed_data_df
# To make sure all of your columns are stored as numbers, use the pd.to_numeric method like so.
processed_data_df = processed_data_df.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these like so.
processed_data_df = processed_data_df.dropna()
processed_data_df.head()
Group_df = processed_data_df[[
'Group',
'Headcount',
'FTE',
]].copy()
Group_df.head()
College_df = processed_data_df[[
'College',
'Headcount',
'FTE',
]].copy()
College_df.head()
data = Group_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
sns.lmplot(x="Group", y="Headcount", data=training, x_estimator=np.mean, order=2)
sns.lmplot(x="Group", y="FTE", data=training, x_estimator=np.mean, order=1)
model = ols("Group ~ Headcount + FTE", training).fit()
model.summary()
# Rerun with SciKitLearn because it's easy to check accuracy
features_train = training.drop("Group", axis=1).values
labels_train = training["Group"].values
features_test = holdout.drop("Group", axis=1).values
labels_test = holdout["Group"].values
lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
# Note: this "accuracy" is the R-squared on the holdout set.
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:", lm.score(features_train, labels_train))
print("Accuracy:", accuracy)
data = College_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
sns.lmplot(x="College", y="Headcount", data=training, x_estimator=np.mean, order=2)
sns.lmplot(x="College", y="FTE", data=training, x_estimator=np.mean, order=2)
model = ols("College ~ Headcount + FTE", training).fit()
model.summary()
# Rerun with SciKitLearn because it's easy to check accuracy
features_train = training.drop("College", axis=1).values
labels_train = training["College"].values
features_test = holdout.drop("College", axis=1).values
labels_test = holdout["College"].values
lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
# Note: this "accuracy" is the R-squared on the holdout set.
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:", lm.score(features_train, labels_train))
print("Accuracy:", accuracy)