I downloaded my data from https://archive.ics.uci.edu/ml/machine-learning-databases/00357/. The page describing the dataset is https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+, which is part of an archive of machine learning datasets. My dataset is meant for occupancy detection based on several variables: temperature, light, humidity, CO2, and HumidityRatio.
In regards to "Data Wrangling", I made sure that I only had numeric data. There is a date column in my csv file that I did not remove, but when I coerced the data, the dates became numbers. I did not use the date column in any of my models.
In regards to "Training and Cross Validation", I fit a linear regression using all of the variables in my dataset except the date. The plots and the validation results make me believe that it's not a very good model. I also attempted a logarithmic regression, but when I ran it I got errors that I was unable to solve on my own. Part of the problem may be that I do not really understand the dataset. Another issue is that I missed the previous class, where we discussed this project. I have ensured that I was able to run the code samples and complete this assignment.
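Because `Occupancy` is a 0/1 target, one option worth trying is a logistic regression scored with cross-validation. The sketch below is only an illustration under stated assumptions: the data is synthetic (the `light` and `co2` arrays and the labeling rule are made up, not taken from the real dataset), and the pipeline with `StandardScaler` is my own choice rather than anything from the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n_rows = 400
light = rng.uniform(0, 600, n_rows)    # pretend "Light" readings
co2 = rng.uniform(400, 1200, n_rows)   # pretend "CO2" readings

# Assumed toy rule: bright rooms are occupied, with about 5% of labels flipped.
occupied = (light > 300).astype(int)
flip = rng.random(n_rows) < 0.05
occupied[flip] = 1 - occupied[flip]

X = np.column_stack([light, co2])
y = occupied

# Scale the features, then fit a logistic regression inside each fold.
clf_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf_logit, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()

print("Fold accuracies:", scores)
print("Mean accuracy:", mean_accuracy)
```

With the real data, `X` and `y` would presumably come from the feature columns and the `Occupancy` column of the training table.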
In regards to a "Working Model" I believe I have a working model, but I do not believe it's a very good model.
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
import seaborn as sns
from statsmodels.formula.api import ols
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import random
# Custom functions
def evaluate(pred, labels_test):
    # Print accuracy, the confusion-matrix counts, recall, precision, and F1.
    acc = accuracy_score(labels_test, pred)
    print("Accuracy: %s" % acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()
    recall = tp / (tp + fn)     # share of actual positives that were found
    precision = tp / (tp + fp)  # share of predicted positives that were correct
    f1 = 2 / ((1 / recall) + (1 / precision))
    print("")
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp)
    print("Recall: %s" % recall)
    print("Precision: %s" % precision)
    print("F1 Score: %s" % f1)
def plot_bound(Z_val, data, col1, col2, binary):
    # Z_val equals the "Yes" value. E.g., "Y" or "1".
    # data equals df
    # col1 and col2 define which columns (by position) to use from data
    # Relies on the fitted global classifier clf.
    # Plot binary decision boundary.
    # For this, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min = float(data.iloc[:, col1].min()) - float(data.iloc[:, col1].min()) * 0.10
    x_max = float(data.iloc[:, col1].max()) + float(data.iloc[:, col1].min()) * 0.10
    y_min = 0.0
    y_max = float(data.iloc[:, col2].max()) + float(data.iloc[:, col2].max()) * 0.10
    h_x = (x_max - x_min) / 100  # step size in the mesh
    h_y = (y_max - y_min) / 100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = np.where(Z == Z_val, 1, 0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()
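As a quick sanity check on the precision and recall definitions, here is a small sketch that computes them by hand from a confusion matrix and compares the results with scikit-learn's own `precision_score` and `recall_score`. The labels are made up for illustration only.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (made up for illustration): 1 = occupied, 0 = not occupied.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("Precision by hand:", precision, "sklearn:", precision_score(y_true, y_pred))
print("Recall by hand:", recall, "sklearn:", recall_score(y_true, y_pred))
print("F1:", f1)
```

If the hand-computed values and the library values disagree, the two formulas have most likely been swapped.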
Here we load the data we collected and get it all ready to feed to our statistical model(s). That is, we are trying to make a table with one target column and one or more features. Here I'm loading dataset.csv, the occupancy detection data described at the top of this notebook. Note: you can find information on the data elements at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+.
# Load and peek at your data. Change the file name as needed.
raw_data_df = pd.read_csv('dataset.csv', parse_dates=[0])
raw_data_df.head()
# You can explore unique entries by stating the column and using .unique() like this:
print(raw_data_df["Light"].unique())
print(raw_data_df["Temperature"].unique())
# To make sure all of your columns are stored as numbers, use the pd.to_numeric method like so.
raw_data_df = raw_data_df.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these like so.
raw_data_df = raw_data_df.dropna()
raw_data_df.head()
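To make the coercion step concrete, here is a minimal sketch on a tiny made-up table. The values and the `"n/a"` entry are hypothetical; the point is only to show that `errors='coerce'` turns unparseable entries into NaN, and that dropping the date column first is an alternative to letting it be coerced.

```python
import pandas as pd

# Tiny made-up table; the readings and the "n/a" entry are illustrative only.
toy_df = pd.DataFrame({
    "date": ["2015-02-04 17:51:00", "2015-02-04 17:52:00", "2015-02-04 17:53:00"],
    "Temperature": ["23.18", "n/a", "23.10"],
    "Occupancy": [1, 1, 0],
})

# Drop the date column explicitly, then coerce everything else to numbers.
numeric_df = toy_df.drop(columns=["date"]).apply(pd.to_numeric, errors="coerce")

# "n/a" could not be parsed, so that row now holds NaN and is removed here.
clean_df = numeric_df.dropna()

print(numeric_df)
print(clean_df)
```

Checking `numeric_df.isna().sum()` before calling `dropna()` shows how many rows each column would cost.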
# I'm now going to make a set of tables to be used in training some models
# The first set will be for linear regressions where the target is numeric.
# Occupancy
occupancy_lin_df = raw_data_df[[
    'Occupancy',
    'Temperature',
    'Humidity',
    'Light',
    'CO2',
    'HumidityRatio'
]].copy()
occupancy_lin_df.head()
# The second set will be for classifiers where the target is a class.
# Occupancy
occupancy_class_df = raw_data_df[[
    'Occupancy',
    'Temperature',
    'Humidity',
    'Light',
    'CO2',
    'HumidityRatio'
]].copy()
occupancy_class_df.head()
Above I created two datasets worth exploring:
occupancy_lin_df. The data needed to assess occupancy as a continuous variable.
occupancy_class_df. The data needed to assess occupancy as a categorical variable.
Let's take them each in turn.
data = occupancy_lin_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
sns.lmplot(x="Temperature", y="Occupancy", data=training, x_estimator=np.mean, order=1)
sns.lmplot(x="Humidity", y="Occupancy", data=training, x_estimator=np.mean, order=1)
sns.lmplot(x="Light", y="Occupancy", data=training, x_estimator=np.mean, order=1)
sns.lmplot(x="CO2", y="Occupancy", data=training, x_estimator=np.mean, order=1)
sns.lmplot(x="HumidityRatio", y="Occupancy", data=training, x_estimator=np.mean, order=1)
model = ols("Occupancy ~ Temperature + Humidity + Light + CO2 + HumidityRatio", training).fit()
model.summary()
# Rerun with SciKitLearn because it's easy to check accuracy
# .as_matrix() was removed from pandas; .to_numpy() returns the same NumPy array.
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = metrics.r2_score(labels_test, pred)
print("R squared (training):", lm.score(features_train, labels_train))
print("R squared (holdout):", accuracy)
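R squared can look mediocre for a 0/1 target even when the fitted line separates the classes reasonably well. One way to see this is to threshold the linear predictions at 0.5 and score them as class labels. The sketch below uses synthetic data (the `light` values and the labeling rule are assumptions for illustration, not the real dataset), and the 0.5 cutoff is a conventional choice rather than a tuned one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic stand-in: one feature and a 0/1 target (hypothetical values).
rng = np.random.default_rng(1)
light = rng.uniform(0, 600, 500)
occupied = (light > 300).astype(int)

X = light.reshape(-1, 1)
X_train, X_test = X[:400], X[400:]
y_train, y_test = occupied[:400], occupied[400:]

lin = LinearRegression().fit(X_train, y_train)
raw_pred = lin.predict(X_test)              # continuous values, not classes
class_pred = (raw_pred >= 0.5).astype(int)  # assumed cutoff of 0.5

r2 = r2_score(y_test, raw_pred)
acc = accuracy_score(y_test, class_pred)

print("R squared on holdout:", r2)
print("Accuracy after thresholding:", acc)
```

The two numbers answer different questions: R squared measures how closely the line tracks the 0/1 values, while accuracy measures how often the thresholded prediction lands on the right class.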
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
# .as_matrix() was removed from pandas; .to_numpy() returns the same NumPy array.
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("Humidity")
plt.show()
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
# Feature columns by position: 0 Temperature, 1 Humidity, 2 Light, 3 CO2, 4 HumidityRatio.
# Use Temperature (0) and CO2 (3) so the points match the axis labels below.
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("CO2")
plt.show()
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
# Feature columns by position: 0 Temperature, 1 Humidity, 2 Light, 3 CO2, 4 HumidityRatio.
# Use Temperature (0) and Light (2) so the points match the axis labels below.
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("Light")
plt.show()
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
# Feature columns by position: 0 Temperature, 1 Humidity, 2 Light, 3 CO2, 4 HumidityRatio.
# Use Temperature (0) and HumidityRatio (4) so the points match the axis labels below.
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("HumidityRatio")
plt.show()
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
# Feature columns by position: 0 Temperature, 1 Humidity, 2 Light, 3 CO2, 4 HumidityRatio.
# Use Light (2) and Humidity (1) so the points match the axis labels below.
feature_1_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Light")
plt.ylabel("Humidity")
plt.show()
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
# Feature columns by position: 0 Temperature, 1 Humidity, 2 Light, 3 CO2, 4 HumidityRatio.
# Use Light (2) and CO2 (3) so the points match the axis labels below.
feature_1_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Light")
plt.ylabel("CO2")
plt.show()
data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
# Define the target (y) and feature(s) (X)
features_train = training.drop("Occupancy", axis=1).to_numpy()
labels_train = training["Occupancy"].to_numpy()
features_test = holdout.drop("Occupancy", axis=1).to_numpy()
labels_test = holdout["Occupancy"].to_numpy()
# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))
#### initial visualization
# Feature columns by position: 0 Temperature, 1 Humidity, 2 Light, 3 CO2, 4 HumidityRatio.
# Use Light (2) and HumidityRatio (4) so the points match the axis labels below.
feature_1_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Light")
plt.ylabel("HumidityRatio")
plt.show()
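The scatter cells above repeat the same steps for each pair of variables. A small helper that selects columns by name, rather than by position, could cut that repetition and keep the plotted data in step with the axis labels. This is only a sketch: the name `scatter_by_occupancy` is my own, and the toy table is synthetic rather than the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def scatter_by_occupancy(df, x_col, y_col, target="Occupancy"):
    """Scatter two named columns, colored by the binary target column."""
    fig, ax = plt.subplots()
    occupied = df[df[target] == 1]
    empty = df[df[target] == 0]
    ax.scatter(occupied[x_col], occupied[y_col], color="g", label="Occupied")
    ax.scatter(empty[x_col], empty[y_col], color="r", label="Not Occupied")
    # The axis labels come from the same names used to pick the data.
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend()
    return ax


# Synthetic toy table (hypothetical values) to exercise the helper.
rng = np.random.default_rng(2)
toy_df = pd.DataFrame({
    "Light": rng.uniform(0, 600, 50),
    "CO2": rng.uniform(400, 1200, 50),
})
toy_df["Occupancy"] = (toy_df["Light"] > 300).astype(int)

ax = scatter_by_occupancy(toy_df, "Light", "CO2")
```

With the real data, a call such as `scatter_by_occupancy(holdout, "Temperature", "CO2")` followed by `plt.show()` would presumably stand in for one of the cells above.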