Project Three Notebook - Anyone Home?

I downloaded my data from https://archive.ics.uci.edu/ml/machine-learning-databases/00357/. The dataset's page is https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+, which is part of the UCI archive of machine learning datasets. My dataset is for occupancy detection: predicting occupancy from temperature, light, humidity, CO2, and HumidityRatio.

For "Data Wrangling," I made sure the data was entirely numeric. My CSV file has a date column that I did not remove; when I coerced the data, the dates became numbers. I did not use the date column in any of my models.
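
To make the date behavior concrete, here is a small sketch of what `pd.to_numeric` does to a parsed datetime column, and one way to leave the date out before coercing. The two-row frame is made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the dataset (values are illustrative).
df = pd.DataFrame({
    "date": pd.to_datetime(["2015-02-02 14:19:00", "2015-02-02 14:21:00"]),
    "Temperature": [23.7, 23.73],
    "Occupancy": [1, 1],
})

# Coercing a datetime column yields integer timestamps counted from the Unix epoch
# (nanoseconds in the output shown in this notebook).
coerced = df.apply(pd.to_numeric, errors="coerce")
print(coerced["date"].iloc[0])

# Alternative: drop the date first so only sensor columns reach the models.
numeric_only = df.drop(columns=["date"]).apply(pd.to_numeric, errors="coerce").dropna()
print(list(numeric_only.columns))
```

Dropping the column up front is optional here, since the model tables built later select their columns by name and never include the date.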

For "Training and Cross Validation," I fit a linear regression on all of the variables in my dataset except the date. The plots and the validation results make me believe that it's not a very good model. I also attempted a logistic regression, but when I ran it I got errors that I was unable to solve on my own. Part of the problem may be that I do not really understand the dataset. Another issue is that I missed the previous class where we discussed this project. I did make sure that I could run the code samples and complete this assignment.
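
Since the target is a 0/1 value, a classifier is a natural thing to try alongside the linear regression. The following is a minimal sketch of a logistic regression on synthetic stand-in data, not the code that produced the errors mentioned above. The generated `light` and `co2` values, and the rule tying occupancy to light, are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)

# Synthetic stand-in for two sensor columns (assumed ranges, not real data).
n = 400
light = rng.uniform(0, 800, n)
co2 = rng.uniform(400, 1200, n)
# Assumed rule for illustration: the room counts as occupied when the light is bright.
occupancy = (light > 400).astype(int)

X = np.column_stack([light, co2])

# Use the first 80% of rows for training and the rest as a holdout.
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = occupancy[:split], occupancy[split:]

# Standardize with training statistics so the solver handles the different scales.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

clf = LogisticRegression()
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print("Holdout accuracy:", acc)
```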

For a "Working Model," I believe I have a working model, but I do not believe it's a very good one.

import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
 
import seaborn as sns
from statsmodels.formula.api import ols

from sklearn import linear_model
from sklearn import metrics

from sklearn.linear_model import LogisticRegression
from patsy import dmatrices

import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import random



# Custom functions

def evaluate(pred, labels_test):
    # accuracy_score expects (y_true, y_pred)
    acc = accuracy_score(labels_test, pred)
    print ("Accuracy: %s"%acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()

    # Precision: share of predicted positives that are correct.
    # Recall: share of actual positives that were found.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = (2 / ((1/recall)+(1/precision)))

    print ("")
    print ("True Negatives: %s"%tn)
    print ("False Positives: %s"%fp)
    print ("False Negatives: %s"%fn)
    print ("True Positives: %s"%tp)
    print ("Recall: %s"%recall)
    print ("Precision: %s"%precision)
    print ("F1 Score: %s"%f1)
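
For reference, the standard definitions are: precision is the share of predicted positives that are correct, tp / (tp + fp), and recall is the share of actual positives that were found, tp / (tp + fn). F1 is their harmonic mean. A small worked example with made-up counts:

```python
# Made-up confusion-matrix counts for illustration.
tn, fp, fn, tp = 50, 10, 5, 35

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1 Score:", round(f1, 3))
```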

def plot_bound(Z_val,data,col1,col2,binary):
    # Z_val equals the "Yes" value. E.g., "Y" or 1.
    # data equals df
    # col1 and col2 define which columns (by position) to use from data
    # binary == 1 plots hard class predictions; otherwise class-1 probabilities
    # Relies on a global classifier `clf` already fit on these two columns.
    # Plot binary decision boundary.
    # For this, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].

    x_pad = float(data.iloc[:, col1].min())*0.10
    x_min = float(data.iloc[:, col1].min()) - x_pad
    x_max = float(data.iloc[:, col1].max()) + x_pad
    y_min = 0.0
    y_max = float(data.iloc[:, col2].max())*1.10
    h_x = (x_max-x_min)/100  # step size in the mesh
    h_y = (y_max-y_min)/100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = np.where(Z==Z_val,1,0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()
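
`plot_bound` assumes a classifier named `clf` that was fit on exactly two features, since each mesh point has only two coordinates. Below is a self-contained sketch of the mesh-and-predict step on made-up data. The data, the grid size, and the choice of `LogisticRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)

# Made-up two-feature data: class 1 when the sum of the two features is positive.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

# Build a 50 x 50 grid covering the range of the data.
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
ys = np.linspace(X[:, 1].min(), X[:, 1].max(), 50)
xx, yy = np.meshgrid(xs, ys)

# Predict the probability of class 1 at every grid point,
# then reshape the flat result back into the grid layout.
grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
print(Z.shape)
```

The resulting `Z` is what a call such as `plt.pcolormesh(xx, yy, Z)` would color.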

Data Cleaning

Here we load the data we collected and get it all ready to feed to our statistical model(s). That is, we are trying to make a table with one target column and one or more features. Here I'm loading dataset.csv, the occupancy data from https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+ Note: you can find information on the data elements at this link.

# Load and peek at your data. Change the file name as needed. 
raw_data_df = pd.read_csv('dataset.csv', parse_dates=[0]) 
raw_data_df.head()

# You can explore unique entries by stating the column and using .unique() like this:
print(raw_data_df["Light"].unique())
print(raw_data_df["Temperature"].unique())

[  585.2          578.4          572.6666667    493.75         488.6
   568.6666667    536.3333333    509.           476.           510.           481.5
   481.8          475.25         469.           464.           455.           454.
   458.           473.           498.4          530.2          533.6
   524.25         498.6666667    516.3333333    501.2          522.           520.5
   505.2857143    488.3333333    512.           511.           501.5
   503.6666667    483.1666667    483.5          470.3333333    480.1428571
   474.4          470.           445.           439.           444.25
   454.75         455.8          451.6          457.5          471.5
   475.6666667    467.75         466.5          467.           459.           449.
   462.3333333    456.5          457.           450.25         446.           448.
   444.           435.           432.           429.           432.75
   436.5          438.           435.4285714    434.           441.5          441.
   440.25         433.5          436.2          442.5          443.4
   439.8          435.5          430.75         440.6          428.2
   432.2          431.75         429.2          433.           428.25
   425.4          426.           430.2          428.3333333    424.6
   422.5          423.6666667    427.4          421.8          419.           416.2
   418.6          415.           310.25           0.           217.2
   413.6666667    412.2          413.5          403.5          397.4
   399.5          398.3333333    403.           407.6666667    413.           414.
   407.4          407.           405.           412.           416.6666667
   440.           434.5          446.6666667    449.3333333    454.6666667
   453.           449.5          460.           442.           463.
   457.75         461.           460.25         479.           474.
   472.3333333    477.75         484.           483.           489.           482.
   480.6          495.25         484.8333333    499.           490.6          502.
   493.           495.6666667    499.4          500.6          506.4285714
   503.           518.           513.           511.3333333    518.2          519.
   528.           532.8          534.25         523.           526.           538.
   532.25         547.5          544.2          535.           534.2
   533.6666667    547.           553.4          550.5          543.6666667
   550.75         555.           557.6666667    556.4          546.5          543.
   544.25         548.           535.7142857    549.3333333    538.6
   548.6          548.5          556.6666667    562.           555.6          558.
   563.4          570.           565.8333333    553.           561.8333333
   562.4          560.6          559.6666667    569.8333333    581.           593.
   601.           600.4          616.           603.6          587.5          575.
   579.2          584.8          609.6666667    625.6666667    629.6666667
   617.           624.8          641.3333333    659.4          644.6          653.
   659.           660.           668.5          652.5          652.75
   654.2          656.1666667    650.75         663.75         651.
   648.3333333    657.8          668.           663.6666667    664.6
   662.3333333    647.8          651.6666667    650.           653.75
   656.           655.           653.1666667    659.75         338.3333333
   197.1666667    184.           173.2          193.           178.           186.
   173.5          174.5          169.6666667    168.5          171.           176.
   177.25         268.7142857    538.75         627.6          638.           634.
   629.           597.75         561.5          598.4          609.6          587.
   596.8          595.           608.3333333    611.6          610.2
   616.6          609.           599.           594.8          585.75
   592.75         586.5          585.           576.75         566.           560.
   569.3333333    561.           558.3          558.8          546.           542.8
   544.8          541.4          538.4          534.5          538.8          531.
   526.8333333    526.75         519.6666667    505.8          503.5
   496.2          516.75         503.7142857    498.           481.6
   483.4          485.           478.           478.75         465.25
   466.           465.6666667    465.           460.4285714    456.1428571
   456.8          444.75         441.2857143    447.           441.4
   438.6          443.5          429.5          415.5           56.33333333
   311.75         409.8          414.6          407.5          406.6
   406.2          411.           408.           413.4          437.6666667
   447.2          459.25         470.2          475.75         471.
   480.25         487.           503.6          509.5          527.25
   524.5          529.8          534.           544.           542.5
   559.2          567.8          554.           563.           564.5
   577.5          578.8          581.6666667    580.           596.           603.4
   608.25         606.6666667    612.           611.           624.75
   630.4          632.5          652.           643.5          648.6666667
   767.          1419.5         1697.25        1209.8          685.75
   684.75         679.4          683.           692.           693.8          689.
   693.75         686.5          698.25         694.8333333    690.           696.8
   710.           722.           732.           735.           719.2
   723.3333333    729.           738.           734.4          747.           745.2
   746.3333333    737.8          745.           740.5          754.           763.
   758.25         755.5          772.75         774.5          769.8
   758.3333333    775.75         775.           783.4          789.6
   797.5          799.           782.5          786.1666667    794.25
   787.           805.           798.           793.           801.4          808.
   809.8          817.           813.        ]
[ 23.7         23.718       23.73        23.7225      23.754       23.76
  23.736       23.745       23.6         23.64        23.65        23.625
  23.61666667  23.66666667  23.575       23.54        23.525       23.5
  23.445       23.39        23.37        23.37333333  23.32333333  23.365
  23.29        23.30666667  23.272       23.2         23.23        23.215
  23.14        23.15        23.18        23.11666667  23.1         23.06
  23.02        23.01        23.          22.98166667  22.945       22.956
  22.9725      22.912       22.89        22.865       22.87333333  22.815
  22.84        22.79        22.772       22.76        22.7         22.7225
  22.68        22.64        22.6         22.66666667  22.62        22.625
  22.65        22.56        22.58        22.575       22.55        22.56666667
  22.525       22.54        22.5         22.52        22.478       22.456
  22.4175      22.39        22.35        22.32333333  22.365       22.315
  22.29        22.26        22.254       22.23        22.2         22.13333333
  22.1         22.15        22.06666667  22.          21.978       21.945
  21.92666667  21.89        21.815       21.82333333  21.79        21.7675
  21.754       21.7         21.64        21.6         21.5         21.55
  21.4725      21.39        21.42666667  21.365       21.32333333  21.34
  21.315       21.35666667  21.29        21.254       21.23        21.218
  21.2         21.175       21.16666667  21.125       21.1         21.13333333
  21.075       21.03333333  21.02        21.025       21.          20.9725
  20.96333333  20.89        20.9175      20.912       20.92666667  20.934
  20.945       20.85666667  20.865       20.87        20.85        20.81
  20.79        20.82333333  20.772       20.73        20.745       20.76
  20.7225      20.7675      20.7         20.718       20.736       20.675
  20.66666667  20.625       20.65        20.64        20.63333333  20.6
  20.575       20.58        20.55        20.56666667  20.525       20.53333333
  20.56        20.52        20.5         20.46333333  20.478       20.456
  20.42666667  20.4175      20.4725      20.445       20.434       20.39
  20.35        20.365       20.34        20.315       20.35666667  20.29
  20.32333333  20.272       20.26        20.2675      20.37        20.245
  20.236       20.2         20.218       20.2225      20.31        20.33
  20.754       20.775       20.815       20.84        20.83        20.87333333
  20.978       21.15        21.18333333  21.236       21.2225      21.245
  21.26        21.30666667  21.31        21.35        21.4175      21.412
  21.456       21.48166667  21.54        21.625       21.66        21.68333333
  21.65        21.718       21.76        21.73        21.775       21.77714286
  21.84        21.87        21.912       21.90833333  21.934       21.98166667
  22.025       22.06        22.05        22.08        22.08333333
  22.11666667  22.16666667  22.18        22.16        22.175       22.18333333
  22.2225      22.245       22.272       22.30666667  22.33        22.434
  22.42666667  22.675       22.745       22.87        22.90833333
  22.96333333  23.025       23.05        23.2225      23.26        23.2675
  23.245       23.254       23.18571429  23.18333333  23.175       23.218
  23.33        23.35        23.315       23.31857143  23.30428571  23.236
  23.125       23.08571429  23.06666667  23.01666667  22.96857143  22.9175
  22.92666667  22.93714286  22.934       22.7675      22.73        22.715
  22.736       22.718       22.66        22.445       22.35666667  22.2675
  22.125       21.9175      21.96333333  21.81        21.7225      21.675
  21.63333333  21.58        21.56        21.53333333  21.16        21.06
  21.06666667  21.08        21.05        20.68        20.66        20.62
  20.61666667  20.54        20.48166667  20.71285714  20.956       21.08333333
  21.11666667  21.215       21.272       21.865       22.08571429  22.14
  22.82333333  23.31        23.456       23.58        23.79        23.89
  23.98428571  24.          24.05        24.08        24.1         24.175
  24.2         24.218       24.26        24.29        24.33        24.35666667
  24.40833333]

# To make sure all of your columns are stored as numbers, use the pd.to_numeric method like so.
raw_data_df = raw_data_df.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these like so.
raw_data_df = raw_data_df.dropna()
raw_data_df.head()

# I'm now going to make a set of tables to be used in training some models
# The first set will be for linear regressions where the target is numeric.
# Occupancy
occupancy_lin_df = raw_data_df[[
                               'Occupancy',
                               'Temperature',
                               'Humidity',
                               'Light',
                               'CO2',
                               'HumidityRatio'
                               ]].copy()
occupancy_lin_df.head()

# The second set will be for classifiers where the target is a class.
# Occupancy
occupancy_class_df = raw_data_df[[
                               'Occupancy',
                               'Temperature',
                               'Humidity',
                               'Light',
                               'CO2',
                               'HumidityRatio'
                               ]].copy()
occupancy_class_df.head()

Training and Validation

Above I created two datasets worth exploring:

occupancy_lin_df. The data needed to assess occupancy as a continuous variable.
occupancy_class_df. The data needed to assess occupancy as a categorical variable.

Let's take them each in turn.

occupancy_lin_df

data = occupancy_lin_df

holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
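
Because `sample` draws a different holdout every time the cell runs, the scores below will shift from run to run. Here is a sketch of the same kind of split with a fixed seed, on a made-up frame. The `random_state` value is an arbitrary choice, and the frame only stands in for `occupancy_lin_df`.

```python
import pandas as pd

# Made-up frame standing in for occupancy_lin_df.
data = pd.DataFrame({
    "Occupancy": [0, 1] * 50,
    "Light": range(100),
})

# Same 5% holdout as above, but seeded so the split is repeatable.
holdout = data.sample(frac=0.05, random_state=42)
# Everything not in the holdout becomes the training set.
training = data.drop(holdout.index)

print(len(holdout), len(training))
```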

sns.lmplot(x="Temperature", y="Occupancy", data=training, x_estimator=np.mean, order=1)

<seaborn.axisgrid.FacetGrid at 0xf88e30>

sns.lmplot(x="Humidity", y="Occupancy", data=training, x_estimator=np.mean, order=1)

<seaborn.axisgrid.FacetGrid at 0x6ff7ab0>

sns.lmplot(x="Light", y="Occupancy", data=training, x_estimator=np.mean, order=1)

<seaborn.axisgrid.FacetGrid at 0xdd8d3d0>

sns.lmplot(x="CO2", y="Occupancy", data=training, x_estimator=np.mean, order=1)

<seaborn.axisgrid.FacetGrid at 0x7d9e190>

sns.lmplot(x="HumidityRatio", y="Occupancy", data=training, x_estimator=np.mean, order=1)

<seaborn.axisgrid.FacetGrid at 0xde3a410>

model = ols("Occupancy ~ Temperature + Humidity + Light + CO2 + HumidityRatio", training).fit()
#model = ols("happy ~ age + income + np.power(age, 2) + np.power(income, 2)", training).fit()
model.summary()

# Rerun with SciKitLearn because it's easy to check accuracy
# .values replaces .as_matrix(), which newer versions of pandas no longer provide
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
# R squared on the holdout rows (printed below as "Accuracy")
accuracy = metrics.r2_score(labels_test, pred)
# R squared on the training rows
print("R squared:",lm.score(features_train,labels_train))
print("Accuracy:",accuracy)

R squared: 0.889383481429
Accuracy: 0.836508277887
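
Both printed numbers are R squared values: the first is measured on the training rows and the second on the holdout rows, so the second is not a classification accuracy. Below is a sketch of how R squared is computed, and of how a regression output could be thresholded to get an actual accuracy for a 0/1 target. The arrays and the 0.5 cutoff are assumptions for illustration.

```python
import numpy as np

# Made-up 0/1 labels and regression outputs for illustration.
labels = np.array([0, 0, 1, 1, 1, 0])
pred = np.array([0.1, 0.4, 0.8, 0.6, 0.3, -0.1])

# R squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((labels - pred) ** 2)
ss_tot = np.sum((labels - labels.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Accuracy: turn each regression output into a class using an assumed 0.5 cutoff.
pred_class = (pred >= 0.5).astype(int)
accuracy = np.mean(pred_class == labels)

print("R squared:", r2)
print("Accuracy:", accuracy)
```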

occupancy_class_df

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 0 and 1 are Temperature and Humidity, matching the axis labels below
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("Humidity")
plt.show()

Percentage of Ys: 0.3647279549718574
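
The scatter cells in this section differ only in which two features they are meant to plot. Below is a sketch of a helper that selects features by column name, which avoids relying on positional indices. The function name `scatter_by_class` and the made-up frame are hypothetical and not part of the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def scatter_by_class(df, x_col, y_col, target="Occupancy"):
    """Scatter x_col against y_col, colored by the 0/1 target column."""
    occupied = df[df[target] == 1]
    empty = df[df[target] == 0]
    fig, ax = plt.subplots()
    ax.scatter(occupied[x_col], occupied[y_col], color="g", label="Occupied")
    ax.scatter(empty[x_col], empty[y_col], color="r", label="Not Occupied")
    # Axis labels come from the same names used to select the data.
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend()
    return ax

# Made-up rows for illustration.
demo = pd.DataFrame({
    "Occupancy": [1, 1, 0, 0],
    "Temperature": [23.7, 23.5, 20.9, 21.1],
    "CO2": [750.0, 760.0, 440.0, 450.0],
})
ax = scatter_by_class(demo, "Temperature", "CO2")
```

Inside the notebook, the `matplotlib.use("Agg")` line would be dropped, and a call such as `scatter_by_class(holdout, "Light", "CO2")` followed by `plt.show()` would stand in for one of the cells.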

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 0 and 3 are Temperature and CO2, matching the axis labels below
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("CO2")
plt.show()

Percentage of Ys: 0.3647279549718574

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 0 and 2 are Temperature and Light, matching the axis labels below
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("Light")
plt.show()

Percentage of Ys: 0.3647279549718574

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 0 and 4 are Temperature and HumidityRatio, matching the axis labels below
feature_1_no = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Temperature")
plt.ylabel("HumidityRatio")
plt.show()

Percentage of Ys: 0.3647279549718574

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 2 and 1 are Light and Humidity, matching the axis labels below
feature_1_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Light")
plt.ylabel("Humidity")
plt.show()

Percentage of Ys: 0.3647279549718574

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 2 and 3 are Light and CO2, matching the axis labels below
feature_1_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][3] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Light")
plt.ylabel("CO2")
plt.show()

Percentage of Ys: 0.3647279549718574

data = occupancy_class_df
holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
# Feature column order: Temperature (0), Humidity (1), Light (2), CO2 (3), HumidityRatio (4)
features_train = training.drop("Occupancy", axis=1).values
labels_train = training["Occupancy"].values

features_test = holdout.drop("Occupancy", axis=1).values
labels_test = holdout["Occupancy"].values

# What percentage of the time is target Y?
print("Percentage of Ys: %s\n"%(len(data[data["Occupancy"]== 1 ])/len(data)))

#### initial visualization
# Columns 2 and 4 are Light and HumidityRatio, matching the axis labels below
feature_1_no = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_2_no = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 0]
feature_1_yes = [features_test[ii][2] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
feature_2_yes = [features_test[ii][4] for ii in range(0, len(features_test)) if labels_test[ii]== 1]
plt.scatter(feature_1_yes, feature_2_yes, color = "g", label="Occupied")
plt.scatter(feature_1_no, feature_2_no, color = "r", label="Not Occupied")
plt.legend()
plt.xlabel("Light")
plt.ylabel("HumidityRatio")
plt.show()

Percentage of Ys: 0.3647279549718574

	date	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
0	2015-02-02 14:19:00	23.7000	26.272	585.200000	749.200000	0.004764	1
1	2015-02-02 14:19:00	23.7180	26.290	578.400000	760.400000	0.004773	1
2	2015-02-02 14:21:00	23.7300	26.230	572.666667	769.666667	0.004765	1
3	2015-02-02 14:22:00	23.7225	26.125	493.750000	774.750000	0.004744	1
4	2015-02-02 14:23:00	23.7540	26.200	488.600000	779.000000	0.004767	1

	date	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
0	1422886740000000000	23.7000	26.272	585.200000	749.200000	0.004764	1
1	1422886740000000000	23.7180	26.290	578.400000	760.400000	0.004773	1
2	1422886860000000000	23.7300	26.230	572.666667	769.666667	0.004765	1
3	1422886920000000000	23.7225	26.125	493.750000	774.750000	0.004744	1
4	1422886980000000000	23.7540	26.200	488.600000	779.000000	0.004767	1

	Occupancy	Temperature	Humidity	Light	CO2	HumidityRatio
0	1	23.7000	26.272	585.200000	749.200000	0.004764
1	1	23.7180	26.290	578.400000	760.400000	0.004773
2	1	23.7300	26.230	572.666667	769.666667	0.004765
3	1	23.7225	26.125	493.750000	774.750000	0.004744
4	1	23.7540	26.200	488.600000	779.000000	0.004767

	Occupancy	Temperature	Humidity	Light	CO2	HumidityRatio
0	1	23.7000	26.272	585.200000	749.200000	0.004764
1	1	23.7180	26.290	578.400000	760.400000	0.004773
2	1	23.7300	26.230	572.666667	769.666667	0.004765
3	1	23.7225	26.125	493.750000	774.750000	0.004744
4	1	23.7540	26.200	488.600000	779.000000	0.004767

Dep. Variable:	Occupancy	R-squared:	0.889
Model:	OLS	Adj. R-squared:	0.889
Method:	Least Squares	F-statistic:	4062.
Date:	Mon, 04 Dec 2017	Prob (F-statistic):	0.00
Time:	22:53:20	Log-Likelihood:	1046.7
No. Observations:	2532	AIC:	-2081.
Df Residuals:	2526	BIC:	-2046.
Df Model:	5
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	10.7904	0.813	13.274	0.000	9.196	12.384
Temperature	-0.5494	0.038	-14.362	0.000	-0.624	-0.474
Humidity	-0.2518	0.023	-10.868	0.000	-0.297	-0.206
Light	0.0019	2.44e-05	76.027	0.000	0.002	0.002
CO2	-7.87e-05	5.08e-05	-1.550	0.121	-0.000	2.09e-05
HumidityRatio	1846.1456	149.148	12.378	0.000	1553.681	2138.610

Omnibus:	1838.587	Durbin-Watson:	0.413
Prob(Omnibus):	0.000	Jarque-Bera (JB):	56402.300
Skew:	-3.072	Prob(JB):	0.00
Kurtosis:	25.290	Cond. No.	3.83e+07