Drug Overdose Statistical Evaluation

by Devon Bodey

Data Wrangling

To find sufficient data, I used Data.gov. Once I found an interesting dataset with more than 100 entries and one target variable, I narrowed the information down by removing everything irrelevant. The dataset I chose for this analysis was Accidental Drug Related Deaths 2012-June 2017. A copy of the original dataset can be found here. To make the information more relevant and usable for this analysis, I removed columns such as "Case Number", "Death County", and "Location". After removing several variables, I was left with "Age", "Sex", and a column for each drug indicating whether it contributed to the victim's overdose. Once I finished cleaning the data, my spreadsheet looked like this. To be properly analyzed, the data needed to be entirely numeric. Originally, the drug columns contained "Y" and "NaN"; in Excel, I found and replaced all "Y" with 1 and "NaN" with 0. Within the notebook, I then coded "Male" as 1 and "Female" as 0 so that sex could be compared and contrasted against the remaining information as a numeric variable.
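
That same cleanup could also be scripted rather than done by hand in Excel. Here is a minimal pandas sketch, assuming the raw export were saved as raw_overdose.csv (a hypothetical filename):

import pandas as pd

# Hypothetical raw export of the Data.gov dataset.
raw_df = pd.read_csv('raw_overdose.csv')

# Drop columns that are irrelevant to the analysis.
raw_df = raw_df.drop(['Case Number', 'Death County', 'Location'], axis=1, errors='ignore')

# The drug columns hold "Y" or NaN: map "Y" to 1.
raw_df = raw_df.replace('Y', 1)

# Code sex numerically: Male = 1, Female = 0.
raw_df['Sex'] = raw_df['Sex'].map({'Male': 1, 'Female': 0})

# Any remaining missing values become 0.
raw_df = raw_df.fillna(0)

raw_df.to_csv('overdose.csv', index=False)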

Training & Cross Validation

Training and cross validation were extremely difficult for me given the insufficient dataset I originally chose to work with. Because the dataset only included individuals who had overdosed, it was nearly impossible to form any algorithm to calculate the statistical chance of a drug overdose based on age, sex, drug of choice, etc. This is something I did not realize initially. Simply put, it is impossible to train an algorithm to predict outcomes from this data because every record has the same outcome; there are no negative examples to learn from. On its own, the data implied that all women and men were using drugs and that the drugs most common in overdoses were the most deadly or most commonly used. We know that is not true. Or, if it is, this data alone is not enough to prove it.

To work around this issue, and in hopes of having at least some data to analyze regardless of accuracy, a fake dataset was added in the code. This dataset mirrored the information in the original, except that this time the person did not overdose. Adding a negative class in this way allowed a statistical analysis of overdose rates to be run at all.

To make the analysis accurate and useful in the real world, I would have had to add census data for men and women of certain ages, along with rates of use for the drugs being analyzed. I found useful census information here, and this website had strong statistical models on heroin use in the state of Connecticut.

Given the lack of accurate information, I attempted to use the fake dataset to mimic the percentage of men who use heroin by making it so that only 1/3 of the rows were overdoses. I quickly learned the trouble with that as well: the data would still have insinuated that every male who uses heroin eventually overdoses, and with overdoses now so infrequent, the models kept producing unusable results. The sketch below shows roughly what that attempt looked like.
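
This is only a minimal sketch of the idea, reusing the data_df frame built in the cells further down: appending two fake non-overdose copies of every real row puts the overdose rate near one third.

fakes = data_df.copy()
fakes["OD"] = 0  # fake rows: same people, coded as not overdosing

# Two fake copies per real row puts overdoses at roughly 1/3 of the data.
data_third = pd.concat([data_df, fakes, fakes], ignore_index=True)
print(len(data_third[data_third["OD"] == 1]) / len(data_third))  # ~0.333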

The current analysis says that, assuming the individual is a drug user and factoring in sex, age, and drug of choice (Heroin, Fentanyl, Oxycodone, and Alcohol/EtOH), there is a 50% chance they will overdose. We know this figure is inaccurate, since we forced the results to 50% by adding one fake non-overdose row for every real one. Still, from this information we can tell that, among those who have overdosed, the most common overdose is due to Heroin and the least common is from Alcohol/EtOH.

Working Model

This model would provide a properly formed prediction when fed properly formed data; it is the data, or lack thereof, that keeps it from real-world application. To be real-world ready, the data would need to be bolstered with records for non-drug-users and for drug users who did not overdose. Additionally, for it to be truly accurate, the ratio of drug users to non-drug-users would have to match the statistics in Connecticut, broken out by drug. The male-to-female ratio would also have to match the Connecticut census, and remain accurate within every age group. A rough sketch of that kind of rebalancing follows.
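
The sketch below illustrates only the sex-ratio piece, resampling rows so males and females appear in a target proportion. It is an assumption-laden illustration; target_male is a placeholder, not an actual Connecticut census figure.

# Placeholder proportion; a real analysis would pull this from the CT census.
target_male = 0.49

males = data_df[data_df["Sex"] == 1]
females = data_df[data_df["Sex"] == 0]

n = len(data_df)
n_male = int(n * target_male)

# Resample with replacement so the combined frame matches the target sex ratio.
census_matched = pd.concat([
    males.sample(n_male, replace=True, random_state=0),
    females.sample(n - n_male, replace=True, random_state=0),
], ignore_index=True)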

For this model to be real-world ready, a lot of information would need to be added. Despite not producing a practical statistical evaluation, I found this exercise extremely helpful in highlighting the possibilities of statistical machine evaluation through the use of algorithms.

In [44]:
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
 
import seaborn as sns
from statsmodels.formula.api import ols

from sklearn import linear_model
from sklearn import metrics

from sklearn.linear_model import LogisticRegression
from patsy import dmatrices

import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import random



# Custom functions

def evaluate(pred, labels_test):
    # Report overall accuracy plus a confusion-matrix breakdown.
    acc = accuracy_score(labels_test, pred)
    print ("Accuracy: %s"%acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()

    recall = tp / (tp + fn)      # true positive rate
    precision = tp / (tp + fp)   # positive predictive value
    f1 = (2 / ((1/recall)+(1/precision)))

    print ("")
    print ("True Negatives: %s"%tn)
    print ("False Positives: %s"%fp)
    print ("False Negatives: %s"%fn)
    print ("True Positives: %s"%tp)
    print ("Recall: %s"%recall)
    print ("Precision: %s"%precision)
    print ("F1 Score: %s"%f1)

def plot_bound(Z_val, data, col1, col2, binary):
    # Z_val equals the "yes" value. E.g., "Y" or 1.
    # data equals df
    # col1 and col2 define which columns to use from data
    # Plot binary decision boundary.
    # For this, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].

    x_min = float(data.iloc[:, col1].min()) - float(data.iloc[:, col1].max())*0.10
    x_max = float(data.iloc[:, col1].max()) + float(data.iloc[:, col1].max())*0.10
    y_min = 0.0
    y_max = float(data.iloc[:, col2].max()) + float(data.iloc[:, col2].max())*0.10
    h_x = (x_max-x_min)/100  # step size in the mesh
    h_y = (y_max-y_min)/100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        # Hard class predictions, mapped to 0/1 using Z_val.
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = np.where(Z == Z_val, 1, 0)
    else:
        # Predicted probability of the positive class.
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()

Data Cleaning

Here we load the data we collected and get it all ready to feed to our statistical model(s). That is, we are trying to make a table with one target column and one or more features. Here I'm loading overdose.csv, the cleaned copy of the Accidental Drug Related Deaths 2012-June 2017 dataset described above.

In [45]:
# Load and peek at your data. Change the file name as needed. 
processed_data_df = pd.read_csv('overdose.csv') 
processed_data_df.head()
Out[45]:
Sex Age Heroin Cocaine Fentanyl Oxycodone Oxymorphone EtOH Hydrocodone Benzodiazepine Methadone Amphet Tramad Morphine
0 Male 41 0 0 1 0 0 0 0 0 0 0 0 0
1 Female 48 1 0 1 1 0 0 0 1 0 0 0 0
2 Male 28 1 0 1 0 0 0 0 0 0 0 0 0
3 Male 40 0 0 1 1 0 0 0 1 0 0 0 0
4 Male 52 1 1 0 0 0 1 0 0 0 0 0 0
In [46]:
# You can replace values in a column based on logic like so.
# Here I translate the text labels in "Sex" into numbers,
# matching the coding described above: Male becomes 1, Female becomes 0.

processed_data_df.loc[processed_data_df['Sex'] == 'Male', 'Sex'] = 1
processed_data_df.loc[processed_data_df['Sex'] == 'Female', 'Sex'] = 0
processed_data_df.head()
Out[46]:
Sex Age Heroin Cocaine Fentanyl Oxycodone Oxymorphone EtOH Hydrocodone Benzodiazepine Methadone Amphet Tramad Morphine
0 1 41 0 0 1 0 0 0 0 0 0 0 0 0
1 0 48 1 0 1 1 0 0 0 1 0 0 0 0
2 1 28 1 0 1 0 0 0 0 0 0 0 0 0
3 1 40 0 0 1 1 0 0 0 1 0 0 0 0
4 1 52 1 1 0 0 0 1 0 0 0 0 0 0
In [47]:
# To make sure all of your columns are stored as numbers, use the pd.to_numeric method like so.
processed_data_df = processed_data_df.apply(pd.to_numeric, errors='coerce')
# errors='coerce' will set things that can't be converted to numbers to NaN
# so you'll want to drop these like so.
processed_data_df = processed_data_df.dropna()
processed_data_df.head()
Out[47]:
Sex Age Heroin Cocaine Fentanyl Oxycodone Oxymorphone EtOH Hydrocodone Benzodiazepine Methadone Amphet Tramad Morphine
0 1 41 0.0 0 1 0 0 0 0 0 0 0 0 0.0
1 0 48 1.0 0 1 1 0 0 0 1 0 0 0 0.0
2 1 28 1.0 0 1 0 0 0 0 0 0 0 0 0.0
3 1 40 0.0 0 1 1 0 0 0 1 0 0 0 0.0
4 1 52 1.0 1 0 0 0 1 0 0 0 0 0 0.0
In [48]:
# Keep only the columns used by the classifiers:
# sex, age, and the four drugs of interest.
data_df = processed_data_df[['Sex', 'Age', 'Heroin', 'Oxycodone', 'EtOH', 'Fentanyl']].copy()
data_df.head()
Out[48]:
Sex Age Heroin Oxycodone EtOH Fentanyl
0 1 41 0.0 0 0 1
1 0 48 1.0 1 0 1
2 1 28 1.0 0 0 1
3 1 40 0.0 1 0 1
4 1 52 1.0 0 1 0
In [49]:
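# Every row in the original data is an overdose death, so the target starts as all 1s.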
data_df["OD"] = 1
data_df.head()
Out[49]:
Sex Age Heroin Oxycodone EtOH Fentanyl OD
0 1 41 0.0 0 0 1 1
1 0 48 1.0 1 0 1 1
2 1 28 1.0 0 0 1 1
3 1 40 0.0 1 0 1 1
4 1 52 1.0 0 1 0 1
In [50]:
# Not real! Duplicate every row as a fake non-overdose record (OD = 0)
# so the classifiers have a negative class to learn from.
data_2 = data_df.copy()
data_2["OD"] = 0
data_df = pd.concat([data_df, data_2], ignore_index=True)
data_df.head()
Out[50]:
Sex Age Heroin Oxycodone EtOH Fentanyl OD
0 1 41 0.0 0 0 1 1
1 0 48 1.0 1 0 1 1
2 1 28 1.0 0 0 1 1
3 1 40 0.0 1 0 1 1
4 1 52 1.0 0 1 0 1
In [51]:
data_df[data_df["OD"]==1]
Out[51]:
Sex Age Heroin Oxycodone EtOH Fentanyl OD
0 1 41 0.0 0 0 1 1
1 0 48 1.0 1 0 1 1
2 1 28 1.0 0 0 1 1
3 1 40 0.0 1 0 1 1
4 1 52 1.0 0 1 0 1
5 0 52 0.0 0 0 1 1
6 0 33 1.0 0 0 1 1
7 0 22 0.0 0 0 0 1
8 1 45 0.0 0 0 0 1
9 1 62 0.0 0 0 1 1
10 0 46 1.0 0 0 1 1
11 0 31 1.0 0 0 0 1
12 1 27 1.0 0 0 0 1
13 0 37 0.0 0 0 1 1
14 0 41 0.0 0 0 0 1
15 1 55 0.0 0 1 1 1
16 1 34 1.0 0 0 1 1
17 1 50 1.0 0 0 1 1
18 1 24 0.0 0 0 1 1
19 1 34 1.0 0 0 0 1
20 0 48 0.0 0 0 0 1
21 1 32 0.0 0 0 1 1
22 1 42 0.0 0 0 1 1
23 1 40 0.0 1 0 1 1
24 1 28 1.0 0 0 1 1
25 0 60 0.0 1 1 0 1
26 1 58 1.0 0 0 1 1
27 0 61 0.0 0 0 0 1
28 1 43 1.0 0 0 0 1
29 0 45 1.0 0 0 1 1
... ... ... ... ... ... ... ...
493 0 62 0.0 0 1 0 1
494 1 41 0.0 0 1 1 1
495 1 34 1.0 0 0 1 1
496 0 37 0.0 0 0 0 1
497 1 34 1.0 0 0 1 1
498 1 33 0.0 0 1 0 1
499 1 25 1.0 0 0 1 1
500 0 48 1.0 0 0 1 1
501 0 38 0.0 0 0 0 1
502 1 53 0.0 1 1 0 1
503 1 56 0.0 1 0 0 1
504 0 49 0.0 0 0 0 1
505 1 27 0.0 0 0 1 1
506 1 43 0.0 0 0 1 1
507 1 37 1.0 0 0 1 1
508 1 34 0.0 0 1 0 1
509 0 54 0.0 1 1 0 1
510 1 49 1.0 0 0 1 1
511 1 29 0.0 0 0 1 1
512 1 54 0.0 0 1 0 1
513 0 23 0.0 0 0 0 1
514 0 48 0.0 0 0 0 1
515 0 53 1.0 0 0 0 1
516 1 59 1.0 0 0 0 1
517 1 22 1.0 0 0 1 1
518 0 57 0.0 0 0 0 1
519 1 53 0.0 0 0 0 1
520 0 63 1.0 0 0 1 1
521 0 53 1.0 0 1 1 1
522 1 54 0.0 0 1 1 1

523 rows × 7 columns

In [52]:
# Randomly hold out 20% of the rows for testing; train on the remaining 80%.
data = data_df
holdout = data.sample(frac=0.2)
training = data.loc[~data.index.isin(holdout.index)]
print(len(training),len(holdout))
837 209
In [53]:
# Define the target (y) and feature(s) (X)
features_train = training.drop("OD", axis=1).values
labels_train = training["OD"].values

features_test = holdout.drop("OD", axis=1).values
labels_test = holdout["OD"].values

# What percentage of the time is target Y?
print("Percentage of 1s: %s\n"%(len(data[data["OD"]==1])/len(data)))
Percentage of 1s: 0.5

In [54]:
# Logistic Regression
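# A very large C effectively disables regularization; fit_intercept=False fits no constant term.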
model = LogisticRegression(fit_intercept = False, C = 1e9)
clf = model.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("Logistic Regression")
evaluate(pred, labels_test) 
print (clf.coef_)

from sklearn import tree
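# Require at least 40 samples in a node before it may split, to limit overfitting.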
clf = tree.DecisionTreeClassifier(min_samples_split=40)
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("\nDecision Tree")
evaluate(pred, labels_test)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("Random Forest")
evaluate(pred, labels_test)  

from sklearn.svm import SVC
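# probability=True enables predict_proba, which plot_bound uses for soft boundaries.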
clf = SVC(kernel="rbf",probability=True)
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("SVM")
evaluate(pred, labels_test)  
Logistic Regression
Accuracy: 0.435406698565

True Negatives: 66
False Positives: 31
False Negatives: 87
True Positives: 25
Recall: 0.223214285714
Precision: 0.446428571429
F1 Score: 0.297619047619
[[ 0.05710977 -0.00079626 -0.00410777 -0.06475672 -0.08758175 -0.03510929]]

Decision Tree
Accuracy: 0.263157894737

True Negatives: 35
False Positives: 62
False Negatives: 92
True Positives: 20
Recall: 0.178571428571
Precision: 0.243902439024
F1 Score: 0.20618556701
Random Forest
Accuracy: 0.177033492823

True Negatives: 19
False Positives: 78
False Negatives: 94
True Positives: 18
Recall: 0.160714285714
Precision: 0.1875
F1 Score: 0.173076923077
SVM
Accuracy: 0.311004784689

True Negatives: 41
False Positives: 56
False Negatives: 88
True Positives: 24
Recall: 0.214285714286
Precision: 0.3
F1 Score: 0.25