Intake Data Analysis

By Erica Dumore

About

For our final Coding the Law project, we were instructed to pair up with a partner of a law firm and use what we had learned in one, or all three, of our previous class projects to create something that would meet a need at the partner's firm or serve as a tool for more efficient daily practice.

I teamed up with Palace Law, a law firm located in Seattle, Washington, that handles primarily personal injury and workers' compensation matters. The basis of this project was to create a client intake algorithm, an idea the firm had come up with themselves and had already begun working on but had not yet executed. The firm was gracious enough to invite me to become a part of their project and to help them take steps toward bringing their vision to life.

My portion of this project is but a link in the overall chain, a starting point for Palace Law. A great deal of work remains, but this first step has shown that the partners' idea has real potential to make law firms everywhere more efficient.

Background

A client calls a law firm and explains what happened to them: what they believe to be a legal issue and how they feel they were wronged. The staff member on the other end of the telephone gathers what information they can (name, telephone number, a brief explanation of what happened) and passes the call along to an attorney. The attorney gathers additional information over the phone and, if necessary, meets with the client to determine whether the client has a viable legal issue and decides whether or not to take on the case. This has been the process in the legal profession for as long as anyone can remember, but is it truly an efficient use of attorneys' time?

This is the question Palace Law began grappling with, believing there must be a more efficient way to decide whether or not to take on a case. If a consultation does not result in a signed retainer agreement, both the lawyer and the potential client have wasted time, time that could have been put toward a lucrative case with a real chance of success.

Palace Law felt that if they could eliminate the step between the client's call and the firm's decision on whether the case was worth taking, they could be far more efficient with their time and resources. It would also benefit the potential client, who would know before attending a consultation whether the firm was willing to take the matter on, and could seek representation elsewhere rather than sitting through an intake consultation only to learn the firm was declining the case.

Thus, Palace Law came up with the idea of using data analysis to determine, based on the information gathered during a potential client's initial telephone call, whether the case is one the firm would likely win and should therefore schedule an intake consultation for, screening out the potential cases likely to fail and waste the firm's time and resources. The same information could also suggest steps to increase the value of the cases that are accepted.

This final project aims to bring that algorithm to life using data gathered by Palace Law over the last two months. The algorithm tells the firm whether to accept or reject future potential clients based on the information entered during initial telephone calls. After the information is loaded and analyzed, an accuracy score tells us how accurate the algorithm is at this moment, with the data we have, and hints at how accurate it could become with more. The next step would be to turn this analysis into a model that takes the same inputs and returns an answer of 'accept' or 'reject.'

Creation Process

Framing the Problem:

This project primarily seeks to benefit legal professionals, although there is a benefit to clients as well. The algorithm, as discussed briefly above, will allow attorneys to focus their efforts on strong, merit-based claims that will result in success for the client and be lucrative for the firm.

This benefits attorneys because the reality of the profession is that time is money. If an attorney can determine whether a case is worth taking before sitting down with a client for an hour, that hour can be spent moving the matter itself forward. Those hours add up: if a firm can eliminate the time spent in intake consultations that never produce signed retainer agreements, it regains precious time to spend ensuring that the matters likely to succeed actually do. The result is greater efficiency for both clients and the firm, fewer gambles on risky matters, less wasted time and money, and better overall outcomes.

This can arguably also benefit potential clients because, as alluded to above, they will know before attending an intake consultation whether the firm would be willing to take their case. If a firm could call a potential client and explain that the matter did not appear to be a good fit, the client could seek counsel at another firm without going through the whole intake process, which can be intimidating for people who have never faced a legal issue before.

It could also provide clarity to clients who believe they have a strong legal claim when in fact they do not: the firm could explain that, based on past data, the claim is unlikely to succeed. That could save a potential client a great deal of time and money, helping them realize at the outset that a lawsuit is a long, difficult process they probably would not want to undertake with such a low chance of a successful outcome. It could also help clients and attorneys identify issues that are decreasing the value of a case.

Research, Ideation & Prototyping:

Throughout this process I corresponded on multiple occasions with Jordan, an attorney at Palace Law, and his colleagues. Jordan and I communicated by email and in a conference call on October 12, 2017, in which we discussed the project, the process, the signing of a non-disclosure agreement, and some initial ideas for how the project could be executed. (Professor Colarusso was cc'd on all of the email correspondence, which I am happy to forward again should that be needed.) We also discussed how I would receive the data: through a program called AirTable, which would allow Palace Law to continually update the information as more data came in.

During our conference call, I explained the three projects our class would complete before the final was due and said it was clear this project should be modeled on our third class project, which used data analysis, rather than on QnA Markup or a Twitter bot. Jordan agreed, and after discussions with Professor Colarusso I developed a plan: write code to load the data in the two spreadsheets, clean it, build the algorithm, and measure its accuracy. The end result would tell us whether the idea could be executed and whether the algorithm could be relied on in the future, should a model be created.

A great deal more email correspondence took place before the final project was implemented; this is discussed further under the following two headings.

User Testing & Refinement:

STEP ONE:

As discussed, Palace Law and I corresponded throughout the creation of this algorithm. The first step required me to access AirTable, a website similar to Google Docs that allowed the data to pass easily between the firm and my computer. Once I accessed the data through AirTable, I downloaded the two spreadsheets containing all of the data collected, saved each file as a .csv, and placed them in the folder I created for this project. I then retitled the files and added their names to my two loading lines of code. After each document was loaded on its own, I wrote a line of code that merged the two files into one, using the firm's identifier, the ID number, as the join key to ensure all of the information stayed with the correct client.

The lines of code containing the two file names were labeled File 1 and File 2. I did this so that, no matter which .csv files were attached as File 1 and File 2, the third cell, the one that merges the two documents, would never have to be rewritten. This let me swap in new .csv files throughout the process as the firm provided updated data. It will also benefit the firm should they continue using this algorithm, because updating the spreadsheets to reflect new data becomes a simple task.
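The load-and-merge pattern can be sketched as follows (a minimal illustration with toy stand-in data; the real notebook reads the firm's two .csv exports instead):

```python
import pandas as pd

# Toy stand-ins for the two spreadsheet exports; in practice these two
# lines would be pd.read_csv(...) calls whose file names change with
# each update, while the merge line below never changes.
file_1 = pd.DataFrame({'ID': [101, 102, 103], 'REASON': ['L&I', 'L&I', 'PI']})
file_2 = pd.DataFrame({'ID': [101, 102, 103], 'SURGERY': ['YES', 'NO', 'NO']})

# Join on the shared ID column so each client's information from both
# sheets ends up on a single row.
data = file_1.merge(file_2, on='ID')
print("size: ", len(data))
```

The ID values and column names here are placeholders, not the firm's actual schema.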

The firm provided updated spreadsheets three times. Each time, I simply changed the spreadsheet number when saving it as a .csv and swapped the previous .csv files for the two new ones.

The .csv files have not been linked to this notebook because they contain confidential client information.

STEP TWO:

After merging the two spreadsheets, I had to clean the data so that only viable data would be considered: the target (whether or not to accept the potential case) and the features that bear on that outcome. Anything irrelevant to determining the target had to be dropped. I chose to do this in code, rather than in the spreadsheet itself, because I want the features to be the same every time, even as the data is updated by the firm. This ensures that when Palace Law updates the spreadsheet, all they need to change is the file name; the features will always remain the same.
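Dropping columns in code rather than in the spreadsheet can be sketched like this (toy data and column names, not the firm's actual schema):

```python
import pandas as pd

# Toy merged table: DOB duplicates the information in AGE, so only one
# of the two needs to survive as a feature.
data = pd.DataFrame({'ID': [101, 102],
                     'DOB': ['1/1/1980', '2/2/1985'],
                     'AGE': [37, 32],
                     'ACCEPT?': ['ACCEPT', 'REJECT']})

# Because the drop happens in code, the surviving feature set is
# identical every time the firm refreshes the spreadsheet.
data = data.drop(['DOB'], axis=1)
```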

In the cells of code below you can see which columns I dropped because they would not affect the target outcome of whether the potential claim should be 'accepted' or 'rejected'. These included the person's date of birth, because date of birth and age provide the same information, so there was no need for both columns. I also dropped the claim number, because every potential client had already been given an ID number when entered into the spreadsheet. This step was time-consuming and required speaking with Jordan at Palace Law to ensure I did not drop any columns important to the outcome. It also required me to gain a better understanding of what each column meant so that I could consider how it might affect the target.

In our email correspondence I explained that I could not include any columns with narrative answers, meaning an explanation beyond yes/no or a similar fixed answer set. Narratives are usually not repetitive, so the code would have no way to recognize patterns among the answers. Additionally, as discussed below, all answers had to be converted to numbers (zero or one), and narrative answers cannot be converted because they are not uniform; each one is unique.

Therefore, I reached out to Jordan for further information about certain columns and to ask what some abbreviations stood for. Again, I wanted to be sure that when I converted the columns into features, I was including only the viable columns that would help answer the overall question of whether Palace Law should accept or reject the case.

Additional dropped columns were: the attorney assigned to the case, the Clio contact information, the preferred method of contact, how the firm obtained the client's phone number, and the date of the call. I eliminated these because they did not influence whether to accept or reject a case; they simply provided administrative information.

A few columns that I eliminated, but that would be interesting to experiment with to see whether they alter the data or influence the accuracy, were: the injury date, the treating doctor, and the employer. This information could reveal interesting patterns, such as whether a certain employer generates many workers' compensation claims and, if so, whether those claims result in successful lawsuits worth accepting more often than claims against other employers.

STEP THREE:

Next, I created dummy variables for each of the columns I had decided would be influential in determining the target. Dummy variables were needed for every column answered with a term or a phrase rather than a number. They create additional columns, one per distinct answer in the original column, further extending the spreadsheet. Wherever a row contained a given term or phrase, a one is placed in that answer's new column; otherwise a zero is placed there. These numbers are then used to determine the target.
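In miniature, dummy-variable expansion works like this (toy column values for illustration):

```python
import pandas as pd

# One text column becomes one 0/1 indicator column per distinct answer:
# row 0 answered HEAD, so its HEAD column holds a 1 and its KNEE column
# a 0, and so on for every row.
data = pd.DataFrame({'BODY PART 1': ['HEAD', 'KNEE', 'HEAD']})
df_new = pd.concat([data, pd.get_dummies(data['BODY PART 1'])], axis=1)
print(df_new.columns.tolist())
```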

The third stage took by far the longest. Because there were so many columns, I had to spread the dummy variables across multiple cells: when a single cell contained too many dummy-variable lines, only a few of them, often the bottom three, actually took effect, so only some of the intended columns became dummy variables. I also ran into a problem because the spreadsheet allowed potential clients to list more than one injured body part, but not everyone had multiple injuries, so those who did not had 'NaN' throughout those columns, which skewed the data.

After reviewing the data and meeting with Professor Colarusso, I discovered that the error with my dummy variables was actually a bug in my code. I had accidentally been overwriting my dummy columns in each new cell: every line was being written against data, when only the first dummy-variable line should start from data and each following line should build on df_new. Once I fixed this, all of my dummy variables worked and the cells ran.
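In simplified form, the bug looked like this (column names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'REASON': ['L&I', 'PI'], 'GENDER': ['F', 'M']})

# Wrong: every cell started over from `data`, so the REASON dummies
# created on the first line are silently discarded by the second.
df_wrong = pd.concat([data, pd.get_dummies(data['REASON'])], axis=1)
df_wrong = pd.concat([data, pd.get_dummies(data['GENDER'])], axis=1)

# Right: only the first line starts from `data`; each later line builds
# on df_new, so all of the dummy columns accumulate.
df_new = pd.concat([data, pd.get_dummies(data['REASON'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['GENDER'])], axis=1)
```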

At this stage I also inserted a line of code that wrote the working data out to a .csv file, so that I could constantly check my work and confirm I had not missed a column and that all dummy columns were producing numeric responses.

I also ran into a problem because some column names had to be changed before they could be run through the dummy code or referenced later. The most troublesome was the column labeled "Referred By (Individual)", because the code kept trying to interpret the parentheses in the title as code. This is something the firm should consider when labeling the spreadsheet and entering answers. For now the column is renamed within this code, but if the firm were to rewrite the code or create another algorithm, this is something to be careful of.

STEP FOUR:

After running all of my dummy variables and checking that only the proper feature columns remained and that all of them produced numeric values, I hit another problem: duplicate columns. The inconsistency produced a column without any numbers, which caused the final spreadsheet to present as containing zero rows.

Knowing this could not be right, I opened the .csv file that captured the data up to that cell (the check file discussed in step three) and found which rows were causing the problem.

The problem was inconsistency in the answers, specifically under body parts. Sometimes an answer had been entered as HEAD and sometimes as head. The computer treated this small, human-error inconsistency as two different answers, throwing off the whole data set.

Therefore, I added a line of code, just after the cell that merges the spreadsheets, that converts every answer to all caps. I then had to go through the other cells and change every referenced answer, including all of the newly created dummy-variable columns, to all caps so the code would recognize them. Although a relatively simple fix, it could be avoided entirely by entering data into the original spreadsheets consistently. This is something the firm will want to watch for in the future: this algorithm is set up to bypass these differences, but a new algorithm or a model would suffer from the same issue.
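The capitalization fix, in miniature (toy values; the real notebook applies it to the whole merged table):

```python
import pandas as pd

data = pd.DataFrame({'BODY PART 1': ['head', 'HEAD', 'Knee']})

# Upper-case every cell so 'head' and 'HEAD' count as the same answer.
data = data.apply(lambda x: x.astype(str).str.upper())
```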

STEP FIVE:

Next, I brought the code to its final form. Once it was verified, I inserted a line of code to drop every NaN found in the data; NaN means no answer was given in that spot. After dropping the NaNs, I ran a line of code that reported what percentage of the potential claims in the data set had been accepted. As you will see below, this cell told us that roughly 40% of cases are accepted.

STEP SIX:

Lastly, I ran the final data set through four different classifiers to test accuracy. On the first run I set the holdout to .2, which generated impressive accuracy numbers. Logistic regression was the strongest classifier, coming out at a .9 accuracy score and a .91 F1 score. Over several runs at a .2 holdout these numbers dropped a little, but logistic regression always remained above a .7 accuracy score.

I then changed the holdout to .5, which decreased the logistic regression score but increased the decision tree accuracy. I did this mostly to experiment with the data, then changed the holdout back to .2, which provides the most realistic accuracy output.

After seeing that logistic regression was the strongest classifier on each run, I generated a breakdown cell to better understand what was driving the high accuracy. You will see this in the last cell as a list of numbers. These numbers are coefficients that correspond, in order, to the feature columns. By finding the largest numbers and counting over to the columns they correspond to, you can see which columns are most influential to the model.
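Rather than counting columns by hand, the coefficients can be paired with their column names directly. A sketch with toy data (the real notebook would zip the feature columns, i.e. everything except ACCEPT, against `model.coef_[0]`; the column names here are invented):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the cleaned feature table and target.
X = pd.DataFrame({'SURGERY_YES': [1, 0, 1, 0, 1, 0],
                  'HEAD':        [0, 0, 1, 1, 0, 1]})
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression(fit_intercept=False, C=1e9)
model.fit(X, y)

# Pair each coefficient with its column name and sort by magnitude, so
# the most influential features appear first.
influence = sorted(zip(X.columns, model.coef_[0]),
                   key=lambda t: abs(t[1]), reverse=True)
for name, coef in influence:
    print(name, coef)
```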

This is something the firm will want to consider. After locating the largest numbers and matching them to the columns they correspond with, I saw that some NaN columns still existed, meaning some of the most influential features are either supposed to be labeled 'other' or exist simply because no other answer was given. This should be explored further, because it is very influential to whether a potential claim is accepted.

Product:

Intro Pitch:

During Week 10 of class we each gave a five-minute pitch of our final project. Using a PowerPoint presentation, I walked my classmates through the details explained under the 'About' and 'Background' headings, including my overall goal for the project (to create a working algorithm based on the law firm's data) and my plan to model the final project on our third class assignment.

Unfortunately, due to an accidental GitHub malfunction, this PowerPoint is no longer accessible; otherwise I would link it here.

Impact & Efficiencies:

This project, and Palace Law's overall intake-algorithm aspiration, could have a huge impact on the legal profession, especially on small firms. The algorithm would let partners and associates who are already working at maximum capacity skip a step, leaving it to the algorithm (eventually built into a model) to decide whether to accept a potential client's claim. The partner or associate would come into a case already knowing it has the potential to be successful and therefore lucrative for the firm.

Thus, it would increase partners' and associates' overall efficiency without changing their daily routines. Legal professionals could spend more time on substantive work, leading to more successful cases, which in turn means happier clients and greater income for the firm. It would also decrease the number of risky cases the firm takes on, which typically become time sinks and financial losses.

Real World Viability

At this moment the algorithm is viable in the real world, in that it can predict with reasonable accuracy whether the firm should accept a potential case based on the information entered in the spreadsheets. The accuracy will only continue to grow as Palace Law collects more data. So while it can be relied on now as a predictor, it has the ability to become that much more reliable over time. The spreadsheet was updated several times during the creation process, as discussed above, and each time the accuracy rose along with the amount of data. The data therefore demonstrates that this project can predict accurately, and once the data set grows to include start-to-finish cases, its reliability will improve further.

Although the algorithm is viable now, it can also be altered should Palace Law want to experiment with the features to see how they affect the accuracy and the outcomes. As mentioned above, if Palace Law feels a feature does not serve the outcome and should be removed, or that an excluded feature should have been included, the change would be easy to implement without affecting the viability of the working algorithm.

Sustainability:

As stated throughout, this is the first step in Palace Law's overall vision. The algorithm can be easily maintained and built upon. As Palace Law takes more intake calls and updates the spreadsheet, all they will need to do is change the two file names to match the most recent .csv exports and rerun the cells from start to finish. This could be done bi-weekly, monthly, or even sporadically, should they have a high call volume one day.

The next step will be to turn this algorithm into a model, allowing Palace Law to input a single client's intake information and have the model answer whether to accept the case. That step will likely be taken on next semester by a student in the technology lab here at Suffolk University Law School, should Palace Law decide to continue working with students on this project.

In [ ]:
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
 
import seaborn as sns
from statsmodels.formula.api import ols

from sklearn import linear_model
from sklearn import metrics

from patsy import dmatrices

import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import random



# Custom functions

def evaluate(pred, labels_test):
    acc = accuracy_score(labels_test, pred)
    print("Accuracy: %s" % acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()

    # Recall = share of actual positives found; precision = share of
    # positive predictions that were correct.
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 / ((1 / recall) + (1 / precision))

    print("")
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp)
    print("Recall: %s" % recall)
    print("Precision: %s" % precision)
    print("F1 Score: %s" % f1)

def plot_bound(Z_val,data,col1,col2,binary):
    # Z-val equals "Yes" value. E.g., "Y" or "1". 
    # data equals df
    # col1 and col2 defines which colums to use from data
    # Plot binary decision boundary. 
    # For this, we will assign a color to each
    # point in the mesh [x_min, m_max]x[y_min, y_max].
    
    x_min = float(data.iloc[:,[col1]].min())-float(data.iloc[:,[col1]].min())*0.10 
    x_max = float(data.iloc[:,[col1]].max()+float(data.iloc[:,[col1]].min())*0.10)
    y_min = 0.0; 
    y_max = float(data.iloc[:,[col2]].max())+float(data.iloc[:,[col2]].max())*0.10
    h_x = (x_max-x_min)/100  # step size in the mesh
    h_y = (y_max-y_min)/100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])   
        Z = np.where(Z=="Y",1,0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()
In [ ]:
# Load and peek at your data. Change the file name as needed. 
file_1 = pd.read_csv('data/PNC5.csv') 
print("size: ",len(file_1))
file_1.head()
In [ ]:
file_2 = pd.read_csv('data/PNC6.csv') 
print("size: ",len(file_2))
file_2.head()
In [ ]:
# docs on merge: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
# breakdown of joins: https://www.codeproject.com/KB/database/Visual_SQL_Joins/Visual_SQL_JOINS_orig.jpg

#data = file_1.merge(file_2, on=['ID','CLAIM'])
data = file_1.merge(file_2, on=['ID'])
data = data.apply(lambda x: x.astype(str).str.upper())
print("size: ",len(data))
data.head()
In [ ]:
# Rename this column: the parentheses (and trailing space) in its title
# break later references, so give it a safe name here and drop it below.
data = data.rename(columns={
                                'REFERRED BY (INDIVIDUAL) ': 'Referred_by_individual'
                                                     })
data.head()
In [ ]:
# you'll want to get a table with the target and the features only
data = data.drop(['BODY PART 6',
                  'BODY PART 7',
                  'BODY PART 8',
                  'IMPORT FROM_x',
                  'MECHANISM OF INJURY',
                  'MD',
                  'EMPLOYER',
                  'SIE/TPA',
                  'INJURY DATE',
                  'CALL_DATE',
                  'STAFF',
                  'CLAIM_x',
                  'DOB',
                  'CONTACT PREFERENCE',
                  'ATTY',
                  'Phone Number Source',
                  'Clio Contact',
                  'IMPORT FROM_y',
                  '# of IMEs',
                  'CLAIM_y',
                  'Intake Date_x',
                  'Intake Date_y',
                  'ON TIMELOSS',
                  'TIMELOSS GAPS',
                  'APPLIED',
                  'Referred_by_individual'], axis=1)

data.head()
print("size: ",len(data))
In [ ]:
# get rid of pending
data = data[data["ACCEPT?"]!='PENDING']
data = data[data["CLAIM STATUS"]!='PENDING']
print("size: ",len(data))
In [ ]:
# https://chrisalbon.com/python/pandas_convert_categorical_to_dummies.html
df_new = pd.concat([data, pd.get_dummies(data['REFERRAL SOURCE'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['REASON'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['BODY PART 1'])], axis=1)


df_new.head()
In [ ]:
# get_dummies on BODY PART 2-5 would produce duplicate column names, so
# instead flag each additional injured body part in the indicator
# columns created from BODY PART 1 above.
body_parts = ['KNEE', 'HEAD', 'LEG', 'LEG(S)', 'LOW-BACK', 'MID-BACK',
              'MISC/OTHER', 'NECK', 'SI', 'SHOULDER', 'WRIST']
for col in ['BODY PART 2', 'BODY PART 3', 'BODY PART 4', 'BODY PART 5']:
    for part in body_parts:
        df_new.loc[df_new[col] == part, part] = 1


df_new.head()
In [ ]:
df_new = pd.concat([df_new, pd.get_dummies(data['GENDER'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['NATURE OF INJURY'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['TYPE'])], axis=1)

df_new.head()
In [ ]:
df_new = pd.concat([df_new, pd.get_dummies(data['LINE OF WORK'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['SURGERY'])], axis=1)
df_new = pd.concat([df_new, pd.get_dummies(data['CLAIM STATUS'])], axis=1)

df_new = df_new.rename(columns={
                                'ACCEPT?': 'ACCEPT'
                              })

df_new.loc[df_new['ACCEPT'].str.upper() == 'REJECT', 'ACCEPT'] = 0 
df_new.loc[df_new['ACCEPT'].str.upper() == 'ACCEPT', 'ACCEPT'] = 1 

df_new.loc[df_new['CURRENTLY\n WORKING'].str.upper() == 'YES', 'CURRENTLY\n WORKING'] = 1 
df_new.loc[df_new['CURRENTLY\n WORKING'].str.upper() == 'NO', 'CURRENTLY\n WORKING'] = 0 




df_new.head()
In [ ]:
# make things usable
final = df_new.copy()

final = final.drop(['REFERRAL SOURCE',
                  'REASON',
                  'BODY PART 1',
                  'BODY PART 2',
                  'BODY PART 3',
                  'BODY PART 4',
                  'BODY PART 5',
                  'GENDER',
                  'NATURE OF INJURY',
                  'LINE OF WORK',
                  'SURGERY',
                  'CLAIM STATUS',
                  'ID',
                  'TYPE',
                  'ZIP'], axis=1)

final.head()
In [ ]:
final.to_csv("data/quick_write_to_check.csv")
In [ ]:
final = final.apply(pd.to_numeric, errors='coerce')
final = final.dropna()
print("size: ",len(final))

final.to_csv("data/combined.csv")
final.head()
In [ ]:
data = final
holdout = data.sample(frac=0.2)
training = data.loc[~data.index.isin(holdout.index)]

# Define the target (y) and feature(s) (X)
features_train = training.drop("ACCEPT", axis=1).values
labels_train = training["ACCEPT"].values

features_test = holdout.drop("ACCEPT", axis=1).values
labels_test = holdout["ACCEPT"].values

# What percentage of the time is target Y?
print("Percentage of 1s: %s\n"%(len(data[data["ACCEPT"]==1])/len(data)))
In [ ]:
# Read this https://en.wikipedia.org/wiki/Confusion_matrix

# https://en.wikipedia.org/wiki/Logistic_regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept = False, C = 1e9)
clf = model.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("==========================")
print("Logistic Regression")
evaluate(pred, labels_test)  

# https://en.wikipedia.org/wiki/Random_forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("\n==========================")
print("\nRandom Forest")
evaluate(pred, labels_test)  

# https://en.wikipedia.org/wiki/Support_vector_machine
from sklearn.svm import SVC
clf = SVC(kernel="rbf",probability=True)
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("\n==========================")
print("\nSVM")
evaluate(pred, labels_test)  

# http://scikit-learn.org/stable/modules/tree.html#tree
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=40)
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("\n==========================")
print("\nDecision Tree")
evaluate(pred, labels_test)
In [ ]:
# https://en.wikipedia.org/wiki/Logistic_regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept = False, C = 1e9)
clf = model.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("==========================")
print("Logistic Regression")
evaluate(pred, labels_test)  

model.coef_