Statistical/Machine Learning Model: Property Value - Cambridge

By: Charmaine Wood

Beginning Process

I began this project not knowing what I wanted to focus on but after a few days I really wanted to work with information regarding the Bar Exam. I was really excited to begin the research and was actually pretty successful obtaining information. I began working with the National Conference of Bar Examiners and was able to get getting for the last 27 years. I organized the information and later realized when I began implementing the information into my notebook that the information is very interesting but would be bery difficult to incorporate into this project. This is when I needed to figure out a different route. I began looking at the website that were sent to the class and came across a spreadsheet with information regarding the property value of a variety of properties in Cambridge.

Refinement

As I mentioned above I had to start from square one after a lot of research and organizing the data related to the Bar Exam. When I began working with the new set of data I knew that the outcome would be the price at which the property was sold but I was not sure right away which variables I would focus on. As Iwas looking through the data I kept asking myself what would affect the cost of a property. There were actually a few components I thought may work but I was most interested in the affect the assessment and building size would have on the price it was sold for. When I got to the chart section I realized that I had way to much data and had to trim it down, that is why I only worked with 3 family households.

Real-World Viability

I believe that this project does have rel-world viability due to the R squared value being 0.346219175385 and the accuracy value being 0.115427249333. There are definitely a few outliers but most of my data clumped and I believe this would be the end result of any other property type entered into the code.

In [404]:
import os
try:
    inputFunc = raw_input
except NameError:
    inputFunc = input

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import numpy as np
 
import seaborn as sns
from statsmodels.formula.api import ols

from sklearn import linear_model
from sklearn import metrics

from sklearn.linear_model import LogisticRegression
from patsy import dmatrices

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt

import random



# Custom functions

def evaluate(pred, labels_test):
    acc = accuracy_score(pred, labels_test)
    print ("Accuracey: %s"%acc)
    tn, fp, fn, tp = confusion_matrix(labels_test, pred).ravel()

    recall = tp / (tp + fp)
    percision = tp / (tp + fn)
    f1 = (2 / ((1/recall)+(1/percision)))

    print ("")
    print ("True Negatives: %s"%tn)
    print ("False Positives: %s"%fp)
    print ("False Negatives: %s"%fn)
    print ("True Positives: %s"%tp)
    print ("Recall: %s"%recall)
    print ("Precision: %s"%percision)
    print ("F1 Score: %s"%f1)

def plot_bound(Z_val,data,col1,col2,binary):
    # Z-val equals "Yes" value. E.g., "Y" or "1". 
    # data equals df
    # col1 and col2 defines which colums to use from data
    # Plot binary decision boundary. 
    # For this, we will assign a color to each
    # point in the mesh [x_min, m_max]x[y_min, y_max].
    
    x_min = float(data.iloc[:,[col1]].min())-float(data.iloc[:,[col1]].min())*0.10 
    x_max = float(data.iloc[:,[col1]].max()+float(data.iloc[:,[col1]].min())*0.10)
    y_min = 0.0; 
    y_max = float(training.iloc[:,[col2]].max())+float(training.iloc[:,[col2]].max())*0.10
    h_x = (x_max-x_min)/100  # step size in the mesh
    h_y = (y_max-y_min)/100  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h_x), np.arange(y_min, y_max, h_y))
    if binary == 1:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])   
        Z = np.where(Z=="Y",1,0)
    else:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z)
    plt.show()
In [405]:
 
raw_data_df = pd.read_csv('Cambridge2.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode') 
raw_data_df.head()
Out[405]:
PID GISID BldgNum Address Unit StateClassCode PropertyClass Zoning Map/Lot LandArea ... Systems_Plumbing Condition_YearBuilt Condition_InteriorCondition Condition_OverallCondition Condition_OverallGrade Parking_Open Parking_Covered Parking_Garage UnfinishedBasementGross FinishedBasementGross
0 13639 173-24 1 1 3 Walker Ter\rCambridge, MA\r(42.380042, -71... NaN 105 THREE-FM-RES B 173-24 9157 ... NaN 1886 Average Average Good 0 0 NaN 2340 0
1 10108 129-48 1 1 Blackstone St\rCambridge, MA\r(42.362134, -7... NaN 105 THREE-FM-RES C-1 129-48 2385 ... NaN 1916 Very Good Good Average Minus 0 0 NaN 1422 0
2 5772 100-91 1 1 Florence St\rCambridge, MA\r(42.357992, -71.... NaN 105 THREE-FM-RES C 100-91 4275 ... NaN 1902 Average Average Good 0 0 NaN 1173 0
3 12221 153-4 1 1 Frost St\rCambridge, MA\r(42.386567, -71.117... NaN 105 THREE-FM-RES B 153-4 6374 ... NaN 1886 Good Good Good Very Good 3 0 NaN 1619 0
4 9681 125-11 1 1 Jay St\rCambridge, MA\r(42.36364, -71.109705) NaN 105 THREE-FM-RES BA-3 125-11 2118 ... NaN 1902 Average Average Average 0 0 NaN 704 0

5 rows × 61 columns

In [406]:
print(raw_data_df["BuildingValue"].unique())
print(raw_data_df["SalePrice"].unique())
print(raw_data_df["Interior_LivingArea"].unique())
['889000' '714400' '655200' ..., '231400' '251000' '250900']
['10' '1' '93569' '12800' '100' '0' '225000' '90000' '30000' '1000000'
 '940000' '215000' '879000' '60315' '187000' '903000' '405000' '925000'
 '48680' '1050000' '695000' '68000' '49000' '910000' '60000' '40000'
 '1800000' '769000' '38500' '300000' '265000' '532000' '760000' '375000'
 '75000' '200000' '45000' '660000' '38000' '340000' '1285000' '468000'
 '120000' '599000' '82000' '95000' '329000' '50709' '99' '247000' '220000'
 '775000' '270000' '31000' '113500' '550000' '432000' '1200000' '163000'
 '448169' '868000' '1380000' '85000' '25000' '960000' '107123' '15000'
 '1859000' '1008000' '305000' '750000' '950000' '41500' '192000' '2538764'
 '740000' '420000' '34000' '400000' '465000' '1250000' '125000' '565000'
 '171134' '65000' '212000' '27500' '1100000' '876400' '885000' '269000'
 '1286200' '500000' '1735000' '140000' '36800' '900000' '35000' '1311000'
 '117000' '166666' '170000' '2562500' '679000' '450000' '199000' '722000'
 '37000' '496000' '1550000' '1128000' '540000' '1500000' '70000' '216000'
 '155000' '185000' '116500' '1150000' '1430000' '1790000' '146000' '320000'
 '800000' '1600000' '990000' '1525000' '14500' '610000' '124000' '228000'
 '1825000' '110000' '647000' '530000' '235000' '497000' '123000' '68500'
 '1075000' '35500' '825000' '50000' '275000' '942900' '1570000' '255000'
 '160000' '100000' '1950000' '6357500' '460000' '112167' '264000' '33000'
 '145000' '969000' '1715000' '580000' '850000' '26000' '1750000' '10700'
 '33500' '13000' '425000' '18000' '343000' '205000' '34500' '179000'
 '1580000' '1260000' '983286' '1300000' '770000' '40500' '675000' '735000'
 '240000' '1302000' '3000' '1206000' '315000' '680000' '59400' '889000'
 '190000' '117500' '63058' '880000' '1275000' '813000' '55000' '920000'
 '1453500' '585000' '115000' '242500' '545000' '195000' '2050000' '670000'
 '395000' '2031038' '1546000' '125500' '1035000' '180000' '1440000'
 '1650721' '24000' '355000' '87000' '1632000' '363000' '55070' '210000'
 '688071' '2000000' '25500' '480000' '1332800' '757500' '10800' '905000'
 '1274000' '250000' '231500' '350000' '820000' '148000' '48000' '1307000'
 '470000' '602000' '1010000' '37500' '4' '1020000' '1400000' '243000'
 '245000' '47000' '252000' '213000' '222000' '1970000' '1110000' '776000'
 '378000' '372300' '2025000' '3247920' '57200' '14000' '22000' '1420000'
 '1450000' '1316250' '127000' '230000' '453500' '440000' '1175000' '417000'
 '1025000' '59900' '137500' '62000' '1475000' '366250' '1195000' '815000'
 '1358000' '663000' '655000' '575000' '105000' '615000' '12500' '248000'
 '412000' '1215000' '239000' '330000' '600000' '237000' '197500' '1130000'
 '141854' '10000' '840000' '175000' '1185000' '1015000' '1650000' '2880000'
 '1162000' '80000' '39000' '23000' '781000' '856000' '720000' '64000'
 '402500' '437250' '1276500' '875000' '169000' '32298' '394000' '92500'
 '262000' '1965000' '390000' '849000' '29000' '1666460' '259000' '1236000'
 '279600' '1900000' '391000' '28000' '99000' '685000' '91000' '149000'
 '1366000' '485100' '1104300' '1158000' '555000' '1700000' '1335000'
 '30416' '1310000' '590000' '1000' '83000' '78000' '184000' '57000'
 '1098000' '525000' '1375000' '182500' '27000' '1065000' '2200000' '132000'
 '845000' '650000' '7500' '479039' '534500' '310800' '349900' '93000'
 '53500' '97000' '150000' '121000' '409000' '115600' '1320000' '5000'
 '790000' '69000' '485000' '189500' '302500' '43000' '399000' '20000'
 '62500' '42500' '380000' '46900' '619000' '1365000' '524000' '473500'
 '192500' '152000' '135000' '225600' '735800' '178000' '415000' '196666'
 '384500' '579000' '45500' '1061250' '975000' '455000']
['5385' '4096' '3537' '3447' '3044' '2322' '3210' '3273' '2367' '3822'
 '3750' '3942' '3142' '2268' '2523' '4068' '4295' '2514' '3969' '5196'
 '3138' '1727' '3438' '2580' '5913' '2220' '3372' '4693' '3155' '4209'
 '1232' '2613' '2811' '7987' '1750' '3342' '4305' '3604' '2250' '2877'
 '4629' '2083' '3188' '3117' '4977' '4158' '4807' '3190' '4524' '1718'
 '2500' '3012' '3825' '3498' '2859' '3870' '3741' '3435' '4494' '3330'
 '3627' '3786' '3107' '1413' '4650' '2084' '5206' '2786' '2452' '3309'
 '3482' '3644' '3859' '3795' '3270' '2006' '2938' '1470' '3207' '3320'
 '2905' '3450' '3468' '2795' '3432' '3633' '3057' '2856' '3090' '4516'
 '3597' '4020' '3848' '3905' '2200' '3036' '2679' '3819' '4386' '5154'
 '4374' '3612' '4133' '3026' '3406' '4311' '2946' '3453' '3429' '3948'
 '3168' '3930' '2784' '3516' '3267' '3762' '3700' '2720' '3510' '3780'
 '4377' '3225' '3414' '2718' '2484' '1924' '2546' '4433' '1300' '3618'
 '4575' '3321' '2985' '3834' '3608' '2640' '4795' '2907' '3681' '3165'
 '3087' '3809' '4482' '2806' '1593' '3156' '2562' '3777' '3677' '2010'
 '4371' '1957' '3324' '3070' '3334' '2598' '2631' '4320' '3375' '3555'
 '3198' '2748' '2954' '4007' '3631' '4495' '1998' '3213' '3083' '3670'
 '3192' '2661' '3396' '4113' '1866' '5347' '5379' '4081' '3082' '2972'
 '2520' '4625' '5560' '3247' '3660' '2976' '3513' '3038' '3220' '4137'
 '3172' '4774' '2888' '2949' '5082' '3657' '2685' '2209' '2820' '3507'
 '3327' '3457' '3531' '3296' '2496' '4230' '2964' '3978' '3100' '4657'
 '4755' '4197' '3693' '3745' '4170' '2646' '2700' '2461' '3585' '2973'
 '3108' '3144' '3862' '4414' '2023' '1916' '4014' '4111' '3648' '2650'
 '3704' '2850' '3996' '1914' '3993' '4221' '4173' '3483' '3564' '2590'
 '3992' '3723' '3972' '3547' '3594' '3051' '4362' '5288' '3096' '4962'
 '1604' '5472' '4668' '3024' '5094' '3753' '2184' '3682' '3370' '3111'
 '2982' '2635' '3690' '2626' '3416' '3756' '3465' '3337' '4308' '2486'
 '3427' '4030' '4250' '2910' '3954' '4116' '2400' '3550' '3158' '3434'
 '3307' '1674' '3558' '4206' '5340' '3218' '3348' '5042' '3669' '3894'
 '3856' '2776' '3055' '3573' '4004' '2724' '3288' '3900' '3707' '4008'
 '2919' '4389' '3882' '4098' '2305' '4248' '2160' '4056' '4467' '3736'
 '2871' '2320' '3444' '4646' '5427' '3030' '3979' '3615' '2895' '2584'
 '3118' '3714' '3159' '3591' '3445' '4282' '3418' '3147' '3492' '2356'
 '2594' '2970' '1240' '3369' '4921' '3346' '2900' '3222' '3064' '3246'
 '3471' '3421' '2436' '3598' '3203' '3824' '3590' '3504' '3420' '2808'
 '3489' '3696' '3924' '3360' '3830' '3186' '2190' '4194' '3302' '3792'
 '3258' '2939' '3663' '2934' '2380' '1703' '3751' '2331' '3078' '2957'
 '3105' '2572' '2670' '5886' '2196' '3368' '3109' '3947' '3456' '3358'
 '2638' '2947' '2445' '2781' '4791' '3716' '3200' '2651' '1869' '2741'
 '3588' '2587' '2814' '3757' '2487' '2064' '2668' '3587' '2592' '2852'
 '3339' '3965' '3110' '2930' '3957' '3603' '3424' '3132' '4860' '4005'
 '3730' '4242' '1520' '3761' '3680' '4018' '4129' '2317' '2286' '5199'
 '3077' '2695' '4143' '2835' '3766' '2997' '2662' '3931' '1908' '3252'
 '4526' '3840' '4988' '2863' '4246' '3384' '2346' '4653' '3548' '4355'
 '3405' '3561' '2474' '4182' '2283' '2219' '2643' '3145' '3336' '3233'
 '2488' '3539' '1738' '4050' '3522' '3072' '3495' '1433' '3567' '3487'
 '3816' '4083' '4023' '6119' '3060' '4089' '3857' '3340' '3128' '2622'
 '3410' '4034' '2737' '2853' '3115' '4203' '3474' '4770' '3667' '1877'
 '1938' '3417' '3599' '4476' '4812' '3333' '2620' '3162' '2448' '3983'
 '3243' '3185' '3927' '3665' '2754' '2911' '2469' '5435' '3966' '3426'
 '3651' '4185' '4278' '3193' '3378' '4608' '4215' '2916' '3961' '2144'
 '2960' '2710' '4219' '3542' '3294' '2926' '3749' '3467' '3501' '4041'
 '2112' '6024' '3609' '3621' '3951' '3526' '3004' '2978' '4066' '4590'
 '4117' '5097' '3634' '2556' '3000' '3666' '2402' '2178' '4298' '1925'
 '2940' '3600' '1783' '3737' '4413' '3846' '2860' '2844' '2874' '2080'
 '3033' '3708' '3596' '3282' '3578' '2475' '3810' '3347' '3936' '4419'
 '4202' '4339' '2625' '3129' '2008' '3915' '2334' '3732' '2214' '3018'
 '3276' '3387' '3054' '3449' '4074' '3318' '2979' '3173' '3240' '3726'
 '3460' '3034' '2319' '4092' '3234' '3695' '3801' '2508' '2335' '2415'
 '4086' '5224' '2694' '3536' '3204' '3399' '5640' '2299' '2765' '3592'
 '3174' '4539' '1788' '3256' '2442' '3620' '2782' '3744' '4152' '2148'
 '2627' '3921' '2697' '3141' '3151' '4406' '3067' '4288' '3933' '3540'
 '3876' '3636' '3650' '2652' '3366' '3772' '4348' '2016' '4953' '3249'
 '3463' '4707' '3654' '3048' '3312' '4596' '3906' '2763' '3383' '4396'
 '3451' '3662' '3759' '4392' '2611' '3864' '3056' '4269' '2836' '1977'
 '3157' '2232' '3223' '4363' '2933' '3987' '3988' '3828' '3606' '3525'
 '1782' '3837' '3182' '4845' '2554' '4491' '5823' '3838' '2601' '2385'
 '4416' '4372' '3412' '2163' '2605' '2052' '2034' '2798' '3397' '5295'
 '2481' '4140' '2464' '2971' '2316' '2550' '3356' '4746' '3216' '3153'
 '3171' '2673' '2774' '3269' '3319' '3528' '2540' '4927' '3508' '4080'
 '3017' '3099' '2210' '3984' '4199' '2955' '2775' '2576' '3326' '2766'
 '2961' '4217' '2382' '1360' '4126' '3685' '3146' '4062' '4095' '3722'
 '3415' '3699' '3237' '1986' '4029' '3094' '3114' '2285' '3929' '3881'
 '2338' '5446' '3477' '3183' '5190' '3486' '4247' '3295' '2772' '2879'
 '3154' '3724' '3071' '4266' '4110' '2511' '5454' '3484' '3509' '3297'
 '3997' '5400' '3315' '3390' '2688' '2796' '2736' '2722' '2208' '2192'
 '5085' '4761' '2217' '2708' '3184' '4509' '5178' '2715' '2336' '2206'
 '4236' '4487' '4290' '2743' '2838' '3069' '3066' '2453' '3411' '2292'
 '5349' '2936' '3939' '2602' '3373' '4233' '2568' '2887' '4025' '2890'
 '3500' '2154' '2377' '3093' '2691' '5538' '1911' '4131' '2233' '4205'
 '3381' '2889' '3589' '6393' '1701' '4260' '4475' '3796' '2036' '4107'
 '3735' '3362' '2865' '4118' '3268' '3568' '2914' '3068' '3885' '5019'
 '4485' '2535' '2528' '4058' '5127' '7593' '3298' '3091' '2063' '3404'
 '1912' '2896' '1890' '3912' '2750' '2549' '1968' '2526' '2570' '2875'
 '3768' '3283' '6123' '3832' '3779' '3576' '3831' '3789' '3784' '3808'
 '3888' '3932' '2410' '3126' '2028' '4307' '2987' '3641' '4878' '1796'
 '2534' '3748' '2244' '3849' '2675' '4244' '2621' '2712' '3357' '2833'
 '3308' '3893' '4557' '2547' '3402' '2702' '5856' '2226' '2614' '2426'
 '4002' '2058']
In [407]:
raw_data_df[raw_data_df["Interior_LivingArea"]=='0'].head() 
Out[407]:
PID GISID BldgNum Address Unit StateClassCode PropertyClass Zoning Map/Lot LandArea ... Systems_Plumbing Condition_YearBuilt Condition_InteriorCondition Condition_OverallCondition Condition_OverallGrade Parking_Open Parking_Covered Parking_Garage UnfinishedBasementGross FinishedBasementGross

0 rows × 61 columns

In [408]:
print("Size of entire table: %s "%len(raw_data_df))
print("Size of entires matching filter: %s "%len(raw_data_df[raw_data_df["Interior_LivingArea"]=='0']))
Size of entire table: 3603 
Size of entires matching filter: 0 
In [409]:
print("Size of entires matching filter: %s "%len(raw_data_df[raw_data_df["Interior_LivingArea"]!='0']))
raw_data_df[raw_data_df["Interior_LivingArea"]!='0'].head()
Size of entires matching filter: 3603 
Out[409]:
PID GISID BldgNum Address Unit StateClassCode PropertyClass Zoning Map/Lot LandArea ... Systems_Plumbing Condition_YearBuilt Condition_InteriorCondition Condition_OverallCondition Condition_OverallGrade Parking_Open Parking_Covered Parking_Garage UnfinishedBasementGross FinishedBasementGross
0 13639 173-24 1 1 3 Walker Ter\rCambridge, MA\r(42.380042, -71... NaN 105 THREE-FM-RES B 173-24 9157 ... NaN 1886 Average Average Good 0 0 NaN 2340 0
1 10108 129-48 1 1 Blackstone St\rCambridge, MA\r(42.362134, -7... NaN 105 THREE-FM-RES C-1 129-48 2385 ... NaN 1916 Very Good Good Average Minus 0 0 NaN 1422 0
2 5772 100-91 1 1 Florence St\rCambridge, MA\r(42.357992, -71.... NaN 105 THREE-FM-RES C 100-91 4275 ... NaN 1902 Average Average Good 0 0 NaN 1173 0
3 12221 153-4 1 1 Frost St\rCambridge, MA\r(42.386567, -71.117... NaN 105 THREE-FM-RES B 153-4 6374 ... NaN 1886 Good Good Good Very Good 3 0 NaN 1619 0
4 9681 125-11 1 1 Jay St\rCambridge, MA\r(42.36364, -71.109705) NaN 105 THREE-FM-RES BA-3 125-11 2118 ... NaN 1902 Average Average Average 0 0 NaN 704 0

5 rows × 61 columns

In [410]:
processed_data_df = raw_data_df[raw_data_df["Interior_LivingArea"]!='0']
processed_data_df = processed_data_df[processed_data_df["Interior_LivingArea"]!='0']




print("Size of entire table: %s "%len(processed_data_df))

processed_data_df.head()
Size of entire table: 3603 
Out[410]:
PID GISID BldgNum Address Unit StateClassCode PropertyClass Zoning Map/Lot LandArea ... Systems_Plumbing Condition_YearBuilt Condition_InteriorCondition Condition_OverallCondition Condition_OverallGrade Parking_Open Parking_Covered Parking_Garage UnfinishedBasementGross FinishedBasementGross
0 13639 173-24 1 1 3 Walker Ter\rCambridge, MA\r(42.380042, -71... NaN 105 THREE-FM-RES B 173-24 9157 ... NaN 1886 Average Average Good 0 0 NaN 2340 0
1 10108 129-48 1 1 Blackstone St\rCambridge, MA\r(42.362134, -7... NaN 105 THREE-FM-RES C-1 129-48 2385 ... NaN 1916 Very Good Good Average Minus 0 0 NaN 1422 0
2 5772 100-91 1 1 Florence St\rCambridge, MA\r(42.357992, -71.... NaN 105 THREE-FM-RES C 100-91 4275 ... NaN 1902 Average Average Good 0 0 NaN 1173 0
3 12221 153-4 1 1 Frost St\rCambridge, MA\r(42.386567, -71.117... NaN 105 THREE-FM-RES B 153-4 6374 ... NaN 1886 Good Good Good Very Good 3 0 NaN 1619 0
4 9681 125-11 1 1 Jay St\rCambridge, MA\r(42.36364, -71.109705) NaN 105 THREE-FM-RES BA-3 125-11 2118 ... NaN 1902 Average Average Average 0 0 NaN 704 0

5 rows × 61 columns

In [411]:
processed_data_df = processed_data_df[pd.notnull(processed_data_df["BuildingValue"])]
processed_data_df = processed_data_df[pd.notnull(processed_data_df["SalePrice"])]
processed_data_df = processed_data_df[pd.notnull(processed_data_df["Interior_LivingArea"])]
print("Size of entire table: %s "%len(processed_data_df)) 
processed_data_df.head()
Size of entire table: 3603 
Out[411]:
PID GISID BldgNum Address Unit StateClassCode PropertyClass Zoning Map/Lot LandArea ... Systems_Plumbing Condition_YearBuilt Condition_InteriorCondition Condition_OverallCondition Condition_OverallGrade Parking_Open Parking_Covered Parking_Garage UnfinishedBasementGross FinishedBasementGross
0 13639 173-24 1 1 3 Walker Ter\rCambridge, MA\r(42.380042, -71... NaN 105 THREE-FM-RES B 173-24 9157 ... NaN 1886 Average Average Good 0 0 NaN 2340 0
1 10108 129-48 1 1 Blackstone St\rCambridge, MA\r(42.362134, -7... NaN 105 THREE-FM-RES C-1 129-48 2385 ... NaN 1916 Very Good Good Average Minus 0 0 NaN 1422 0
2 5772 100-91 1 1 Florence St\rCambridge, MA\r(42.357992, -71.... NaN 105 THREE-FM-RES C 100-91 4275 ... NaN 1902 Average Average Good 0 0 NaN 1173 0
3 12221 153-4 1 1 Frost St\rCambridge, MA\r(42.386567, -71.117... NaN 105 THREE-FM-RES B 153-4 6374 ... NaN 1886 Good Good Good Very Good 3 0 NaN 1619 0
4 9681 125-11 1 1 Jay St\rCambridge, MA\r(42.36364, -71.109705) NaN 105 THREE-FM-RES BA-3 125-11 2118 ... NaN 1902 Average Average Average 0 0 NaN 704 0

5 rows × 61 columns

In [412]:
processed_data_df = processed_data_df[[
                                        'BuildingValue', 
                                        'SalePrice', 
                                        'Interior_LivingArea'
                                     ]].copy()
processed_data_df.head()
Out[412]:
BuildingValue SalePrice Interior_LivingArea
0 889000 10 5385
1 714400 1 4096
2 655200 1 3537
3 781900 1 3447
4 494200 93569 3044
In [413]:
processed_data_df = processed_data_df.rename(columns={
                                                        'BuildingValue': 'ValueOfBuilding', 
                                                        'SalePrice': 'SalePrice',
                                                        'Interior_LivingArea': 'SizeOfInterior'
                                                     })
processed_data_df.head()
Out[413]:
ValueOfBuilding SalePrice SizeOfInterior
0 889000 10 5385
1 714400 1 4096
2 655200 1 3537
3 781900 1 3447
4 494200 93569 3044
In [414]:
processed_data_df = processed_data_df.apply(pd.to_numeric, errors='coerce')

processed_data_df = processed_data_df.dropna()
processed_data_df.head()
Out[414]:
ValueOfBuilding SalePrice SizeOfInterior
0 889000 10 5385
1 714400 1 4096
2 655200 1 3537
3 781900 1 3447
4 494200 93569 3044
In [415]:
Property_lin_df = processed_data_df[[
                               'SalePrice', 
                               'ValueOfBuilding', 
                               'SizeOfInterior'
                               ]].copy()
Property_lin_df.head()
Out[415]:
SalePrice ValueOfBuilding SizeOfInterior
0 10 889000 5385
1 1 714400 4096
2 1 655200 3537
3 1 781900 3447
4 93569 494200 3044
In [416]:
data = Property_lin_df

data = data[data["SalePrice"]>100000]
data = data[data["SizeOfInterior"]>2000]
data = data[data["ValueOfBuilding"]>150000]


print(len(data))

holdout = data.sample(frac=0.05)
training = data.loc[~data.index.isin(holdout.index)]
1110
In [417]:
sns.lmplot(x="ValueOfBuilding", y="SalePrice", data=training, x_estimator=np.mean, order=1)
Out[417]:
<seaborn.axisgrid.FacetGrid at 0x1b7cc34908>
In [418]:
sns.lmplot(x="SizeOfInterior", y="SalePrice", data=training, x_estimator=np.mean, order=1)
Out[418]:
<seaborn.axisgrid.FacetGrid at 0x1b7b976198>
In [421]:
model = ols("SalePrice ~ SizeOfInterior + ValueOfBuilding", training).fit()

model.summary()
Out[421]:
OLS Regression Results
Dep. Variable: SalePrice R-squared: 0.346
Model: OLS Adj. R-squared: 0.345
Method: Least Squares F-statistic: 278.3
Date: Mon, 04 Dec 2017 Prob (F-statistic): 1.02e-97
Time: 22:05:15 Log-Likelihood: -15260.
No. Observations: 1054 AIC: 3.053e+04
Df Residuals: 1051 BIC: 3.054e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5.435e+05 6.75e+04 8.051 0.000 4.11e+05 6.76e+05
SizeOfInterior -202.4167 23.383 -8.657 0.000 -248.299 -156.534
ValueOfBuilding 1.1011 0.048 22.968 0.000 1.007 1.195
Omnibus: 308.919 Durbin-Watson: 1.051
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1209.732
Skew: 1.352 Prob(JB): 2.04e-263
Kurtosis: 7.498 Cond. No. 3.74e+06
In [422]:
features_train = training.drop("SalePrice", axis=1).as_matrix(columns=None)
labels_train = training["SalePrice"].as_matrix(columns=None)

features_test = holdout.drop("SalePrice", axis=1).as_matrix(columns=None)
labels_test = holdout["SalePrice"].as_matrix(columns=None)

lm = linear_model.LinearRegression()
clf = lm.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = metrics.r2_score(labels_test, pred)
print("R squared:",lm.score(features_train,labels_train))
print("Accuracy:",accuracy)
R squared: 0.346219175385
Accuracy: 0.115427249333