This project reads in three patent portfolios, deletes any expired patents, and then evaluates the remaining patents based on age, number of citations, and number of claims. If two patents have the same number of citations and the same number of claims, the patent with the longer remaining term is ranked as more valuable. For this project, I selected AMD's patent portfolio, along with subsets of Google's and Yahoo's portfolios, to give three different examples.
Using a local MySQL database was taking too long to retrieve the data, so I used the PatentsView API (http://www.patentsview.org/api/doc.html) to generate my datasets. I took advantage of the API wrapper from https://github.com/CSSIP-AIR/PatentsView-APIWrapper to run the queries and convert the data from JSON to CSV files for the input of my analysis.
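For reference, the rough shape of such a query can be sketched as below. The field names follow the PatentsView documentation linked above, but this is a hypothetical sketch: `build_portfolio_query` is a helper name of my own, and the assignee string is a placeholder that should be checked against the actual data.

```python
import json

# Hypothetical sketch of the query parameters for a PatentsView patents query.
# The 'q' (query) and 'f' (fields) parameter names follow the API documentation.
def build_portfolio_query(assignee, fields):
    """Build the q/f parameters to fetch a portfolio by assignee name."""
    query = {"_eq": {"assignee_organization": assignee}}
    return {"q": json.dumps(query), "f": json.dumps(fields)}

params = build_portfolio_query(
    "Advanced Micro Devices, Inc.",  # placeholder assignee string
    ["patent_number", "patent_date", "patent_num_claims",
     "patent_num_combined_citations"],
)
```

The wrapper linked above handles building and sending such queries, plus the JSON-to-CSV conversion.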
import pandas as pd
from pandas import ExcelWriter
import numpy as np
from datetime import datetime
# Weighting factors for the final score. I experimented with different values and ultimately settled on these.
citation_score_factor = 2.0
claim_score_factor = 0.25
date_score_factor = 0.5
# Data files to work on
AMD_data_file = 'Patent_Portfolios - AMD.csv'
Google_data_file = 'Patent_Portfolios - Google.csv'
Yahoo_data_file = 'Patent_Portfolios - Yahoo.csv'
# Custom functions
# Boolean predicate to determine if a patent has already expired
def isExpired(patent_date):
    # Add 20 years to the grant date and then compare it with today
    expiry = datetime.strptime(patent_date, '%Y-%m-%d')
    expiry = expiry.replace(year=expiry.year + 20)
    return expiry < datetime.now()
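As a quick sanity check of the 20-years-from-grant assumption, here is a self-contained version of the same test (`is_expired` is a standalone copy, not the project function):

```python
from datetime import datetime

# Mirrors isExpired above: a patent is treated as expired 20 years after its grant date.
def is_expired(patent_date):
    granted = datetime.strptime(patent_date, '%Y-%m-%d')
    return granted.replace(year=granted.year + 20) < datetime.now()

print(is_expired('1995-06-13'))  # True: a 1995 grant is past its 20-year term
```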
# Function to flag expired patents so they can be excluded from the analysis
def markExpired(patents):
    # Iterate through the dataframe and add a boolean column indicating whether a patent has expired
    for i, row in patents.iterrows():
        patents.at[i, 'expired'] = isExpired(row['patent_date'])
    return patents

# Function to get only the non-expired patents for the analysis
def getNonExpired(patents):
    # Return a dataframe containing only unexpired patents
    return patents[patents['expired'] == False]
# Function to score the patent claims based on the quantile they fall in
def scoreClaims(patents):
    # Score the patents based on which quantile their number of claims falls into. This value is
    # saved as the patent's claim score and becomes part of the overall score later.
    sorted_patents = patents.sort_values(by='patent_num_claims', ascending=False, kind='mergesort')
    levels = [round(q, 2) for q in np.arange(0.05, 1.01, 0.05)]
    quantiles = sorted_patents['patent_num_claims'].quantile(levels)
    max_claims = sorted_patents['patent_num_claims'].max()
    for i, row in sorted_patents.iterrows():
        if row['patent_num_claims'] == max_claims:
            # Give a higher value to the maximum of the set
            sorted_patents.at[i, 'claim_score'] = 1.10
            continue
        for q in levels:
            if row['patent_num_claims'] <= quantiles[q]:
                sorted_patents.at[i, 'claim_score'] = q
                break
    return sorted_patents
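The row-by-row quantile binning used in scoreClaims (and scoreCitations below) can also be expressed in a vectorized way. This is a sketch under the same 5%-bin assumption; `quantile_score` is a helper name of my own, not part of the project code:

```python
import numpy as np
import pandas as pd

def quantile_score(series, step=0.05):
    """Map each value to the smallest quantile level whose cutoff covers it."""
    levels = np.round(np.arange(step, 1.0 + step / 2, step), 2)
    cutoffs = series.quantile(levels).to_numpy()
    # searchsorted finds the first cutoff >= each value, i.e. its quantile bin
    idx = np.searchsorted(cutoffs, series.to_numpy(), side='left')
    scores = levels[np.minimum(idx, len(levels) - 1)]
    # give the maximum of the set a bonus, as in scoreClaims
    scores = np.where(series.to_numpy() == series.max(), 1.10, scores)
    return pd.Series(scores, index=series.index)
```

This avoids the per-row loop entirely, which matters once portfolios grow to thousands of patents.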
# Function to score the patent citations based on the quantile they fall in
def scoreCitations(patents):
    # Score the patents based on which quantile their number of citations falls into. This value is
    # saved as the patent's citation score and becomes part of the overall score later.
    sorted_patents = patents.sort_values(by='patent_num_combined_citations', ascending=False, kind='mergesort')
    levels = [round(q, 2) for q in np.arange(0.05, 1.01, 0.05)]
    quantiles = sorted_patents['patent_num_combined_citations'].quantile(levels)
    max_citations = sorted_patents['patent_num_combined_citations'].max()
    for i, row in sorted_patents.iterrows():
        if row['patent_num_combined_citations'] == max_citations:
            # Give a higher value to the maximum of the set
            sorted_patents.at[i, 'citation_score'] = 1.10
            continue
        for q in levels:
            if row['patent_num_combined_citations'] <= quantiles[q]:
                sorted_patents.at[i, 'citation_score'] = q
                break
    return sorted_patents
# Function to score the patents based on their dates. Slightly different from the previous two functions because I couldn't get the quantiles to work with dates.
def scoreDates(patents):
    # Score the patents based on how much time remains in their term. This value is saved as the
    # patent's date score and becomes part of the overall score later.
    sorted_patents = patents.sort_values(by='patent_date', ascending=False, kind='mergesort')
    # I couldn't get quantiles to work with dates, so I'm scoring dates based on how much time is
    # left in the term. Anything less than two years is nearly worthless for enforcement.
    # A patent granted more than 18 years ago has less than 2 years left of its 20-year term, and so on.
    two_years = datetime.now()
    two_years = two_years.replace(year=two_years.year - 18)
    five_years = datetime.now()
    five_years = five_years.replace(year=five_years.year - 15)
    ten_years = datetime.now()
    ten_years = ten_years.replace(year=ten_years.year - 10)
    fifteen_years = datetime.now()
    fifteen_years = fifteen_years.replace(year=fifteen_years.year - 5)
    for i, row in sorted_patents.iterrows():
        grant_date = datetime.strptime(row['patent_date'], '%Y-%m-%d')
        if grant_date <= two_years:
            sorted_patents.at[i, 'date_score'] = .2
        elif grant_date <= five_years:
            sorted_patents.at[i, 'date_score'] = .4
        elif grant_date <= ten_years:
            sorted_patents.at[i, 'date_score'] = .6
        elif grant_date <= fifteen_years:
            sorted_patents.at[i, 'date_score'] = .8
        else:
            sorted_patents.at[i, 'date_score'] = 1
    return sorted_patents
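The cutoffs above work because a patent granted N years ago has roughly 20 - N years left under the 20-year assumption. A small helper (my own, not part of the project code) makes that arithmetic explicit:

```python
from datetime import datetime

# Hypothetical helper: approximate years remaining in a patent's term,
# assuming a 20-year term measured from the grant date (ignores leap-day edge cases).
def remaining_term_years(patent_date, term_years=20):
    granted = datetime.strptime(patent_date, '%Y-%m-%d')
    expiry = granted.replace(year=granted.year + term_years)
    return (expiry - datetime.now()).days / 365.25
```

So the `two_years` cutoff (now minus 18 years) corresponds to patents where this value drops below 2.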
# Function that uses all the other utility functions to score the patents
def scorePatents(patents):
    # Mark the expired patents
    patents = markExpired(patents)
    # Get the non-expired patents
    scored_patents = getNonExpired(patents)
    # Score the claims
    scored_patents = scoreClaims(scored_patents)
    # Score the citations
    scored_patents = scoreCitations(scored_patents)
    # Score the dates
    scored_patents = scoreDates(scored_patents)
    # Incorporate the score factors set earlier and add up the results
    for i, row in scored_patents.iterrows():
        total_score = (row['citation_score'] * citation_score_factor
                       + row['claim_score'] * claim_score_factor
                       + row['date_score'] * date_score_factor)
        scored_patents.at[i, 'patent_score'] = total_score
    sorted_data = scored_patents.sort_values(by='patent_score', ascending=False, kind='mergesort')
    return sorted_data
# Utility function to write the results to an Excel spreadsheet
def writeToExcel(AMD_df, Google_df, Yahoo_df):
    with ExcelWriter('patent_results.xlsx') as writer:
        AMD_df.to_excel(writer, sheet_name='AMD')
        Google_df.to_excel(writer, sheet_name='Google')
        Yahoo_df.to_excel(writer, sheet_name='Yahoo')
We are going to load in the data for each portfolio, and then display the head of each file. The input files are .csv files.
# Load and peek at the AMD data.
unsorted_AMD_data = pd.read_csv(AMD_data_file)
unsorted_AMD_data.head()
# Load and peek at the Google data.
unsorted_Google_data = pd.read_csv(Google_data_file)
unsorted_Google_data.head()
# Load and peek at the Yahoo data.
unsorted_Yahoo_data = pd.read_csv(Yahoo_data_file)
unsorted_Yahoo_data.head()
We are going to score each portfolio and write the results to an Excel spreadsheet. Each portfolio will be written to a different sheet within the workbook.
We are using the following three scoring criteria:
- patent_num_combined_citations: a measure of how many other patent applications have referenced the patent
- patent_num_claims: a measure of the total number of claims in the patent
- patent_date: a measure of how much time remains in the patent term
As discussed during my presentation, patent_num_combined_citations is traditionally the most accurate factor in determining a patent's value, which is why I weighted it far more heavily than the other two criteria. I included the other criteria because (1) if I only used patent_num_combined_citations, this would simply be an exercise in sorting on that field, and (2) the other criteria are important in determining a patent's worth. For example, all other things being equal, a patent with ten years remaining is far more valuable than a patent with two years remaining.
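To make the weighting concrete, here is a small worked example using the factors set at the top of the notebook (2.0, 0.25, 0.5) with hypothetical component scores:

```python
citation_score_factor = 2.0
claim_score_factor = 0.25
date_score_factor = 0.5

# Hypothetical patent: 95th-percentile citations, median claims, ~10 years remaining
citation_score, claim_score, date_score = 0.95, 0.50, 0.60
total_score = (citation_score * citation_score_factor
               + claim_score * claim_score_factor
               + date_score * date_score_factor)
# citations contribute 1.90 of the ~2.33 total, dominating the ranking
```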
Initially, I wanted to evaluate the length of the first independent claim, because a shorter claim is usually an indication that the claim is easier to infringe. This is where my issues with MySQL came into play: it was taking more than two minutes per query to retrieve the first independent claim of each patent. With the size of the portfolios, this wasn't practical, so I will leave that for future work. I did start to experiment with a GPU-powered database called 'mapd,' which promised much faster response times, but I wasn't very familiar with it and risked spending far too much time configuring it instead of working on this project.
The 'scorePatents' function goes through a portfolio and scores the three criteria I discussed above. I run this on each of the dataframes I have already created.
unsorted_AMD_data = scorePatents(unsorted_AMD_data)
unsorted_AMD_data.head()
unsorted_Google_data = scorePatents(unsorted_Google_data)
unsorted_Google_data.head()
unsorted_Yahoo_data = scorePatents(unsorted_Yahoo_data)
unsorted_Yahoo_data.head()
Write the final results out to the Excel file. Each portfolio will be written to its own worksheet. The scored patent portfolios will be in a file called 'patent_results.xlsx.'
Initially, I wanted to use Google Sheets, but after talking with the partner I was working with, we decided to use Excel because everyone in our group is more familiar with it.
writeToExcel(unsorted_AMD_data, unsorted_Google_data, unsorted_Yahoo_data)
I asked several people in my group for feedback on this solution. One suggestion that I received and incorporated into this project is to use the date criterion as one of the scoring factors. I think this can be refined a bit more, but our particular group finds itself on the plaintiff/complainant side of things, and having at least two years remaining on a patent's term keeps both district court and International Trade Commission actions in play.
Additionally, users requested some visualization plots, but that is beyond the initial scope of this project and I have added them to the Future Work section.
I have also presented the final project to my partner, and he has expressed that it meets his expectations. The email correspondence can be found in this p4 directory as 'McNamra_Email.pdf.'
In the future, I would like to incorporate the GPU database I discussed earlier. Additionally, I'd like to evaluate the patents based on the length of the first independent claim. Finally, I'd like to add some visualizations to help convey how the ranking works.