  • Stars: 130
  • Rank: 277,575 (Top 6%)
  • Language: Python
  • License: MIT License
  • Created: almost 13 years ago
  • Updated: about 1 year ago

Repository Details

Parse and cluster USPTO patent data. Includes applications, grants, assignments, and maintenance.

Fastpat

Fetch and parse patent application, grant, assignment, and maintenance info from USPTO Bulk Data. It handles all of the USPTO's patent data formats and outputs plain CSV. It also clusters patents by firm name, first filtering candidate pairs with locality-sensitive hashing, then finding the connected components induced by a Levenshtein distance threshold.

Requirements

In general, you'll need the fire library. For parsing, you'll need: numpy, pandas, and lxml. For firm clustering, you'll additionally need: xxhash, editdistance, networkx, and Cython. All of these are available through both pip and conda. You can install all the requirements with pip by running: pip install -r requirements.txt.

Usage

Most common tasks can be executed through the fastpat command. For more advanced usage, you can also call the functions in the library directly. When using fastpat, you have to specify the data directory, either by passing the --datadir flag directly or by setting the environment variable FASTPAT_DATADIR. If you're running from a local clone of the repository rather than an installed package, run python3 -m fastpat instead of fastpat.
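
For example, to point fastpat at a directory named data and fetch grant data (the fetch command itself is described below):

export FASTPAT_DATADIR=data
fastpat fetch grant

or, from a local clone of the repository:

export FASTPAT_DATADIR=data
python3 -m fastpat fetch grant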

Downloading Data

The following USPTO data sources are supported:

  • grant: patent grants
  • apply: patent applications
  • assign: patent reassignments
  • maint: patent maintenance events
  • tmapply: trademark applications (preliminary)

To download the files for data source SOURCE, run the command

fastpat fetch SOURCE

This library ships with a list of source files for each type; however, this list will become out of date over time. As such, you can also specify your own metadata path containing these files, either by passing the --metadir flag directly or by setting the FASTPAT_METADIR environment variable. If you've cloned this repository locally, you can also update the files in fastpat/meta.
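
For example, to fetch grant data using your own metadata directory, where my_meta is just a placeholder path:

export FASTPAT_METADIR=my_meta
fastpat fetch grant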

Parsing Data

Parsing works similarly to fetching. Simply run

fastpat parse SOURCE

for one of the sources listed above.

Firm Clustering

This step is a bit more bespoke, and you may want to adapt it to your needs. In general, there are four subcommands you can pass to fastpat firms:

  • assign: eliminates duplicate or redundant patent transfers from the reassignment data
  • cluster: groups firm names into common entities using locality-sensitive hashing and Levenshtein distance
  • cites: aggregates citation data to the patent level
  • merge: brings it all together into a firm-year panel

The simplest approach is to run these subcommands in order, as in the sketch below.
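
A minimal sketch of the full sequence, assuming all four data sources have been fetched and parsed (adjust --sources to whichever sources you actually use):

fastpat firms assign
fastpat firms cluster --sources apply,grant,assign,maint
fastpat firms cites
fastpat firms merge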

Example

Suppose you just want to parse patent grants. To do this, you would go through the following steps:

  1. Set up the environment with export FASTPAT_DATADIR=data
  2. Fetch the grant data with fastpat fetch grant
  3. Parse the grant data with fastpat parse grant
  4. Cluster firm names with fastpat firms cluster --sources grant
  5. Process citations with fastpat firms cites

If you want to work with applications, grants, reassignments, and maintenance, you can run the following:

  1. Set up the environment with export FASTPAT_DATADIR=data
  2. Fetch all the data with fastpat fetch SOURCE for each SOURCE in apply, grant, assign, maint (four separate commands, or a single loop as shown after this list)
  3. Parse all the data with fastpat parse SOURCE for each SOURCE in apply, grant, assign, maint (four separate commands)
  4. Prune the reassignment data with fastpat firms assign
  5. Cluster firm names with fastpat firms cluster --sources apply,grant,assign,maint
  6. Process citations with fastpat firms cites
  7. Merge into firm-year panel with fastpat firms merge
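
Steps 2 and 3 can also be written as a single shell loop rather than eight separate commands:

for SOURCE in apply grant assign maint; do
    fastpat fetch $SOURCE
    fastpat parse $SOURCE
done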

Data Updates

Continual data updating works very well for applications and grants. Only new files will be downloaded and unzipped. The way the patent office constructs the assignment data means that you'll have to delete it and re-download it roughly once a year. Similarly, maintenance information is stored in a single file, so to update that, you'll need to delete the data file raw/maint/MaintFeeEvents.zip and rerun the fetch command.

The parsing code will also only parse new files. If you wish to rerun the parsing step for a given file, either delete its outputs (in the parsed data directory) or pass the --overwrite flag (this works for the fetching step too). The clustering and merging steps must be run for any update to propagate the changes throughout. These will take about the same amount of time even for small updates, as they are undertaking global computations. Every command is idempotent, meaning it can be rerun without breaking anything.
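
For instance, to refresh the maintenance data and re-parse it, assuming the data directory is data (and noting that the exact placement of the --overwrite flag here is an assumption):

rm data/raw/maint/MaintFeeEvents.zip
fastpat fetch maint
fastpat parse maint --overwrite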

Migration

If you've been using older versions of this repository, the new data layout is slightly different. To avoid having to re-download everything, you can move the contents of your old data directory to data/raw and use data as the data directory path that you pass to fastpat. It's probably best to then re-parse everything and remove the parsed and tables directories, as sketched below.
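
A sketch of such a migration, where old_data is a placeholder for wherever your existing data lives, and the locations of the parsed and tables directories inside the data directory are assumptions:

mkdir -p data/raw
mv old_data/* data/raw/
export FASTPAT_DATADIR=data
rm -rf data/parsed data/tables

After this, rerun fastpat parse for each source as described above.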

More Repositories

  1. miniview (JavaScript, 106 stars): GNOME Shell plugin that displays a mini window preview (like picture-in-picture on a TV)
  2. fastreg (Jupyter Notebook, 60 stars): Fast sparse regressions with advanced formula syntax. OLS, GLM, Poisson, Maxlike, and more. High-dimensional fixed effects.
  3. data_science (Jupyter Notebook, 12 stars): Getting started working with data, with applications to economics
  4. elltwo (JavaScript, 8 stars): Collaborative technical document creation: SQLite backend, browser frontend. Markdown, math, images, references, citations. Full text search.
  5. patents_analyze (Jupyter Notebook, 6 stars): Patent analysis collection.
  6. macro_data (Jupyter Notebook, 5 stars): Intro to working with macroeconomic data
  7. speedcam (Jupyter Notebook, 5 stars): Track car speeds from video using YOLOv5 + Kalman filter
  8. econsir (Python, 3 stars): Econ-SIR model implementation. Includes simulation, estimation, optimal policy, and dashboard.
  9. patents_china (Python, 3 stars): Match Chinese firm data to Chinese patent data
  10. fuzzy (JavaScript, 3 stars): Note taking with fuzzy search
  11. ellone (JavaScript, 3 stars): Browser-based tool for mathy markdown composition. Live demo at http://dohan.io/
  12. embed (Jupyter Notebook, 2 stars): Text Embeddings: A Practical Overview
  13. gum.js (JavaScript, 1 star): Grammar for SVG creation.
  14. meteo (Python, 1 star): Homotopy solver in Python. Provides analytic derivatives.
  15. wikidiff (Python, 1 star): Generate word differentials from wiki history
  16. xml2csv (Python, 1 star): Extract flat data from XML files and output to CSV
  17. console (Python, 1 star): Real-time browser-based visualization and interactivity