parserator

A toolkit for making domain-specific probabilistic parsers

Do you have domain-specific text data that would be much more useful if you could derive structure from the strings? This toolkit will help you create a custom NLP model that learns from patterns in real data and then uses that knowledge to process new strings automatically. All you need is some training data to teach your parser about its domain.

What does a probabilistic parser do?

Given a string, a probabilistic parser will break it out into labeled components. The parser uses conditional random fields to label components based on (1) features of the component string and (2) the order of labels.
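As a rough illustration of those two ingredients, the sketch below turns a string into per-token feature dictionaries. The tokenizer, the feature names, and the function names are all hypothetical choices made for this example, not parserator's API. A conditional random field library would consume dictionaries like these and learn how labels tend to follow one another.

```python
import re


def tokenize(raw_string):
    """Split a raw string on whitespace (a deliberately simple, assumed tokenizer)."""
    return raw_string.split()


def token_features(token):
    """Features that depend only on the token itself."""
    return {
        "length": len(token),
        "is_digit": token.isdigit(),
        "is_capitalized": token[:1].isupper(),
        "has_punctuation": bool(re.search(r"[^\w\s]", token)),
    }


def sequence_features(tokens):
    """Add position-aware features so a model can make use of token order."""
    features = []
    for i, token in enumerate(tokens):
        feats = token_features(token)
        feats["is_first"] = i == 0
        feats["is_last"] = i == len(tokens) - 1
        # The neighboring token gives the model context about what came before.
        feats["previous_is_digit"] = tokens[i - 1].isdigit() if i > 0 else False
        features.append(feats)
    return features


tokens = tokenize("123 Main St.")
for token, feats in zip(tokens, sequence_features(tokens)):
    print(token, feats)
```

Which features are worth computing depends entirely on the domain; the ones above are only plausible starting points for address-like strings.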

When is a probabilistic parser useful?

A probabilistic parser is particularly useful for sets of strings that may have common structure/patterns, but which deviate from those patterns in ways that are difficult to anticipate with hard-coded rules.

For example, in most cases, addresses in the United States start with a street number. But there are exceptions: some valid U.S. addresses start with a building name or a P.O. box instead. Furthermore, addresses in real data sets often include typos and other errors. Because there are far too many patterns and possible typos to anticipate one by one, a probabilistic parser is well-suited to parsing U.S. addresses.
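To make the limitation of hard-coded rules concrete, here is a minimal sketch of a rule-based parser that assumes every address starts with a street number. The regular expression and the function name are invented for this illustration; they are not part of parserator.

```python
import re

# A deliberately naive rule: a street number, then everything else as the street.
STREET_NUMBER_FIRST = re.compile(r"^(?P<number>\d+)\s+(?P<street>.+)$")


def rule_based_parse(address):
    """Return the matched components, or None when the rule does not apply."""
    match = STREET_NUMBER_FIRST.match(address)
    return match.groupdict() if match else None


print(rule_based_parse("123 Main St"))
print(rule_based_parse("PO Box 456"))
print(rule_based_parse("Merchandise Mart, 222 W Merchandise Mart Plaza"))
```

The first address matches the rule, while the P.O. box and the address that starts with a building name both fall through and return None. Each such exception would need another hand-written rule.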

With a probabilistic approach (as opposed to a rule-based one), the parser can continually learn from new training data and thus continually improve its performance!

Some other examples of domains where a probabilistic parser can be useful:

  • addresses in other countries with unfamiliar conventions
  • product names/descriptions (e.g., parsing phrases like "Twizzlers Twists, Strawberry, 16-Ounce Bags (Pack of 6)" into brand, item, flavor, weight, etc.)
  • citations in academic writing

Examples of parserator

Try out these parsers on our web interface!

How to make a parser - quick overview

For more details on each step, see the parserator documentation.

  1. Initialize a new parser

    pip install parserator
    parserator init [YOUR PARSER NAME]
    python setup.py develop
    
  2. Configure the parser to your domain

    • configure labels (i.e., the set of possible tags for the tokens)
    • configure the tokenizer (i.e., how a raw string will be split into a sequence of tokens to be tagged)
  3. Define features relevant to your domain

    • define token-level features (e.g., length, casing)
    • define sequence-level features (e.g., whether a token is the first token in the sequence)
  4. Prepare training data

    • Parserator reads training data in XML format
    • To create XML training data from unlabeled strings in a CSV file, use parserator's command line interface to manually label tokens. It reads the values in the first column and ignores all other columns. To start labeling, run parserator label [infile] [outfile] [modulename]
    • For example, parserator label unlabeled/rawstrings.csv labeled_xml/labeled.xml usaddress
  5. Train your parser

    • To train your parser on your labeled training data, run parserator train [traindata] [modulename]
    • For example, parserator train labeled_xml/labeled.xml usaddress or parserator train "labeled_xml/*.xml" usaddress
    • After training, your parser will have an updated model, in the form of a .crfsuite settings file
  6. Repeat steps 3-5 as needed!
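To give a feel for what labeled XML training data might look like (step 4), the sketch below writes one labeled token sequence to XML and reads it back using only the standard library. The element names Collection and TokenSequence are assumptions made for this example, and the token labels are borrowed from an address-like domain. Check the module that parserator init generates for the names your parser actually uses.

```python
import xml.etree.ElementTree as ET

# Assumed element names, for illustration only.
GROUP_LABEL = "Collection"
PARENT_LABEL = "TokenSequence"


def sequence_to_xml(labeled_tokens):
    """Build one XML element from a list of (token, label) pairs."""
    parent = ET.Element(PARENT_LABEL)
    for token, label in labeled_tokens:
        child = ET.SubElement(parent, label)
        child.text = token
        child.tail = " "  # keep tokens separated by a space in the output
    return parent


def xml_to_sequence(parent):
    """Recover the (token, label) pairs from one XML element."""
    return [(child.text, child.tag) for child in parent]


labeled = [
    ("123", "AddressNumber"),
    ("Main", "StreetName"),
    ("St.", "StreetNamePostType"),
]

collection = ET.Element(GROUP_LABEL)
collection.append(sequence_to_xml(labeled))
xml_text = ET.tostring(collection, encoding="unicode")
print(xml_text)

# Reading the XML back yields the same labeled tokens.
round_trip = [xml_to_sequence(seq) for seq in ET.fromstring(xml_text)]
print(round_trip)
```

In practice the parserator label command produces this kind of file for you; the sketch is only meant to show the idea of one tag per labeled token.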

How to use your new parser

Once you are able to create a model from training data, install your custom parser by running python setup.py develop.

Then, in a Python shell, you can import your parser and use the parse and tag methods to process new strings. For example, to use the probablepeople module:

>>> import probablepeople
>>> probablepeople.parse('Mr George "Gob" Bluth II')
[('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
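The list of (token, label) pairs returned by parse is easy to post-process. As a minimal sketch, the hypothetical helper below groups tokens by label into an ordered mapping. It is a simplified illustration written for this example, not the library's tag method, which may behave differently (for instance, in how it handles repeated labels).

```python
from collections import OrderedDict

# Output copied from the probablepeople.parse example above.
parsed = [
    ("Mr", "PrefixMarital"),
    ("George", "GivenName"),
    ('"Gob"', "Nickname"),
    ("Bluth", "Surname"),
    ("II", "SuffixGenerational"),
]


def group_by_label(labeled_tokens):
    """Join tokens that share a label, keeping labels in order of first appearance."""
    grouped = OrderedDict()
    for token, label in labeled_tokens:
        if label in grouped:
            grouped[label] += " " + token
        else:
            grouped[label] = token
    return grouped


for label, text in group_by_label(parsed).items():
    print(label, "->", text)
```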

Errors and Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report an issue.

Patches and Pull Requests

We welcome your ideas! You can make suggestions in the form of GitHub issues (bug reports, feature requests, general questions), or you can submit a code contribution via a pull request.

How to contribute code:

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request with a description of your work! Don't worry if it isn't perfect: think of a PR as the start of a conversation rather than a finished product.

Copyright and Attribution

Copyright (c) 2016 DataMade. Released under the MIT License.
