• Stars
    84
  • Rank 389,211 (Top 8 %)
  • Language
    HTML
  • Created about 11 years ago
  • Updated almost 9 years ago

Repository Details

Tooling to extract data from scanned paper forms OCRed by Tesseract, using the hOCR standard.

More Repositories

1

990-xml-reader

IRSx: Turn the IRS' versioned XML 990 nonprofit annual tax returns into standardized Python objects, JSON, or human-readable text with the original line numbers and descriptions.
Python
118
star
2

parsing-prickly-pdfs

NICAR 2016 talk about PDFs!
62
star
3

covid_hospitals_demographics

COVID-19 relevant data on hospital location / capacity, nursing home location / capacity, county demographics
HTML
24
star
4

990-xml-database

Django app to consume and store 990 data and metadata
Python
22
star
5

pdf17

NICAR 17: advanced PDF manipulation
17
star
6

irsx_cookbook

IRSX Cookbook
Jupyter Notebook
16
star
7

pdf_bbox_utils

Helpers to create .csv files of word-level bounding boxes from text-based PDFs, or from hOCR output.
Python
7
star
8

990-xml-metadata

Metadata describing the 990 XML release, to be used by 990-xml-reader and related projects
7
star
9

plpython_textmatch

Add some fuzzy string-matching operations to PostgreSQL
7
star
10

pdf20

Advanced PDF manipulation with pdfplumber for NICAR 2020 / New Orleans
Jupyter Notebook
6
star
11

doc-wrangler

Noodle with DocumentCloud
Python
5
star
12

texas_rrc

some railroad commission oil / gas production files
5
star
13

reconcile-legislators

Test an OpenRefine reconciliation service to match legislators' names
Python
5
star
14

paper_fec

Parse the OCR'ed paper FEC filings (as well as the electronic ones)
Python
5
star
15

nicar-nonprofit-datarelease

Documentation for nonprofit data released at NICAR 2020
5
star
16

easy-stats-113

Data from the Census Bureau's "easy stats" site, the first available on the 113th Congress.
Python
4
star
17

freefcc

Python
4
star
18

house_disbursements

Muck with Sunlight's House disbursement CSVs
Python
3
star
19

senate_disbursements

Process (partially) the Senate clerk's report on spending.
Python
2
star
20

inspectfile

like inspectdb, but for files
Python
2
star
21

irs_527

Process 527 data to CSVs
Python
2
star
22

legacy_0809_acs_exporter

Legacy export of ACS processing from 2008 3-year ACS for R and PostgreSQL
Python
2
star
23

990-xml-admin

Keep tabs on 990 filings
HTML
1
star
24

fec_ftp

Another bucket of scripts for grabbing the FEC's FTP data, etc., for Django + Postgres
Python
1
star