


Inspectors General

A project to collect reports from the offices of Inspectors General across the US federal government.

For more information about the project, read:

What's an inspector general?

From one of the above pieces:

Just about every agency in the federal government has an independent unit, usually called the Office of the Inspector General, dedicated to independent oversight. This includes regular audits of the agency's spending, monitoring of active government contractors and investigations into wasteful or corrupt agency practices. They ask tough questions, carry guns, and sue people.

How you can help

The initial round of writing scrapers for all 65 federal IGs has come to a close. However, there are two important areas we need help in:

Ask @konklone for an invitation to the project Slack if you want to talk with teammates and get involved.

  • Just as importantly, sending in reports we can't scrape.

There are 9 IGs that do not publish reports online, many of them from the US government's intelligence community.

Generally, getting their reports means filing Freedom of Information Act requests, or finding the results of FOIA requests others have already made.

We also need unpublished reports from the other 65 IGs! We're scraping what they publish online, but most IGs do not proactively publish all of their reports.

Submitting IG reports

We don't yet have a formal process for submitting reports — for now, either open an issue and post a link to the file, or email the report to [email protected].

Scraping IG reports

Python 3: This project uses Python 3, and is tested on Python 3.4.0. If you don't have Python 3 installed, check out pyenv and pyenv-virtualenvwrapper for easily installing and switching between multiple versions of Python.

Dependencies:

  • To extract PDFs (the most common type of report), you'll need pdftotext, pdfinfo, and qpdf. On Ubuntu, apt-get install poppler-utils qpdf. On OS X, brew install poppler qpdf.
  • To extract DOCs, you'll need abiword, which you can install via apt-get or brew.
  • Install all the pip dependencies by running pip install -r requirements.txt

To run an individual IG scraper, just execute its file directly. For example:

./inspectors/usps.py

This will fetch the current year's reports from the Inspector General for the US Postal Service and write them to disk, along with JSON metadata.

If you want to go back further, use --since or --year to specify a year or range:

./inspectors/usps.py --since=2009

If you want to run multiple IG scrapers in a row, use the igs script:

./igs

By default, the igs script runs all scrapers. It takes the following arguments:

  • --safe: Limit scrapers to those declared in safe.yml. "Safe" scrapers are meant to be stable enough for clients who wish to fully automate their report pipeline, without human intervention when new IGs are added.
  • --only: Limit scrapers to a comma-separated list of names. For example, --only=opm,epa will run inspectors/opm.py and inspectors/epa.py in turn.
  • --data-directory: The directory path to store the output files. Defaults to data in the current working directory.
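The selection rules above can be sketched in Python. This is only an illustration, not the igs script's actual code: the function name select_scrapers and its parameters are hypothetical, and the "safe" list is passed in directly rather than read from safe.yml, whose format is not described here.

```python
def select_scrapers(available, only=None, safe=None):
    """Return the scraper files to run, in order.

    available: all known scraper names (hypothetical input).
    only: the value of --only, a comma-separated string, or None.
    safe: names declared safe (assumed to come from safe.yml), or None.
    """
    names = list(available)
    if safe is not None:
        # --safe: keep only scrapers that were declared safe
        names = [name for name in names if name in safe]
    if only:
        # --only: keep the requested names, in the order they were given
        wanted = [name.strip() for name in only.split(",") if name.strip()]
        names = [name for name in wanted if name in names]
    return ["inspectors/{}.py".format(name) for name in names]


print(select_scrapers(["epa", "opm", "usps"], only="opm,epa"))
# ['inspectors/opm.py', 'inspectors/epa.py']
```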

Using the data

Reports are broken up by IG and by year. So a USPS IG report from 2013 with a scraper-determined ID of no-ar-13-010 will create the following files:

/data/usps/2013/no-ar-13-010/report.json
/data/usps/2013/no-ar-13-010/report.pdf
/data/usps/2013/no-ar-13-010/report.txt

Metadata for a report is at report.json. The original report will be saved at report.pdf (the extension will match the original, so it may not be .pdf). The text from the report will be extracted to report.txt.
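For anyone consuming the data, the layout above maps to paths like this. It is a minimal sketch: report_paths is a hypothetical helper, not part of the project, and file_type stands in for whatever extension the original report had.

```python
from pathlib import Path


def report_paths(data_dir, inspector, year, report_id, file_type="pdf"):
    """Build the three file paths for one report, following the layout above."""
    base = Path(data_dir) / inspector / str(year) / report_id
    return {
        "metadata": base / "report.json",
        # the extension matches the original file, so it may not be pdf
        "original": base / "report.{}".format(file_type),
        "text": base / "report.txt",
    }


paths = report_paths("data", "usps", 2013, "no-ar-13-010")
print(paths["metadata"])
```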

Common options

Every scraper will accept the following options:

  • --year: A YYYY year. Only fetch reports from this year.
  • --since: A YYYY year. Only fetch reports from this year onwards.
  • --debug: Print extra output to STDOUT. (Can be quite verbose when downloading.)
  • --dry_run: Will scrape sites and write JSON metadata to disk, but won't download full reports or extract text.
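How --year and --since translate into a set of years can be sketched as below. This is not the project's own option handling: years_to_fetch is a hypothetical name, and letting --year take precedence when both options are given is an assumption, since the text above does not say what happens in that case.

```python
import datetime


def years_to_fetch(year=None, since=None, today=None):
    """Return the list of years a scraper run would cover."""
    current = (today or datetime.date.today()).year
    if year is not None:
        # --year: only this year (assumed to win over --since)
        return [int(year)]
    if since is not None:
        # --since: from this year up to and including the current year
        return list(range(int(since), current + 1))
    # no option given: the current year's reports only
    return [current]


print(years_to_fetch(since=2009, today=datetime.date(2012, 6, 1)))
# [2009, 2010, 2011, 2012]
```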

Report metadata

Every report has an accompanying JSON file with metadata. That JSON file is an object with the following required fields:

  • inspector - The handle you chose for the IG. e.g. "usps"
  • inspector_url - The IG's primary website URL.
  • agency - The handle of the agency the report relates to. This can be the same value as inspector, but it may differ -- some IGs monitor multiple agencies.
  • agency_name - The full text name of an agency, e.g. "United States Postal Service"
  • report_id - A string usable as an ID for the report.
  • title - Title of report.
  • published_on - Date of publication, in YYYY-MM-DD format.

Additionally, some information about report URLs is required. However, not all report contents are released: some are sensitive or classified, or require a FOIA request to obtain. Use these fields to handle report URLs:

  • url - URL to the report itself. Required unless unreleased is True.
  • landing_url - URL to some kind of landing page for the report.
  • unreleased - Set to True if the report's contents are not fully released.

If unreleased is True, then url is optional and landing_url is required.

The JSON file may have arbitrary additional fields the scraper author thought worth keeping.

The report_id must be unique within that IG, and should be stable and idempotent.
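The field rules above can be checked with a small validator. This is a sketch for scraper authors rather than the project's own validation code; validate_report is a hypothetical name, and the date check only tests the YYYY-MM-DD shape, not whether the date exists.

```python
import re

REQUIRED_FIELDS = (
    "inspector", "inspector_url", "agency", "agency_name",
    "report_id", "title", "published_on",
)
DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_report(report):
    """Return a list of problems found in a report's metadata dict."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not report.get(field):
            errors.append("missing required field: " + field)
    published = report.get("published_on")
    if published and not DATE_SHAPE.match(str(published)):
        errors.append("published_on is not in YYYY-MM-DD format")
    if report.get("unreleased"):
        # unreleased reports may omit url but must have a landing page
        if not report.get("landing_url"):
            errors.append("landing_url is required when unreleased is true")
    elif not report.get("url"):
        errors.append("url is required unless unreleased is true")
    return errors
```

An empty list means the metadata satisfies the rules described in this section. Uniqueness of report_id within an IG has to be checked across reports and is not covered here.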

Bulk data and backup

This project's chief maintainer, Eric Mill, runs a copy of this project on a server that automatically backs up the downloaded bulk data.

Data is backed up to the Internet Archive.

To back up individual reports as items in the collection, run the backup script:

./backup

This goes through all reports in data/ for which a report has been released (in other words, where unreleased is not true), and uploads their metadata and report data to the Internet Archive.
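The selection step described above, finding every report whose contents have been released, might look like the following. It is a sketch under the directory layout from "Using the data", not the backup script itself; released_reports is a hypothetical name.

```python
import json
from pathlib import Path


def released_reports(data_dir):
    """Yield (report directory, metadata) for every released report."""
    # layout assumed: <data_dir>/<inspector>/<year>/<report_id>/report.json
    for path in sorted(Path(data_dir).glob("*/*/*/report.json")):
        with path.open(encoding="utf-8") as handle:
            report = json.load(handle)
        # a missing or false "unreleased" field means the report is released
        if not report.get("unreleased"):
            yield path.parent, report
```

Calling list(released_reports("data")) would give the reports that are candidates for upload.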

For example, the Treasury IG's 2014 report OIG-14-023 can be found at:

https://archive.org/details/us-inspectors-general.treasury-2014-OIG-14-023
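Judging from that one example, item URLs appear to combine the inspector handle, the year, and the report ID. The sketch below reproduces that pattern; it is inferred from the single URL above, so the real backup script may build identifiers differently (for instance, by normalizing unusual characters). archive_item_url is a hypothetical name.

```python
def archive_item_url(inspector, year, report_id):
    """Build an Internet Archive item URL using the pattern in the example above."""
    identifier = "us-inspectors-general.{}-{}-{}".format(inspector, year, report_id)
    return "https://archive.org/details/" + identifier


print(archive_item_url("treasury", 2014, "OIG-14-023"))
# https://archive.org/details/us-inspectors-general.treasury-2014-OIG-14-023
```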

To generate bulk data, run the following commands, starting from the project's output data/ directory:

zip -r ../us-inspectors-general.bulk.zip * -x "*.done"
cd ..
./backup --bulk=us-inspectors-general.bulk.zip

Both zipping and uploading take a long time -- this is a several-hour process at minimum.

The process zips up the contents of the data/ directory, while excluding any .done files that track the status of individual file backups. The zip file is placed up one directory, so that it doesn't interfere with the automatic directory examination of data/ that many scripts employ.

Then the file is uploaded to the Internet Archive as part of the collection, to be a convenient bulk mirror of the entire thing.
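For environments without the zip command, the zipping step can be approximated with Python's standard library. This is a sketch that mirrors the command above, not something the project ships; zip_bulk_data is a hypothetical name, and the upload step is left to ./backup.

```python
import zipfile
from pathlib import Path


def zip_bulk_data(data_dir, zip_path):
    """Zip everything under data_dir except *.done files; return the file count.

    zip_path should live outside data_dir, as in the command above,
    so the archive does not try to include itself.
    """
    data_dir = Path(data_dir)
    count = 0
    with zipfile.ZipFile(str(zip_path), "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file() and not path.name.endswith(".done"):
                # store paths relative to data_dir, like running zip inside it
                archive.write(str(path), str(path.relative_to(data_dir)))
                count += 1
    return count
```

Run from the project root, zip_bulk_data("data", "us-inspectors-general.bulk.zip") would produce a file suitable for ./backup --bulk.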

[TBD: Proper collection landing page, and bulk data link.]

Resources

Public domain

This project is dedicated to the public domain. As spelled out in CONTRIBUTING:

The project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.
