The Sunlight Foundation's Congress API. Shut down on Oct. 1, 2017.

Sunlight Congress API

This is the code that powers the Sunlight Foundation's Congress API.

Overview

The Congress API has two parts:

  • A light front end, written in Ruby using Sinatra.
  • A back end of data scraping and loading tasks. Most are written in Ruby, but Python tasks are also supported.

The front end is essentially read-only. Its job is to translate an API call (the query string) into a single database query (usually to MongoDB), wrap the resulting JSON in a bit of pagination metadata, and return it to the user.
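As a rough illustration of that flow, the sketch below turns a query string into a filter plus pagination settings, then wraps the matching documents in pagination metadata. It uses only Ruby's standard library, with an in-memory array standing in for MongoDB. The method names (build_query, paginate), the reserved parameter list, the default page size, and the metadata keys other than results are assumptions made for this example, not the API's actual code.

```ruby
require "uri"

# Parameters that shape the response rather than filter documents.
# (Hypothetical list for this sketch.)
RESERVED_PARAMS = %w[page per_page fields].freeze

# Split a raw query string into a filter hash and pagination settings.
def build_query(query_string)
  params = URI.decode_www_form(query_string).to_h
  filter = params.reject { |key, _| RESERVED_PARAMS.include?(key) }
  page = [params.fetch("page", "1").to_i, 1].max
  per_page = [params.fetch("per_page", "20").to_i, 1].max
  { filter: filter, page: page, per_page: per_page }
end

# Apply the filter to an in-memory "collection" and wrap one page of
# results in pagination metadata.
def paginate(documents, query)
  matches = documents.select do |doc|
    query[:filter].all? { |field, value| doc[field].to_s == value }
  end
  offset = (query[:page] - 1) * query[:per_page]
  {
    "results" => matches[offset, query[:per_page]] || [],
    "count" => matches.size,
    "page" => { "page" => query[:page], "per_page" => query[:per_page] }
  }
end

bills = [
  { "bill_id" => "hr1-113", "chamber" => "house" },
  { "bill_id" => "s1-113", "chamber" => "senate" },
  { "bill_id" => "s2-113", "chamber" => "senate" }
]

response = paginate(bills, build_query("chamber=senate&per_page=1&page=2"))
puts response["results"].inspect
puts response["count"]
```

The point of the sketch is the shape of the request path: one filter, one lookup, one wrapper, with no model-specific branching.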

Endpoints and behavior are determined by introspecting on the classes defined in models/. These classes are also expected to define database indexes where applicable.
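One way such introspection could work is sketched below. Everything here is hypothetical: the ApiModel module, the registration hook, and the naive "append an s" pluralization are stand-ins chosen for the example and are not taken from this codebase.

```ruby
# Hypothetical marker module: classes that include it are recorded,
# so the front end can discover them later.
module ApiModel
  def self.models
    @models ||= []
  end

  def self.included(base)
    models << base
  end
end

class Bill
  include ApiModel
end

class Legislator
  include ApiModel
end

# Derive an endpoint path from a class name, e.g. Bill -> "/bills".
# (Naive pluralization, good enough for the sketch.)
def endpoint_for(model)
  "/" + model.name.downcase + "s"
end

puts ApiModel.models.map { |model| endpoint_for(model) }.inspect
```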

The front end tries to maintain as little model-specific logic as possible. There are a couple of exceptions (such as allowing pagination to be disabled for /legislators), but generally, adding a new endpoint is as simple as adding a model class.

The back end is a set of tasks (scripts) whose job is to write data to the collections those models refer to. Most data is stored in MongoDB, but some tasks will store additional data in Elasticsearch, and some tasks may extract citations via a citation server.

We currently manage these tasks via cron. A small task runner wraps each script in order to ensure any "reports" created along the way get emailed to admins, to catch errors, and to parse command line options.

While the front end and back end are mostly decoupled, many back-end tasks do use the definitions in models/ to save data (via Mongoid) and to manage the duplication of "basic" fields about objects onto other objects.

The API never performs joins -- if data from one collection is expected to appear as a sub-field on another collection, it should be copied there during data loading.
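A minimal sketch of that copy-at-load-time approach, using plain hashes instead of Mongoid documents. The field names, the list of "basic" fields, and the embed_sponsor helper are illustrative assumptions, not the project's actual schema.

```ruby
# Fields considered "basic" enough to copy onto other documents.
# (Hypothetical list for this sketch.)
BASIC_LEGISLATOR_FIELDS = %w[bioguide_id first_name last_name party].freeze

# Return a copy of the bill with the sponsor's basic fields embedded
# under "sponsor", so reads never need a second lookup.
def embed_sponsor(bill, legislators_by_id)
  sponsor = legislators_by_id.fetch(bill["sponsor_id"])
  basic = sponsor.select { |field, _| BASIC_LEGISLATOR_FIELDS.include?(field) }
  bill.merge("sponsor" => basic)
end

legislators = {
  "A000001" => {
    "bioguide_id" => "A000001", "first_name" => "Ada", "last_name" => "Example",
    "party" => "I", "office" => "123 Example Building"
  }
}

bill = embed_sponsor({ "bill_id" => "hr1-113", "sponsor_id" => "A000001" }, legislators)
puts bill["sponsor"].inspect
```

The trade-off is the usual one for denormalized stores: reads stay a single query, but the loading task has to re-copy the fields whenever the source document changes.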

Setup - Dependencies

If you don't have Bundler, install it:

gem install bundler

Then use Bundler to install the Ruby dependencies:

bundle install --local

If you're going to use any of the Python-based tasks, install virtualenv and virtualenvwrapper, make a new virtual environment, and install the Python dependencies:

mkvirtualenv congress-api
pip install -r tasks/requirements.txt

Some tasks use PDF text extraction, which is performed through the docsplit gem. If you use a task that does this, you will need to install a system dependency, pdftotext.

On Linux:

sudo apt-get install poppler-utils poppler-data

Or on OS X:

brew install poppler

Setup - Configuration

Copy the example config files:

cp config/config.yml.example config/config.yml
cp config/mongoid.yml.example config/mongoid.yml
cp config.ru.example config.ru

You don't need to edit these to get started in development; the defaults should work fine.

In production, you may wish to turn on the API key requirement, and add SMTP server details so that mail can be sent to admins and task owners.

If you work for the Sunlight Foundation, and want it to sync analytics and API keys with HQ, you'll need to update the services section with a shared_secret.

Read the documentation in config.yml.example for a description of each element.

Setup - Services

You can get started by just installing MongoDB.

The Congress API depends on MongoDB, a JSONic document store, for just about everything. MongoDB can be installed via apt, homebrew, or manually.

Optional. Some tasks that index full text will require Elasticsearch, a JSONic full-text search engine based on Lucene. Elasticsearch can be installed via apt, or manually.

Optional. If you want citation parsing, you'll need to install citation, a Node-based citation extractor. After installing Node, you can install it with [sudo] npm -g install citation, then run it via cite-server on port 3000.

Optional. To perform location lookups, you'll need to point the API at an instance of pentagon, a boundary service. Sunlight uses an instance loaded with congressional districts and ZCTAs, so that we can look up legislators and districts by either latitude/longitude or zip.

Starting the API

After installing dependencies and MongoDB, and copying the config files, boot the app with:

bundle exec unicorn

The API should return some enthusiastic JSON at http://localhost:8080.

Specify --port to use a port other than 8080.

Running tasks

The API uses rake to run data loading tasks, and various other API maintenance tasks.

Every directory in tasks/ generates an automatic rake task, like:

rake task:hearings_house

This will look in tasks/hearings_house/ for either a hearings_house.rb or hearings_house.py.

Ruby tasks should define a class named after the file, e.g. HearingsHouse, with a class-level run method that accepts a hash of options.

Python tasks should just define a run method that accepts a dict of options.
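The naming convention can be sketched as follows. The helper names are made up for this example, and checking the Ruby file before the Python file is an assumption; the text only says that either file may exist.

```ruby
# Convert a snake_case task name into the CamelCase class name
# a Ruby task is expected to define.
def task_class_name(task_name)
  task_name.split("_").map(&:capitalize).join
end

# List the candidate script paths for a task (Ruby first, then Python).
def task_script_candidates(task_name, root = "tasks")
  %w[rb py].map { |ext| File.join(root, task_name, "#{task_name}.#{ext}") }
end

puts task_class_name("hearings_house")
puts task_script_candidates("hearings_house").inspect
```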

Options will be read from the command line using env syntax, for example:

rake task:hearings_house month=2014-01

The options hash will also include an additional config key that contains the parsed contents of config/config.yml, so that tasks have access to API configuration details.

So rake task:hearings_house month=2014-01 will execute:

HearingsHouse.run({
  month: "2014-01",
  config: {
    # ...parsed config.yml details...
  }
})

Task files should define the options they accept at the top of the file, in comments.
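A hypothetical task file following these conventions might look like the sketch below. The ExampleHearings class, its limit option, and the parse_task_options helper are invented for illustration; the real task runner's parsing code is not shown in this document.

```ruby
# options:
#   month: a month in YYYY-MM form to load hearings for
#   limit: maximum number of hearings to process (hypothetical option)

class ExampleHearings
  # Class-level entry point; receives the options hash.
  def self.run(options = {})
    month = options[:month]
    limit = options[:limit] ? options[:limit].to_i : nil
    "month=#{month.inspect} limit=#{limit.inspect}"
  end
end

# Turn env-style arguments ("key=value") into a symbol-keyed options
# hash, adding the parsed configuration under :config.
def parse_task_options(args, config)
  options = args.each_with_object({}) do |arg, opts|
    key, value = arg.split("=", 2)
    opts[key.to_sym] = value unless value.nil?
  end
  options.merge(config: config)
end

options = parse_task_options(["month=2014-01", "limit=5"], { "debug" => true })
puts ExampleHearings.run(options)
```

Note that values arrive as strings, so a task that wants a number or a date has to convert the option itself.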

Task Reporting

Tasks can file "reports" as they operate. Reports will be stored in the database, and reports with certain status will be emailed to the admin and any task-specific owners (as configured in config.yml).

Since this is MongoDB, any other useful data can simply be dumped onto the report document.

For example, a task might log warnings during its operation, and send a single warning email at the end:

if failures.any?
  Report.warning self, "Failed to process #{failures.size} reports", {failures: failures}
end

(In this case, self is the class of the task, e.g. GaoReports.)

Emails will be sent when filing failure or warning reports. You can also store note reports, and all tasks should file a success report at the end if they were successful.

The system will automatically file a complete report, with a record of how long a task took - tasks do not need to do this themselves.

Similarly, if an exception is raised during a task, the system will catch it and file (and email) a failure report.

Any task that encounters an error or something worth warning about should file a warning or failure report during operation. After a task completes, the system will examine the reports collection for any "unread" warning or failure reports, send emails for each one, and mark them as "read".
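The post-task sweep described above can be sketched with an in-memory list in place of the reports collection. ExampleReport, deliver_unread, and the subject-line format are assumptions for this example; a real implementation would query MongoDB and hand each message to a mailer.

```ruby
# A minimal stand-in for a document in the reports collection.
ExampleReport = Struct.new(:status, :source, :message, :read)

# Statuses that trigger an email.
EMAILED_STATUSES = %w[warning failure].freeze

# Find unread warning/failure reports, "send" each one, and mark it read.
# Returns the subject lines that would have been emailed.
def deliver_unread(reports)
  reports
    .select { |report| EMAILED_STATUSES.include?(report.status) && !report.read }
    .map do |report|
      report.read = true
      "[#{report.status.upcase}] #{report.source}: #{report.message}"
    end
end

reports = [
  ExampleReport.new("success", "GaoReports", "Loaded 12 reports", false),
  ExampleReport.new("warning", "GaoReports", "Failed to process 2 reports", false)
]

puts deliver_unread(reports).inspect
puts deliver_unread(reports).inspect # a second sweep finds nothing unread
```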

Undocumented features

This API has some endpoints and features that are not included in the public documentation, but are used in Sunlight tools.

Endpoints

  • /regulations - Material published in the Federal Register since 2009. Currently used in Scout.
  • /documents - Reports from the Government Accountability Office, and various inspectors general since 2009. Currently used in Scout.
  • /videos - Information on videos from the House floor and Senate floor, synced through the Granicus API. Currently used in Sunlight's Roku apps.

Citation detection

As bills, regulations, and documents are indexed into the system, they are first run through a citation extractor over HTTP.

Extracted citation data is stored locally, in Mongo, in a citations collection, using the Citation model. Excerpts of surrounding context are also stored then, at index-time.

The API accepts a citing parameter, of one or more (pipe-delimited) citation IDs, in the format produced by unitedstates/citation. Passing citing adds a filter (to either Mongo or Elasticsearch-based endpoints) of citation_ids__all, which limits results to only documents for which all given citation IDs were detected at index-time.
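A small sketch of how a citing value could become that filter, and what "all" means for a given document. The helper names and the citation_ids document field are assumptions inferred from the citation_ids__all filter name, not code from this repository.

```ruby
# Build the filter implied by a citing parameter:
# split on pipes and require every ID to be present.
def citing_filter(citing)
  ids = citing.to_s.split("|").map(&:strip).reject(&:empty?)
  ids.empty? ? {} : { "citation_ids__all" => ids }
end

# Check whether a document's detected citations include every requested ID.
def cites_all?(document, filter)
  required = filter.fetch("citation_ids__all", [])
  (required - document.fetch("citation_ids", [])).empty?
end

filter = citing_filter("usc/5/552|usc/18/1905")
document = { "bill_id" => "s2141-113", "citation_ids" => ["usc/5/552", "usc/18/1905", "usc/21/331"] }

puts filter.inspect
puts cites_all?(document, filter)
```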

If a citing.details parameter is passed with a value of true, then every returned result will trigger a quick database lookup for the citations associated with that document, and citation details (including the surrounding match context) will be added to that document as a citations field.

For example, a search for:

/bills?citing=usc/5/552&citing.details=true&per_page=1&fields=bill_id

Might return something like:

{
  "results": [
    {
      "bill_id": "s2141-113",
      "citations": [
        {
          "type": "usc",
          "match": "section 552(b) of title 5",
          "index": 8624,
          "excerpt": "disclosure pursuant to section 1905 of title 18, United States Code, section 552(b) of title 5, United States Code, or section 301(j) of this Act.",
          "usc": {
            "title": "5",
            "section": "552",
            "subsections": [],
            "id": "usc/5/552",
            "section_id": "usc/5/552"
          }
        }
      ]
    }
  ]
}

License

This project is licensed under the GPL v3.
