Sunlight Congress API
This is the code that powers the Sunlight Foundation's Congress API.
Overview
The Congress API has two parts:
- A light front end, written in Ruby using Sinatra.
- A back end of data scraping and loading tasks. Most are written in Ruby, but Python tasks are also supported.
The front end is essentially read-only. Its job is to translate an API call (the query string) into a single database query (usually to MongoDB), wrap the resulting JSON in a bit of pagination metadata, and return it to the user.
Endpoints and behavior are determined by introspecting on the classes defined in models/
. These classes are also expected to define database indexes where applicable.
The front end tries to maintain as little model-specific logic as possible. There are a couple of exceptions made (like allowing disabling of pagination for /legislators
) — but generally, adding a new endpoint is as simple as adding a model class.
The back end is a set of tasks (scripts) whose job is to write data to the collections those models refer to. Most data is stored in MongoDB, but some tasks will store additional data in Elasticsearch, and some tasks may extract citations via a citation server.
We currently manage these tasks via cron. A small task runner wraps each script in order to ensure any "reports" created along the way get emailed to admins, to catch errors, and to parse command line options.
While the front end and back end are mostly decoupled, many of them do use the definitions in models/
to save data (via Mongoid) and to manage duplicating "basic" fields about objects onto other objects.
The API never performs joins -- if data from one collection is expected to appear as a sub-field on another collection, it should be copied there during data loading.
Setup - Dependencies
If you don't have Bundler, install it:
gem install bundler
Then use Bundler to install the Ruby dependencies:
bundle install --local
If you're going to use any of the Python-based tasks, install virtualenv and virtualenvwrapper, make a new virtual environment, and install the Python dependencies:
mkvirtualenv congress-api
pip install -r tasks/requirements.txt
Some tasks use PDF text extraction, which is performed through the docsplit gem. If you use a task that does this, you will need to install a system dependency, pdftotext
.
On Linux:
sudo apt-get install poppler-data
Or on OS X:
brew install poppler
Setup - Configuration
Copy the example config files:
cp config/config.yml.example config/config.yml
cp config/mongoid.yml.example config/mongoid.yml
cp config.ru.example config.ru`
You don't need to edit these to get started in development, the defaults should work fine.
In production, you may wish to turn on the API key requirement, and add SMTP server details so that mail can be sent to admins and task owners.
If you work for the Sunlight Foundation, and want it to sync analytics and API keys with HQ, you'll need to update the services
section with a shared_secret
.
Read the documentation in config.yml.example for a description of each element.
Setup - Services
You can get started by just installing MongoDB.
The Congress API depends on MongoDB, a JSONic document store, for just about everything. MongoDB can be installed via apt, homebrew, or manually.
Optional. Some tasks that index full text will require Elasticsearch, a JSONic full-text search engine based on Lucene. Elasticsearch can be installed via apt, or manually.
Optional. If you want citation parsing, you'll need to install citation, a Node-based citation extractor. After installing Node, you can install it with [sudo] npm -g install citation
, then run it via cite-server
on port 3000.
Optional. To perform location lookups, you'll need to point the API at an instance of pentagon, a boundary service. Sunlight uses an instance loaded with congressional districts and ZCTAs, so that we can look up legislators and districts by either latitude
/longitude
or zip
.
Starting the API
After installing dependencies and MongoDB, and copying the config files, boot the app with:
bundle exec unicorn
The API should return some enthusiastic JSON at http://localhost:8080
.
Specify --port
to use a port other than 8080.
Running tasks
The API uses rake
to run data loading tasks, and various other API maintenance tasks.
Every directory in tasks/
generates an automatic rake
task, like:
rake task:hearings_house
This will look in tasks/hearings_house/
for either a hearings_house.rb
or hearings_house.py
.
Ruby tasks should define a class named after the file, e.g. HearingsHouse
, with a class-level run
method that accepts a hash of options.
Python tasks should just define a run
method that accepts a dict of options.
Options will be read from the command line using env syntax, for example:
rake task:hearings_house month=2014-01
The options hash will also include an additional config
key that contains the parsed contents of config/config.yml
, so that tasks have access to API configuration details.
So rake task:hearings_house month=2014-01
will execute:
HearingsHouse.run({
month: "2014-01",
config: {
# ...parsed config.yml details...
}
})
Task files should define the options they accept at the top of the file, in comments, like so.
Task Reporting
Tasks can file "reports" as they operate. Reports will be stored in the database, and reports with certain status will be emailed to the admin and any task-specific owners (as configured in config.yml
).
Since this is MongoDB, any other useful data can simply be dumped onto the report document.
For example, a task might log warnings during its operation, and send a single warning email at the end:
if failures.any?
Report.failure self, "Failed to process #{failures.size} reports", {failures: failures}
end
(In this case, self
is the class of the task, e.g. GaoReports
.)
Emails will be sent when filing failure
or warning
reports. You can also store note
reports, and all tasks should file a success
report at the end if they were successful.
The system will automatically file a complete
report, with a record of how long a task took - tasks do not need to do this themselves.
Similarly, if an exception is raised during a task, the system will catch it and file (and email) a failure
report.
Any task that encounters an error or something worth warning about should file a warning
or failure
report during operation. After a task completes, the system will examine the reports collection for any "unread" warning
or failure
reports, send emails for each one, and mark them as "read".
Undocumented features
This API has some endpoints and features that are not included in the public documentation, but are used in Sunlight tools.
Endpoints
/regulations
- Material published in the Federal Register since 2009. Currently used in Scout.
/documents
- Reports from the Government Accountability Office, and various inspectors general since 2009. Currently used in Scout.
/videos
- Information on videos from the House floor and Senate floor, synced through the Granicus API. Currently used in Sunlight's Roku apps.
Citation detection
As bills, regulations, and documents are indexed into the system, they are first run through a citation extractor over HTTP.
Extracted citation data is stored locally, in Mongo, in a citations
collection, using the Citation
model. Excerpts of surrounding context are also stored then, at index-time.
The API accepts a citing
parameter, of one or more (pipe-delimited) citation IDs, in the format produced by unitedstates/citation. Passing citing
adds a filter (to either Mongo or Elasticsearch-based endpoints) of citation_ids__all
, which limits results to only documents for which all given citation IDs were detected at index-time.
If a citing.details
parameter is passed with a value of true
, then every returned result will trigger a quick database lookup for those associated citations for that document, and citation details (including the surrounding match context) will be added to that document as a citation
field.
For example, a search for:
/bills?citing=usc/5/552&citing.details=true&per_page=1&fields=bill_id
Might return something like:
{
"results": [
{
"bill_id": "s2141-113",
"citations": [
{
"type": "usc",
"match": "section 552(b) of title 5",
"index": 8624,
"excerpt": "disclosure pursuant to section 1905 of title 18, United States Code, section 552(b) of title 5, United States Code, or section 301(j) of this Act.",
"usc": {
"title": "5",
"section": "552",
"subsections": [],
"id": "usc/5/552",
"section_id": "usc/5/552"
}
}
]
}
]
}