• This repository has been archived on 03/May/2022
  • Stars
    star
    121
  • Rank 293,866 (Top 6 %)
  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created over 14 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date

Capitol Words NG - The Upgrade

This repository is no longer maintained and exists for archival purposes only.

See archival README information.

Developed by Chartbeat Engineers during hackweeks!

The API

The API is built on Django and it's broken into the following apps:

* cwapi - document search with elastic search
* legislators - database of congress people
* scraper - service to scrape CREC documents from gpo.gov and stage them in s3
* parser - service to extract metadata from CREC documents and upload to elasticsearch and rds

Setup

Most of the ops related settings are setup via the django settings module.

elasticsearch

For the elasticsearch cluster, ES_URL and ES_CW_INDEX need to contain the url with port for an elasticsearch cluster and the name of an index (this is used for both reading with the API and uploading to with the parser).

CapitolWords is written against Elasticsearch 5.4. To spin up a local instance of Elasticsearch for development, run the following command:

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:5.4.3

You may need to increase Docker's default memory limit from 2GB to 4GB to use Elasticsearch.

cwapi

To set up the API do the following:

    cd Capitol-Words/capitolweb
    python manage.py migrate
    python manage.py loadcongress
    python manage.py createsuperuser (if you want to use the admin tool to browse)
    python manage.py runserver

The search API is backed by elastic search and uses elasticsearch-dsl-py

There are three endpoints supported:

  1. cwapi/term_counts_by_day/
    • Counts the occurrences of a provided term in all CREC docs over a given time period and returns the total count per day over that date range.
    • Args:
      • start_date: A timestamp for the start of a date range filter, format: YYYY-MM-DD (inclusive).
      • end_date: End of date range filter, format: YYYY-MM-DD (inclusive).
      • days_ago: Number of days before the current date to retrieve results for.
      • term: A search term (or terms, space separated) to count occurrences for.
    • Example:
      • Request: http://localhost:8000/cwapi/term_counts_by_day/?term=russia&start_date=2017-03-01&end_date=2017-03-30
      • Response:
        {
          "status": "success",
          "data": {
            "daily_counts": {
              "2017-03-01": 5,
              "2017-03-02": 45,
              "2017-03-03": 0,
              "2017-03-04": 0,
              "2017-03-05": 0,
              "2017-03-06": 1,
              "2017-03-07": 2,
              "2017-03-08": 3,
              "2017-03-09": 1,
              "2017-03-10": 0,
              "2017-03-11": 0,
              "2017-03-12": 0,
              "2017-03-13": 1,
              "2017-03-14": 0,
              "2017-03-15": 20,
              "2017-03-16": 3,
              "2017-03-17": 0,
              "2017-03-18": 0,
              "2017-03-19": 0,
              "2017-03-20": 1,
              "2017-03-21": 0,
              "2017-03-22": 3,
              "2017-03-23": 33,
              "2017-03-24": 0,
              "2017-03-25": 0,
              "2017-03-26": 0,
              "2017-03-27": 10,
              "2017-03-28": 25,
              "2017-03-29": 29,
              "2017-03-30": 26
            }
          }
        }
  2. cwapi/search/
    • Performs elasticsearch match queries using the provided terms.
    • Args:
      • start_date: A timestamp for the start of a date range filter, format: YYYY-MM-DD (inclusive).
      • end_date: End of date range filter, format: YYYY-MM-DD (inclusive).
      • days_ago: Number of days before the current date to retrieve results for.
      • title: Search for docs with this value in the "title" of the CREC doc.
      • speaker: Search for docs with this speaker listed in the CREC doc.
      • content: Search for docs that contain this value in the body of the CREC doc.
    • Example:
      • Request: http://localhost:8000/cwapi/search/?content=russia&start_date=2017-03-01&end_date=2017-03-30&size=1
      • Response:
        {
            "status": "success",
            "data": [
                {
                    "title": "RUSSIA AND TRUMP CAMPAIGN INVESTIGATION",
                    "title_part": "Senate",
                    "date_issued": "2017-03-30",
                    "content": "...",
                    "crec_id": "id-CREC-2017-03-30-pt1-PgS2140",
                    "pdf_url": "https://www.gpo.gov/fdsys/pkg/CREC-2017-03-30/pdf/CREC-2017-03-30-pt1-PgS2140.pdf",
                    "html_url": "https://www.gpo.gov/fdsys/pkg/CREC-2017-03-30/html/CREC-2017-03-30-pt1-PgS2140.htm",
                    "page_start": "S2140",
                    "page_end": "S2141",
                    "speakers": "Tom Udall,Jon Tester",
                    "segments": [
                        {
                            "id": "id-CREC-2017-03-30-pt1-PgS2140-1",
                            "text": "RUSSIA AND TRUMP CAMPAIGN INVESTIGATION. "
                        },
                        {
                            "id": "id-CREC-2017-03-30-pt1-PgS2140-2",
                            "speaker": "Mr. UDALL",
                            "text": "...",
                            "bioguide_id": "U000039"
                        },
                        {
                            "id": "id-CREC-2017-03-30-pt1-PgS2140-3",
                            "speaker": "The PRESIDING OFFICER",
                            "text": "..."
                        },
                        {
                            "id": "id-CREC-2017-03-30-pt1-PgS2140-4",
                            "speaker": "Mr. TESTER",
                            "text": "...",
                            "bioguide_id": "T000464"
                        },
                        {
                            "id": "id-CREC-2017-03-30-pt1-PgS2140-5",
                            "speaker": "The PRESIDING OFFICER",
                            "text": "..."
                        }
                    ],
                    "score": 7.5042105
                },
            ]
        }
  3. cwapi/count
    • Uses the same methods as #1 and #2 to return the collected data for a search results page. This includes daily histograms for the current period and the previous one, as well as one page of the documents themselves. Currently only supports searching the "content" field of the CREC docs.
    • Args:
      • start_date: A timestamp for the start of a date range filter, format: YYYY-MM-DD (inclusive).
      • end_date: End of date range filter, format: YYYY-MM-DD (inclusive).
      • days_ago: Number of days before the current date to retrieve results for.
      • term: A search term (or terms, space separated) to search by (in the "content" field) and count occurrences.

legislators api

The legislators are pulled from the unitedstates/congress-legislators project and stored in the database.

Endpoints:

/legislators/search
/legislators/person/<bioguide_id>
/legislators/current?state=<state 2 letter code>

person allows lookup by the bioguide id current returns all of the current legislators with an optional 2 letter state code search allows for more complex queries:

    Search by query params
    supports the following ?term=val
    - id - the db id
    - name - matches against the official_full
    - last_name - last_name
    - gender
    - religion

    additionally supports boolean current to match only current reps

    example:
        http://127.0.0.1:8000/legislators/search/?gender=F&religion=Jewish
        http://127.0.0.1:8000/legislators/search/?gender=F&religion=Jewish&current

The record returned includes Term objects for every term served by the Congress Person along with bio data and ids to other databases and services.

Frontend (single-page javascript app)

The frontend is built in Javascript using a React/Redux stack. The project is in the frontend/capitolwords-spa directory and was initially created using the create-react-app tool (https://github.com/facebookincubator/create-react-app), so it should be easily updateable to newer frontend best-practices by following the upgrade path outlined in create-react-app's documentation (see the README.md in the capitolwords-spa directory or read the docs on their site).

To run the frontend app, make sure you have a recent version of node installed (this is tested with node 6.9.3). I'm using yarn instead of npm since that seems to be the norm for create-react-app. You can use npm (installed automatically with node) instead of yarn if you want. Just replace yarn with npm in the commands below. Do the following to get up and running:

bash
cd frontend/capitolwords-spa
yarn install # This installs all frontend dependencies.
yarn start   # This runs the frontend development server on port 3000.

The frontend app depends on the APIs, so you'll need to also be running the django-based API as outlined above. The frontend development server is setup to proxy all API requests to http://localhost:8000 -- if your django is not running on that port, or you want to proxy somewhere else, you should be able to change this by altering the proxy setting in the frontend/capitolwords-spa/package.json file. Note that you'll have to restart the frontend development server for that change to take effect.

Data Pipeline

Data ingestion consists of a scraper that pulls in CREC data from gpo.gov then stages it in S3 and a parser that reads that staged S3 data extracts some additional metadata then uploads those documents to elasticsearch.

The staging location in S3 is determined by the following settings in the main capitolweb settings file:

* `CREC_STAGING_S3_BUCKET`: Name of S3 bucket to stage files in.
* `CREC_STAGING_S3_ROOT_PREFIX`: All S3 keys for files will be prefixed with this value.

The key format is a combination of the value of CREC_STAGING_S3_ROOT_PREFIX, the date the files were recorded for ("dateIssued" in the mods.xml metadata), and the filename within the gpo.gov zip file.

scraper

All CREC docs for a given day are available as a zip file from gpo.gov. Metadata for these docs is available in a single xml file, mods.xml, contained within that zip file.

The scraper takes a date range and pulls in each zip file for every day in that range (days when congress was not in session are ignored). Its run through a custom django manage command:

./manage.py run_crec_scraper --start_date=2016-01-20 --end_date=2016-01-21

parser

The parser looks up the mods.xml file in the staged S3 data and extracts the metadata specific to each CREC document. It also does some NLP analysis of the content of each document. The resulting parsed data is uploaded to elasticsearch and the django configured db.

Example usage:

./manage.py run_crec_parser --start_date=2016-01-20 --end_date=2016-01-21

Tests

Most of the test cases included are integration tests that required a live elasticsearch cluster configured in the django settings file. A separate test index is defined in the capitolweb.settings_test module. manage.py has been modified to ensure that running ./manage.py test will override any environment variable settings and always use the test settings.

You should be able run the scraper and parser tests without any external dependencies as they rely on moto to mock out S3 operations.

More Repositories

1

upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
HTML
1,614
star
2

guides

ProPublica's News App and Data Style Guides
1,163
star
3

compas-analysis

Data and analysis for 'Machine Bias'
Jupyter Notebook
600
star
4

weepeople

A typeface of people sillhouettes, to make it easy to build web graphics featuring little people instead of dots.
489
star
5

stateface

A typeface of U.S. state shapes to use in web apps.
HTML
359
star
6

timeline-setter

A tool to create HTML timelines from spreadsheets of events.
JavaScript
328
star
7

nyc-dna-software

The source code, acquired by ProPublica, for New York City's Forensic Statistical Tool.
C#
318
star
8

facebook-political-ads

Monitoring Facebook Political Ads
HTML
237
star
9

daybreak

A simple-dimple key value store for ruby.
HTML
236
star
10

sunlight-congress

The Sunlight Foundation's Congress API. Shut down on Oct. 1, 2017.
Ruby
169
star
11

landline

Simple SVG maps that work everywhere.
HTML
166
star
12

qis

Quick Instagram search tool
HTML
158
star
13

column-setter

Custom responsive grids in Sass that work in older browsers.
SCSS
130
star
14

politwoops-tweet-collector

Python workers that collect tweets from the twitter streaming api and track deletions
Python
120
star
15

simple-tiles

Simple tile generation for maps.
C
106
star
16

django-collaborative

ProPublica's collaborative tip-gathering framework. Import and manage CSV, Google Sheets and Screendoor data with ease.
Python
99
star
17

transcribable

Drop in crowdsourcing for your Rails app. Extracted from Free the Files.
Ruby
84
star
18

schooner-tk

A collection of (hopefully) useful utilities for working with satellite images.
C++
71
star
19

newsappmodel

Conceptual Model for News Applications
58
star
20

table-setter

Easy Peasy CSV to HTML
JavaScript
57
star
21

ilcampaigncash

Load Illinois political contribute and spending data efficiently
TSQL
57
star
22

congress-api-docs

Documentation for the ProPublica Congress API
HTML
54
star
23

campaign_cash

A Ruby client for interacting with ProPublica Campaign Finance API
Ruby
52
star
24

politwoops_sunlight

Politwoops web front end
CSS
44
star
25

data-institute-2019

Materials for the ProPublica Data Institute 2019
43
star
26

table-fu

A utility for spreadsheet-style handling of arrays (e.g. filtering, formatting, and sorting)
Ruby
35
star
27

fakenator

PHP
27
star
28

staffers

Interactive and searchable House staffer directory, based on House disbursement data.
HTML
26
star
29

data-institute-2018

For students of https://projects.propublica.org/graphics/ida-propublica-data-institute
26
star
30

vid-skim

Transcripts and commentary for long boring videos on YouTube!
Ruby
26
star
31

simpler-tiles

Ruby bindings for Simple Tiles
HTML
25
star
32

cookcountyjail2

A new version of the cook county jail scraper, inspired by the Supreme Chi-Town Coding Crew
HTML
23
star
33

disbursements

Data and scripts relating to the publishing of the House expenditure reports, and hopefully the Senate's in future.
Ruby
23
star
34

pixel-pong

Interface and data-collection backend for PixelPing.
JavaScript
22
star
35

propertyassessments

Analysis behind the "How the Cook County Assessor Failed Taxpayers"
R
22
star
36

data-nicar-2019

Nicar ML/NLP workshop by J Kao
Jupyter Notebook
19
star
37

thinner

Slow purges for varnish useful on app deploys.
Ruby
17
star
38

data-institute-2021

13
star
39

transcript-audio-sync

Tools for synchronizing audio and text on a webpage
JavaScript
12
star
40

pac-donor-similarity

Cosine similarity scores for PAC donors to federal candidates
10
star
41

capitol_words_nlp

Experimenting with parsing the congressional record using NLP techniques and tools
Python
9
star
42

redactor

Tool to remove email addresses, person entities, and phone numbers from a text
Python
9
star
43

auditData

data and scripts for https://projects.propublica.org/graphics/eitc-audit
R
9
star
44

il-tickets-notebooks

Explore Chicago ticket data.
Jupyter Notebook
9
star
45

il-ticket-loader

Load and analyze Chicago parking and camera ticket data
Jupyter Notebook
7
star
46

fbpac-api-public

API supporting more complex queries on the database of ads gathered by github.com/propublica/facebook-political-ads
Ruby
6
star
47

data-institute-2022

6
star
48

collaborative-playbook

5
star
49

northern-il-federal-gun-cases

Jupyter Notebook
5
star
50

d4dPartD-analysis

analysis of doctors' promotional payments from drug companies and their prescribing behavior
R
4
star
51

table-setter-generator

A rails generator for table-setter
JavaScript
4
star
52

institute-files

Data Institute Lessons
4
star
53

pentagon

CartoCSS
3
star
54

campaign-finance-api-docs

Documentation for campaign finance API
3
star
55

collaborative-playbook-pt

Collaborative Playbook in Portuguese
2
star
56

vital-signs-hackathon

1
star
57

political-ad-collector

web landing page for propublica's political ad collector
CSS
1
star