  • Stars: 164
  • Rank: 229,995 (Top 5 %)
  • Language: Python
  • License: Other
  • Created about 9 years ago
  • Updated almost 9 years ago

Repository Details

Code to transform Hillary's emails from raw PDF documents to a SQLite database

hillary-clinton-emails

This is a work in progress. Any help normalizing and extracting this data is much appreciated!

This repo contains code to transform Hillary Clinton's emails released through the FOIA request from raw PDF documents to CSV files and a SQLite database, making it easier to understand and analyze the documents.

A zip of the extracted data is available for download on Kaggle.

Check out some analytics on this data on Kaggle Scripts.

Note that conversion is very imprecise: there's plenty of room to improve the PDF conversion, the sender/receiver extraction, and the body text extraction.

Extracted data

This produces five main output files: four CSV files and one SQLite database.

Note that each table contains a numeric Id column. This Id column is only meant to be used to join the tables: it is internally consistent, but each entity may have a different Id when the data is updated.

Emails.csv

This file currently contains the following fields:

  • Id - unique identifier for internal reference
  • DocNumber - FOIA document number
  • MetadataSubject - Email SUBJECT field (from the FOIA metadata)
  • MetadataTo - Email TO field (from the FOIA metadata)
  • MetadataFrom - Email FROM field (from the FOIA metadata)
  • SenderPersonId - PersonId of the email sender (linking to Persons table)
  • MetadataDateSent - Date the email was sent (from the FOIA metadata)
  • MetadataDateReleased - Date the email was released (from the FOIA metadata)
  • MetadataPdfLink - Link to the original PDF document (from the FOIA metadata)
  • MetadataCaseNumber - Case number (from the FOIA metadata)
  • MetadataDocumentClass - Document class (from the FOIA metadata)
  • ExtractedSubject - Email SUBJECT field (extracted from the PDF)
  • ExtractedTo - Email TO field (extracted from the PDF)
  • ExtractedFrom - Email FROM field (extracted from the PDF)
  • ExtractedCc - Email CC field (extracted from the PDF)
  • ExtractedDateSent - Date the email was sent (extracted from the PDF)
  • ExtractedCaseNumber - Case number (extracted from the PDF)
  • ExtractedDocNumber - Doc number (extracted from the PDF)
  • ExtractedDateReleased - Date the email was released (extracted from the PDF)
  • ExtractedReleaseInPartOrFull - Whether the email was partially censored (extracted from the PDF)
  • ExtractedBodyText - Attempt to only pull out the text in the body that the email sender wrote (extracted from the PDF)
  • RawText - Raw email text (extracted from the PDF)
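
Because Emails.csv carries both the FOIA metadata and the PDF-extracted version of several fields, comparing the two is one rough way to spot rows where extraction went wrong. The following is a minimal sketch under stated assumptions, not part of this repo: the sample rows are made up, only a subset of the columns is shown, and the helper names normalize and subject_mismatches are hypothetical.

```python
import csv
import io

# Hypothetical sample rows using a subset of the Emails.csv columns.
SAMPLE = """Id,DocNumber,MetadataSubject,ExtractedSubject
1,C0000001,MEETING NOTES,Meeting notes
2,C0000002,TRAVEL PLAN,
3,C0000003,CALL SHEET,Re: schedule
"""


def normalize(text):
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(text.lower().split())


def subject_mismatches(csv_file):
    """Return Ids of rows whose extracted subject differs from the metadata subject."""
    mismatched = []
    for row in csv.DictReader(csv_file):
        if normalize(row["MetadataSubject"]) != normalize(row["ExtractedSubject"]):
            mismatched.append(int(row["Id"]))
    return mismatched


mismatches = subject_mismatches(io.StringIO(SAMPLE))
print(mismatches)
```

To try this on real output, pass an open file handle for Emails.csv instead of the in-memory sample; where that file lives depends on how you ran the extraction.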

Persons.csv

  • Id - unique identifier for internal reference
  • Name - person's name

Aliases.csv

  • Id - unique identifier for internal reference
  • Alias - text in the From/To email fields that refers to the person
  • PersonId - person that the alias refers to

EmailReceivers.csv

  • Id - unique identifier for internal reference
  • EmailId - Id of the email
  • PersonId - Id of the person that received the email

database.sqlite

This SQLite database contains all of the above tables (Emails, Persons, Aliases, and EmailReceivers) with their corresponding fields. The schema and ingest code are in scripts/sqlImport.sql.
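
As an illustration of how the Id columns join the tables, here is a small sketch that lists each email's sender and receivers. It runs against an in-memory stand-in with a simplified, assumed schema and made-up rows; the real schema is the one in scripts/sqlImport.sql and may differ in column types and constraints.

```python
import sqlite3

# In-memory stand-in for database.sqlite with a simplified, assumed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Persons (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Emails (Id INTEGER PRIMARY KEY, SenderPersonId INTEGER, MetadataSubject TEXT);
CREATE TABLE EmailReceivers (Id INTEGER PRIMARY KEY, EmailId INTEGER, PersonId INTEGER);
INSERT INTO Persons VALUES (1, 'Person A'), (2, 'Person B'), (3, 'Person C');
INSERT INTO Emails VALUES (10, 1, 'Schedule'), (11, 2, 'Follow-up');
INSERT INTO EmailReceivers VALUES (100, 10, 2), (101, 10, 3), (102, 11, 1);
""")

# One output row per (email, receiver) pair, with names resolved via Persons.
QUERY = """
SELECT e.MetadataSubject, s.Name AS Sender, r.Name AS Receiver
FROM Emails e
JOIN Persons s ON s.Id = e.SenderPersonId
JOIN EmailReceivers er ON er.EmailId = e.Id
JOIN Persons r ON r.Id = er.PersonId
ORDER BY e.Id, er.Id
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
conn.close()
```

Against the real file, replace ":memory:" with the path to database.sqlite and drop the CREATE and INSERT statements. Note that the inner joins skip emails whose sender could not be matched to a person.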

Contributing: next steps

  • Improve the From/To address extraction mechanisms
  • Normalize various email address representations to people
  • Improve the BodyText extraction
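
As a starting point for the normalization step, one possible approach is to canonicalize the raw From/To text before looking it up in the alias table. This is a hedged sketch, not the repo's current method: the alias rows are invented, and normalize_alias and resolve are hypothetical helper names.

```python
import re


def normalize_alias(raw):
    """Lowercase, drop punctuation other than '@' and '.', and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9@. ]+", " ", raw.lower())
    return " ".join(cleaned.split())


# Hypothetical rows in the shape of Aliases.csv: (Alias, PersonId).
ALIASES = [("doe jane", 1), ("jane.doe@example.com", 1), ("smith john", 2)]
ALIAS_TO_PERSON = {normalize_alias(alias): person_id for alias, person_id in ALIASES}


def resolve(raw_field):
    """Map a raw From/To value to a PersonId, or None when no alias matches."""
    return ALIAS_TO_PERSON.get(normalize_alias(raw_field))


print(resolve("Doe, Jane"))
print(resolve("  JANE.DOE@example.com "))
print(resolve("Unknown Sender"))
```

Exact matching after normalization only catches cosmetic variation such as case, punctuation, and spacing. OCR errors and reordered names would need fuzzier matching, which this sketch does not attempt.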

Running the download and extraction code

Running make all in the root directory will download the data (~162 MB total) and create the output files, assuming you have all the requirements installed.

Requirements

This has only been tested on OS X; it may or may not work on other operating systems.

  • python3
    • pandas
    • arrow
    • numpy
  • pdftotext (utility to transform a PDF document to text)
  • GNU make
  • sqlite3
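
Before running make all, it can help to check which of the requirements above are visible from your environment. The sketch below is an assumption-laden convenience, not something shipped with this repo: it only checks that a command named make exists on PATH (not that it is GNU make), and the function name missing_requirements is hypothetical.

```python
import importlib.util
import shutil

# Names taken from the requirements list; how each is installed varies by system.
COMMANDS = ["python3", "pdftotext", "make", "sqlite3"]
MODULES = ["pandas", "arrow", "numpy"]


def missing_requirements(commands=COMMANDS, modules=MODULES):
    """Return the commands not found on PATH and the modules that cannot be imported."""
    missing_commands = [c for c in commands if shutil.which(c) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_commands, missing_modules


missing_commands, missing_modules = missing_requirements()
print("Missing commands:", missing_commands or "none")
print("Missing Python modules:", missing_modules or "none")
```

Run it with the same python3 interpreter that make will use, since the module check reflects only the interpreter executing the script.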

References

The source PDF documents for this repo were downloaded from the WSJ Clinton Inbox search.

I created this project before I realized the WSJ had also open-sourced some of the code they used to create the Inbox Search. Subsequently, I've included some material from their open source project as well: I used their HRCEMAIL_names.csv to seed alias_person.csv. I also scraped metadata from foia.state.gov in a fashion similar to their downloadMetadata.py.
