Scraped data from the 2016 U.S. Election (President, Senate, House, Governor) and primaries, ballot measures and exit polls

United States 2016 Election results (President, Senate, House, Governor), plus Primaries, Ballot Measures and Exit Polls

These are CSVs of results scraped from the Politico and (for exit polls) CNN websites. If anyone uses this, I'd love to hear about it! I wrote a brief blog post describing my motivation.

For more about the ethics of scraping, see this Quora post. I used Selenium, which automated the Google Chrome browser, so I put no more load on their servers than a normal visitor. The only thing the program did faster than a human is parse the results.

I should address something here: you never *need* Selenium to web scrape. I used it because I wanted the results quickly and there was not much actual data, so the rate-limiting step was my programmer time. For the same reasons, the code is somewhat haphazardly written. Even so, I did my best to include safeguards such as log files to make sure I didn't go off the rails and that the results were as accurate as I could make them, which I think is the main point of the exercise.

Note that Selenium was handy because, in the case of Politico, the outer HTML only loaded completely when the browser reached the bottom of the page, so I needed a scraper that could "tell" the website it was really a browser and had scrolled all the way down. (Again, I could have intercepted the JSON requests, but Selenium was the low-hanging fruit: it is easy and I have the API memorized. I'm making such a fuss about this because there's a whole class of beginning web scrapers out there who use Selenium as a web scraping tool when they don't need to; it's a testing tool.)
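For readers wondering what "scrolling all the way down" looks like in practice, here is a minimal sketch. It is not the code from the notebooks: the function name `scroll_to_bottom` and its parameters are hypothetical, and it assumes only that the driver object exposes Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is assumed to be a Selenium WebDriver (anything with an
    `execute_script` method works). The name and defaults are illustrative.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        # Jump to the current bottom so lazily loaded content is requested.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new appeared, so we are at the bottom
        last_height = new_height
    return max_rounds


# Typical use (requires Selenium and a local Chrome install, so left commented):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(results_url)       # results_url is a placeholder, not a real page
# scroll_to_bottom(driver)
# html = driver.page_source     # now contains the fully loaded outer HTML
```

The pause between scrolls also keeps the request rate close to that of a human visitor.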

I make no warranty as to the accuracy of the data, either of the source data itself or of my scraping (although I did my best and tried to validate the completeness of my results). Please contact me if you find any issues not listed here.

I scraped most of the info from Politico, although the exit polls were from CNN. They include:

  • Both General Election and Primaries
  • Presidential, Senate, House and Governor races
  • Ballot Measures
  • Exit Polls
  • By State and by County, where possible.

Further downballot races like state legislatures were not listed.

DATA DICTIONARY:
  • ballot_measure_type: There are 'key ballot measures' and 'state ballot measures'; I am unclear as to the difference between them. Those with the word 'Overall' in this field are key ballot measures.
  • choice: The vote choice for a ballot measure, usually 'yes' or 'no'.
  • delegates: The number of delegates awarded to the winner of a primary. It is null for candidates who did not win the primary. (It would have made sense to replace these nulls with zeroes, but I didn't think of it in time.)
  • fips: The FIPS county code is a five-digit Federal Information Processing Standard (FIPS) code (FIPS 6-4) which uniquely identifies counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states.
  • geo_name: State name, District name or County name, as appropriate
  • individual_party: For general elections or open primaries, the party of the candidate. For closed primaries, this is blank because all the candidates belong to the same party, which is shown in the party field.
  • is_incumbent: Whether the candidate is an incumbent. The presidential race had no incumbents.
  • is_winner: Whether that candidate won that individual race; in some primary caucuses, where delegates are given out to multiple candidates, even if one candidate receives the most, there is no winner declared per se, so the value of this field is null instead of True or False.
  • name: Name of candidate
  • party: For races in which only one party participated, e.g. closed primaries, the name of that party; otherwise blank
  • rank: 1, 2, 3... signifying a first, second, third... place finish in that individual race
  • reporting_pct: percentage of votes cast that have been reported
  • state: full name of state, or District of Columbia
  • summary: text summary of the ballot measure
  • title: official, short title of the ballot measure
  • vote_pct: percentage of the vote cast in that race that went to that candidate
  • votes: number of votes that went to that candidate
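As a rough illustration of how these fields might be loaded, the sketch below reads a few made-up rows with pandas. The rows and the exact column set are assumptions for the example, not taken from the CSVs. It reads fips as a string so leading zeroes survive, and fills the null delegates with zero as suggested above.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real files may have more columns.
SAMPLE_CSV = """fips,geo_name,name,delegates,votes
01001,Example County,Candidate A,10,1200
01001,Example County,Candidate B,,800
"""

# Reading fips as str keeps the five-digit code intact ("01001", not 1001).
df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={"fips": str})

# delegates is null for candidates who did not win; treat that as zero.
df["delegates"] = df["delegates"].fillna(0).astype(int)

print(df)
```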

The following are known issues with the data as present on Politico (either the data just does not exist/is not reported, or it exists and for whatever reason, Politico does not have it):

  • Alaska has no Presidential General Election listed by County; perhaps Politico was confused by the fact that they call their counties "Boroughs", although they had no problem with Louisiana's "Parishes"
  • Between the primaries and the election, Shannon County (South Dakota) changed its name to Oglala Lakota County
  • Missing Kansas Presidential Primary by county
  • Missing North Dakota Presidential Primary by county
  • Colorado, Wyoming and Maine Republican primary results are not reported
  • No page exists for Louisiana Senate or House Primary
  • Missing Illinois 5th district Republican House Primary
  • Politico has a "missing something" box in Illinois 12th district House Primary, but both Democratic and Republican primary results are present.
  • Politico's page for Virginia House Primary lists individual district results for only the 2nd, 4th, 5th and 6th districts.
  • Minnesota's Presidential Primary is reported by congressional district, while its Presidential General Election is reported by county
  • Washington's open primaries for Senate and House mean that, by county or district, the primary results are not divided between political parties

If merging this data with county demographic data, be aware of the following:

  • The aforementioned Shannon County change to Oglala Lakota County in South Dakota; fips has changed from 46113 to 46102
  • In 2015, the county of Bedford City, VA, fips 51515, was amalgamated into Bedford County, VA, fips 51019. There is no more fips 51515, while fips 51019 now has greater area and population (and different demographics)
  • Kalawao County, Hawaii, fips 15005, has a population of 90, so its votes are counted in a neighboring county's totals (I could not determine which one)
  • As mentioned above, for some reason Alaska's Presidential General Election was not listed per county, and the Presidential Primaries were listed per Congressional electoral district instead of per county. So Alaska has no county data in this dataset. Blame Politico.
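One way to handle the two fips changes above before a merge is to map the old codes onto the new ones. This is a sketch under assumptions: `FIPS_REMAP` and `harmonize_fips` are names made up for the example, and it assumes fips is stored as a string or an integer (not a float).

```python
import pandas as pd

# Old code -> current code, taken from the caveats listed above.
FIPS_REMAP = {
    "46113": "46102",  # Shannon County, SD -> Oglala Lakota County, SD
    "51515": "51019",  # Bedford City, VA -> Bedford County, VA
}


def harmonize_fips(df, fips_col="fips"):
    """Return a copy with fips as a zero-padded string and old codes remapped."""
    out = df.copy()
    out[fips_col] = out[fips_col].astype(str).str.zfill(5).replace(FIPS_REMAP)
    return out


# Hypothetical rows for illustration only.
primaries = pd.DataFrame({"fips": ["46113", "51515", 1001], "votes": [10, 20, 30]})
fixed = harmonize_fips(primaries)
print(fixed)
```

After remapping, rows that were Bedford City share a code with Bedford County, so you would still need to sum them per candidate before comparing against demographic data.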

I have merged this data, given the caveats above, with Deleetdk's USA.county.data repo; the results are in /merged_with_demog. Note that the demog info also has some political history info, in columns (wide), while my election data is tall/normalized; if you want to compare between elections, that will take some wrangling.
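To give an idea of the wrangling involved, here is a small sketch that pivots tall election rows into one wide row per county, which is the shape the demographic columns use. The rows are invented for the example, and the choice of `pivot_table` with a sum is my assumption about a reasonable approach, not what the notebooks do.

```python
import pandas as pd

# Hypothetical tall data: one row per candidate per county.
tall = pd.DataFrame(
    {
        "fips": ["01001", "01001", "01001", "01001", "01003", "01003"],
        "individual_party": [
            "Republican",
            "Democratic",
            "Independent/Other",
            "Independent/Other",
            "Republican",
            "Democratic",
        ],
        "votes": [700, 250, 30, 20, 400, 600],
    }
)

# One row per county, one column per party; several "Independent/Other"
# candidates in the same county are summed together.
wide = tall.pivot_table(
    index="fips",
    columns="individual_party",
    values="votes",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```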

Note that if there is only one candidate with is_winner == True and votes == NaN, they ran uncontested.
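A caution if you filter for this in pandas: NaN never compares equal to anything, so `votes == NaN` matches nothing and you need `isna()` instead. A minimal sketch with invented rows:

```python
import pandas as pd

# Hypothetical rows; is_winner can be null for some caucuses.
results = pd.DataFrame(
    {
        "geo_name": ["District 1", "District 1", "District 2", "Caucus"],
        "name": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
        "is_winner": [True, False, True, None],
        "votes": [5000.0, 3000.0, float("nan"), 100.0],
    }
)

# Comparing against NaN directly finds no rows at all.
naive_matches = int((results["votes"] == float("nan")).sum())

# isna() is the reliable test for missing vote counts.
uncontested = results[results["is_winner"].eq(True) & results["votes"].isna()]
print(uncontested)
```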

A word about the individual_party and party fields. party is filled out when the entire subtable is for one party, e.g. a primary. individual_party is filled out when an individual line showing a candidate lists a political party.

Many of Politico's third-party names were wrong, so the only parties named are "Republican", "Democratic" and "Independent/Other".

Note that the number of votes is reported as a float instead of an integer due to a peculiarity of the pandas/pydata ecosystem: the votes column contains NaNs (the uncontested winners mentioned above), and NaN is a float value, so the column cannot use a plain integer dtype.
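The float behavior is easy to reproduce. The sketch below uses made-up numbers; the conversion to pandas' nullable "Int64" dtype is an option in newer pandas versions that I am suggesting here, not something the dataset itself uses.

```python
import pandas as pd

# A missing value forces an otherwise integer column to float64.
votes = pd.Series([1234, None, 567])
print(votes.dtype)  # float64

# Newer pandas versions offer a nullable integer dtype that keeps the gap.
votes_nullable = votes.astype("Int64")
print(votes_nullable)
```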

For the curious, the Jupyter notebooks I used to scrape the data are in the notebooks folder. They're not beautiful, but they work. The politico one, in particular, has lots of checks for consistency, both internally and with the demographic data mentioned above.