parliament-scaper

Public data scraper for parliamentary data from the EU and other parliaments

Ruby-Based Crawler Setup

Install git (if not present already)
Clone the project using git clone https://github.com/fossasia/parliament-scaper.git
Install Ruby (version >= 2.1) and Bundler
Run bundle install to install the required gems
Run the script using ruby eu_scraper.rb or ./eu_scraper.rb
Find the scraped questions in the docs/ folder

Technologies Used in Ruby crawler:

Ruby - The Language
Nokogiri - For HTML Parsing

Scala-Based Asynchronous Crawler Setup

Install sbt, git, and the latest version of Scala (sbt will handle the Scala update for you)
git clone https://github.com/DengYiping/parliament-scaper.git
sbt run
sbt first downloads the necessary dependencies automatically and then runs the script.

Technologies Used in Scala crawler:

Scala: a functional programming language on the JVM
Akka: an effective framework for asynchronous, non-blocking, event-driven programming in Scala
Spray-client: a lightweight HTTP client based on the Akka actor model

Python-Based Crawler Setup

Install the requirements for this crawler with pip install -r requirements.txt
Run the crawler with python eu_scraper.py

Technologies Used in Python Crawler:

Requests library
lxml library for DOM traversal

Python-Async Parser Setup

Create a virtual environment inside the python-async folder with virtualenv --python=python3.4 venv
Activate your virtual environment with source venv/bin/activate
Install the requirements with pip install -r requirements.txt
Run the parser with python parser.py

Changing the parser behavior

Change YEARS_TO_PARSE to parse data from different years
Change FOLDER_TO_DOWNLOAD to change the name of the folder the data is downloaded into
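
As a rough illustration, the two settings could look like the sketch below, assuming they are module-level constants in parser.py. The example values and the output_path helper are hypothetical and are not taken from the project.

```python
import os

# Assumed example values; the real defaults in parser.py may differ.
YEARS_TO_PARSE = [2014, 2015]      # years whose data should be parsed
FOLDER_TO_DOWNLOAD = "downloads"   # folder the data is downloaded into


def output_path(year, filename):
    """Hypothetical helper: build the target path for a file of a given year."""
    return os.path.join(FOLDER_TO_DOWNLOAD, str(year), filename)
```

With these assumed values, a file from 2015 would be placed under downloads/2015/.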

Technologies Used in Python-async parser:

Requests + requests-futures for async requests
threading for async downloading
beautifulsoup4 for DOM parsing
tqdm for progress bar
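
The general pattern of downloading several pages concurrently can be sketched as follows. This is not the parser's actual code: it uses the standard library's concurrent.futures in place of requests-futures so that it runs without network access or third-party packages, and fetch is a stand-in for a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    """Stand-in for an HTTP GET; returns fake page content for the URL."""
    return "<html>{}</html>".format(url)


def download_all(urls, max_workers=4):
    """Fetch every URL on a thread pool and return a {url: content} dict."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

In a real run, fetch would issue the request and the loop over completed futures is where a progress bar could be updated.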

Python-Based Scraper (pol's scraper)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download based on the number of questions to be scraped.
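
The page-count calculation can be sketched as a ceiling division. The function name and the page size of 10 are assumptions for illustration; the site's actual page size and the script's own implementation may differ.

```python
import math


def pages_to_download(num_questions, per_page=10):
    """Return how many result pages are needed to cover num_questions.

    per_page is an assumed page size, not the site's actual value.
    """
    if num_questions <= 0:
        return 0
    return math.ceil(num_questions / per_page)
```

Under the assumed page size of 10, pages_to_download(25) returns 3.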

Install the requirements with pip install -r requirements.txt
Run the scraper with python scraper.py

Scrape it all - Generic Scraper (pol's scraper 2)

A generic scraper that covers all years and all languages and scrapes the entire database.

Install the requirements with pip install -r requirements.txt
Run the scraper with python scrape_it_all.py
