  • Stars: 1,402
  • Rank: 32,465 (Top 0.7%)
  • Language: Python
  • License: MIT License
  • Created over 8 years ago
  • Updated about 7 years ago

Repository Details

Public Data Scraper for Parliament Data for the EU and other Parliaments

parliament-scaper

Ruby Based Crawler Setup

  1. Install git (if not present already)
  2. Clone project using git clone https://github.com/fossasia/parliament-scaper.git
  3. Install Ruby (version >= 2.1) and Bundler
  4. Run bundle install to install the required gems
  5. Run the script using ruby eu_scraper.rb or ./eu_scraper.rb
  6. Find the scraped questions in the docs/ folder

Technologies Used in Ruby crawler:

  1. Ruby - The Language
  2. Nokogiri - For HTML Parsing

Scala-based Asynchronous crawler Setup

  1. Install sbt, git and the latest version of Scala (sbt will do the update for you)
  2. Clone the project using git clone https://github.com/DengYiping/parliament-scaper.git
  3. Run sbt run
  4. sbt first downloads the necessary dependencies automatically and then runs the script.

Technologies Used in Scala crawler:

  1. Scala: a functional programming language on the JVM
  2. Akka: an effective framework for asynchronous, non-blocking and event-driven programming in Scala
  3. Spray-client: a lightweight HTTP client based on the Akka Actor model.

Python Based Crawler Setup

  1. Install the requirements for this crawler with pip install -r requirements.txt
  2. Run $ python eu_scraper.py

Technologies Used in Python Crawler:

  1. Requests library
  2. lxml library for DOM traversal

Python-async parser setup

  1. Create a virtual environment inside the python-async folder with virtualenv --python=python3.4 venv
  2. Activate your virtual environment with source venv/bin/activate
  3. Install all appropriate requirements with pip install -r requirements.txt
  4. Run the parser with $ python parser.py

Changing the parser behavior

  • Change YEARS_TO_PARSE in order to parse data from different years
  • Change FOLDER_TO_DOWNLOAD in order to change the name of the folder to download the data into.
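
As a rough sketch of how these two settings might drive the parser: the names YEARS_TO_PARSE and FOLDER_TO_DOWNLOAD come from the list above, but their values, the helpers output_path_for and planned_paths, and the one-subfolder-per-year layout are assumptions made for illustration, not the repository's actual code.

```python
from pathlib import Path

# Settings named in the README; the values here are illustrative assumptions.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "data"


def output_path_for(year, filename, base=FOLDER_TO_DOWNLOAD):
    """Hypothetical helper: build the path a downloaded file would be saved to."""
    return Path(base) / str(year) / filename


def planned_paths(filenames, years=YEARS_TO_PARSE, base=FOLDER_TO_DOWNLOAD):
    """List every target path for the configured years without touching disk."""
    return [output_path_for(year, name, base) for year in years for name in filenames]


if __name__ == "__main__":
    for path in planned_paths(["question-1.html"]):
        print(path)
```

Under these assumptions, editing the two constants at the top is enough to change which years are covered and where the files would land.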

Technologies Used in Python-async parser:

  1. Requests + requests-futures for async requests
  2. threading for async downloading
  3. beautifulsoup4 for DOM parsing
  4. tqdm for progress bar
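
The thread-based download pattern behind this stack can be sketched with the standard library alone. This is a minimal illustration, not the parser's code: concurrent.futures stands in for requests-futures, and fake_fetch is a hypothetical placeholder for a real HTTP request so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP GET; returns a placeholder body, no network."""
    return "<html>" + url + "</html>"


def download_all(urls, fetch=fake_fetch, max_workers=4):
    """Run fetch(url) for every URL on a thread pool and collect results by URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it was created for.
        futures = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields each future as soon as its download finishes.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    pages = download_all(["page-1", "page-2", "page-3"])
    print(sorted(pages))
```

In a real run, fetch would perform the request, and the as_completed loop is a natural place to update a progress bar such as tqdm.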

Python-Based Scraper (pol's scraper)

This scraper uses the BeautifulSoup package to parse and extract data from parliament's site. The script can also calculate how many pages it has to download based on the number of questions to be scraped.

  1. Install the requirements with pip install -r requirements.txt
  2. Run $ python scraper.py
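
The page-count calculation mentioned above comes down to a ceiling division. The sketch below assumes a fixed number of questions per result page; the function name pages_needed and the page size used in the example are hypothetical, not taken from scraper.py.

```python
import math


def pages_needed(total_questions, per_page):
    """Return how many result pages are needed to cover total_questions items."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Round up so a partially filled last page is still downloaded.
    return math.ceil(total_questions / per_page)


if __name__ == "__main__":
    print(pages_needed(95, 10))
```

For example, 95 questions at an assumed 10 per page need 10 pages, because the last 5 questions still occupy a page of their own.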

Scrape it all - Generic Scraper (pol's scraper 2)

This scraper uses the BeautifulSoup package to parse and extract data from parliament's site. The script can also calculate how many pages it has to download based on the number of docs to be scraped.

Generic Scraper - All years, All languages. Scrapes entire database.

  1. Install the requirements with pip install -r requirements.txt
  2. Run $ python scrape_it_all.py
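
Covering all years and all languages amounts to visiting every (year, language) combination. The following is a minimal sketch of that enumeration; the function build_jobs, the year range and the language codes are assumptions for illustration and are not taken from scrape_it_all.py.

```python
from itertools import product


def build_jobs(years, languages):
    """Pair every year with every language so no combination is skipped."""
    return [(year, lang) for year, lang in product(years, languages)]


if __name__ == "__main__":
    # Illustrative inputs only; the real script defines its own ranges.
    for year, lang in build_jobs([2014, 2015], ["EN", "DE"]):
        print(year, lang)
```

Each resulting pair could then be handed to whatever function downloads one year of documents in one language.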
