  • Stars: 1,395
  • Rank 33,684 (Top 0.7 %)
  • Language: Python
  • License: MIT License
  • Created almost 9 years ago
  • Updated over 7 years ago

Repository Details

Public Data Scraper for Parliament Data for the EU and other Parliaments

parliament-scaper

Ruby Based Crawler Setup

  1. Install git (if not present already)
  2. Clone the project using git clone https://github.com/fossasia/parliament-scaper.git
  3. Install Ruby (version >= 2.1) and Bundler
  4. Run bundle install to install the required gems
  5. Run the script using ruby eu_scraper.rb or ./eu_scraper.rb
  6. Find the scraped questions in the docs/ folder

Technologies Used in Ruby crawler:

  1. Ruby - The Language
  2. Nokogiri - For HTML Parsing

Scala-Based Asynchronous Crawler Setup

  1. Install sbt, git, and the latest version of Scala (sbt will update Scala for you)
  2. Clone the project using git clone https://github.com/DengYiping/parliament-scaper.git
  3. Run sbt run
  4. sbt first downloads the necessary dependencies automatically and then runs the script.

Technologies Used in Scala crawler:

  1. Scala: a functional programming language on the JVM
  2. Akka: an effective framework for asynchronous, non-blocking, event-driven programming in Scala
  3. Spray-client: a lightweight HTTP client based on the Akka actor model

Python Based Crawler Setup

  1. Install the requirements for this crawler with pip install -r requirements.txt
  2. Run $ python eu_scraper.py

Technologies Used in Python Crawler:

  1. Requests library
  2. lxml library for DOM traversal

Python-async parser setup

  1. Create a virtual environment inside the python-async folder with virtualenv --python=python3.4 venv
  2. Activate your virtual environment with source venv/bin/activate
  3. Install all appropriate requirements with pip install -r requirements.txt
  4. Run the parser with $ python parser.py

Changing the parser behavior

  • Change YEARS_TO_PARSE in order to parse data from different years
  • Change FOLDER_TO_DOWNLOAD in order to change the name of the folder to download the data into.
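
A minimal sketch of how these two settings might drive the parser, assuming they are module-level constants in parser.py. The example values, the per-year subfolder layout, and the `target_dirs` helper are hypothetical and are not taken from the repository.

```python
from pathlib import Path

# Assumed example values; the real defaults in parser.py may differ.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "data"


def target_dirs(years, folder):
    """Return one output directory per year under the download folder."""
    return [Path(folder) / str(year) for year in years]


if __name__ == "__main__":
    for directory in target_dirs(YEARS_TO_PARSE, FOLDER_TO_DOWNLOAD):
        print(directory)
```

Keeping both values as constants at the top of the module means a different run only requires editing two lines.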

Technologies Used in Python-async parser:

  1. Requests + requests-futures for async requests
  2. threading for async downloading
  3. beautifulsoup4 for DOM parsing
  4. tqdm for progress bar
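
The pattern behind this stack (concurrent downloads plus progress reporting) can be sketched with the standard library alone. This is not the repository's code: `concurrent.futures.ThreadPoolExecutor` stands in for requests-futures, which by default also runs its requests in a thread pool, a printed counter stands in for tqdm, and `download_all`, `fetch`, and the example URLs are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL in a thread pool and return {url: result}.

    `fetch` is any callable taking a URL; in a real parser it would
    perform the HTTP request (for example with requests).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front so they run concurrently.
        futures = {pool.submit(fetch, url): url for url in urls}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            print(f"{done}/{len(futures)} downloaded")  # stand-in for tqdm
    return results


if __name__ == "__main__":
    # A stub fetcher so the sketch runs without network access.
    pages = download_all(
        ["https://example.org/a", "https://example.org/b"],
        fetch=lambda url: f"<html>{url}</html>",
    )
    print(sorted(pages))
```

Passing the fetcher in as an argument keeps the concurrency logic separate from the HTTP library, so the same loop works with a stub during testing.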

Python-Based Scraper (pol's scraper)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download based on the number of questions to be scraped.

  1. Install the requirements with pip install -r requirements.txt
  2. Run $ python scraper.py
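
The page calculation described for this scraper amounts to a ceiling division. A minimal sketch, assuming a fixed number of questions per result page; the `pages_needed` helper and the page size of 10 are hypothetical, and the real value depends on the parliament's site.

```python
import math


def pages_needed(total_questions, per_page=10):
    """Return how many result pages cover `total_questions` items."""
    if total_questions < 0 or per_page <= 0:
        raise ValueError("total must be non-negative and per_page positive")
    return math.ceil(total_questions / per_page)


if __name__ == "__main__":
    # 95 questions at 10 per page need 10 pages (the last one partly filled).
    print(pages_needed(95))
```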

Scrape it all - Generic Scraper (pol's scraper 2)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download based on the number of documents to be scraped.

Generic Scraper - All years, All languages. Scrapes the entire database.

  1. Install the requirements with pip install -r requirements.txt
  2. Run $ python scrape_it_all.py
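
Covering all years and all languages can be pictured as iterating over every (year, language) combination. The sketch below only illustrates that idea and does not describe how scrape_it_all.py is written; the `scrape_jobs` helper, the year range, and the language codes are hypothetical.

```python
from itertools import product


def scrape_jobs(years, languages):
    """Yield one (year, language) pair per combination to scrape."""
    yield from product(years, languages)


if __name__ == "__main__":
    # Hypothetical inputs; the real script decides its own ranges.
    for year, lang in scrape_jobs(range(2014, 2016), ["EN", "DE"]):
        print(year, lang)
```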
