• This repository has been archived on 11/Sep/2019
  • Stars: 102
  • Rank: 335,584 (Top 7 %)
  • Language: Python
  • License: MIT License
  • Created over 10 years ago
  • Updated over 9 years ago

Repository Details

Python library with common functionality for writing web scrapers

scrapekit

Did you know the entire web was made of data? You probably did. Scrapekit helps you get that data with simple Python scripts. Built on requests, the library handles caching, threading and logging.

See the full documentation.

Example

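from scrapekit import Scraper

# Create a named scraper instance
scraper = Scraper('example')

@scraper.task
def get_index():
    # Fetch the page, parse it as HTML and yield each table row
    url = 'http://databin.pudo.org/t/b2d9cf'
    doc = scraper.get(url).html()
    for row in doc.findall('.//tr'):
        yield row

@scraper.task
def get_row(row):
    # Receive one row at a time and print its cells
    columns = row.findall('./td')
    print(columns)

# Chain the tasks: each row yielded by get_index is handed to get_row
pipeline = get_index | get_row
if __name__ == '__main__':
    pipeline.run()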
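
To make the chaining in the example more concrete, here is a minimal plain-Python sketch of how a `|` operator can link generator-based tasks. It is an illustration only, not scrapekit's implementation: the `Task` class and the sample rows are hypothetical, and the sketch runs sequentially with no caching, threading or logging.

```python
import inspect


class Task:
    """Hypothetical stand-in for a task; wraps a function so tasks chain with |."""

    def __init__(self, fn):
        self.fn = fn
        self.next_task = None

    def __or__(self, other):
        # Append `other` to the end of the chain and return the head task
        tail = self
        while tail.next_task is not None:
            tail = tail.next_task
        tail.next_task = other
        return self

    def run(self, *args):
        result = self.fn(*args)
        # A generator task emits many items; a plain function emits at most one
        items = result if inspect.isgenerator(result) else [result]
        for item in items:
            if self.next_task is not None and item is not None:
                self.next_task.run(item)


collected = []


@Task
def get_index():
    # Made-up rows standing in for the <tr> elements of a real page
    for row in (['a', 1], ['b', 2]):
        yield row


@Task
def get_row(row):
    collected.append(tuple(row))


pipeline = get_index | get_row
pipeline.run()
print(collected)
```

Each item yielded by the first task becomes the argument of the next one, which is the same shape as `get_index | get_row` in the example. A real library would typically put a queue and worker threads between the two steps.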

Works well with

Scrapekit doesn't aim to provide all functionality necessary for scraping. Specifically, it doesn't address HTML parsing, data storage and data validation. For these needs, check the following libraries:

  • lxml for HTML/XML parsing; much faster and more flexible than BeautifulSoup.
  • dataset is a sister library of scrapekit that simplifies storing semi-structured data in SQL databases.
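
As a rough sketch of the parse-then-store step these libraries cover, the snippet below turns table rows into dicts and writes them to a SQL table. To stay runnable without extra installs it makes two substitutions, both assumptions of this sketch: the standard library's `xml.etree.ElementTree` stands in for lxml (lxml.etree offers the same `findall` and `itertext` calls), and `sqlite3` stands in for dataset. The sample markup, the `payments` table and the helper names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch

# Hypothetical, well-formed sample markup (not taken from a real page)
SAMPLE = """
<table>
  <tr><th>name</th><th>amount</th></tr>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""


def parse_rows(markup):
    """Return one dict per data row, keyed by the header cells."""
    table = ET.fromstring(markup)
    rows = table.findall('.//tr')
    headers = [''.join(th.itertext()).strip() for th in rows[0].findall('./th')]
    records = []
    for row in rows[1:]:
        cells = [''.join(td.itertext()).strip() for td in row.findall('./td')]
        records.append(dict(zip(headers, cells)))
    return records


def store_rows(conn, records):
    """Insert the parsed records into a simple two-column table."""
    conn.execute('CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)')
    conn.executemany(
        'INSERT INTO payments (name, amount) VALUES (:name, :amount)',
        records,
    )
    conn.commit()


records = parse_rows(SAMPLE)
conn = sqlite3.connect(':memory:')
store_rows(conn, records)
total = conn.execute('SELECT SUM(amount) FROM payments').fetchone()[0]
print(records, total)
```

Real-world HTML is rarely well-formed XML, which is one reason to use lxml's HTML parser rather than ElementTree outside of a sketch like this.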

Existing tools

  • Scrapy is a much more mature and comprehensive framework for developing scrapers. On the other hand, it requires you to develop scrapers within its class system. This can be too heavyweight for a simple script to grab data off a web site.
  • scrapelib is a thin wrapper around requests that does throttling, retries and caching.
  • MechanicalSoup binds BeautifulSoup and requests into an imperative, stateful API.

Credits and license

Scrapekit is licensed under the terms of the MIT license, which is also included in LICENSE. It was developed through projects of ICFJ, ANCIR and ICIJ.

More Repositories

  1. storyweb (CSS, 50 stars): A contextual news development environment.
  2. extractors (Python, 37 stars): Re-usable wrapper scripts for text document extractors.
  3. openinterests.eu (HTML, 37 stars): Exploring power and influence in the European Union by combining information from a variety of official EU data sources related to lobbying, expert groups, expenditure and procurement.
  4. jsonmapping (Python, 27 stars): Transform flat data structures into nested object graphs matching JSON schema definitions.
  5. archivekit (Python, 14 stars): ArchiveKit manages data and documents during ETL processes, either on a local file system or on S3.
  6. journoid (Python, 12 stars): Monitor datasets to check them for well-known contents (e.g. companies, people, etc.) and receive email notifications.
  7. loadkit (Python, 11 stars): LoadKit supports Extract, Transform, Load processes based on ArchiveKit buckets.
  8. grano-old (Python, 10 stars): SNA hacks
  9. lobbyfacts (JavaScript, 8 stars): [REPLACED BY pudo/openinterests.eu]
  10. kompromatron (CSS, 7 stars): machtVZ shows the connections between politicians, associations and companies that are documented through party donations, outside income and the register of associations.
  11. handelsregister (Python, 6 stars): Scraper for Handelsregister.de
  12. dcat-tools (Python, 5 stars): Tools for crawling/generating, indexing and searching DCat dataset descriptions.
  13. lobbytransparency (JavaScript, 5 stars): [REPLACED BY pudo/openinterests.eu]
  14. grano-ql (Python, 5 stars): DO NOT USE, THIS IS NOW IN CORE
  15. civic-software-checklist (5 stars): REPLACED - by CivicPatterns.org
  16. voteit-server (Python, 4 stars): VoteIt, the best Poplus component there is.
  17. docstash (Python, 4 stars): [REPLACED] All relevant functionality is moving to pudo/barn (which does documents but also data files, and manages derived versions of files)
  18. genesis (JavaScript, 3 stars): Tools regarding the scraping and processing of the German statistical database, GENESIS
  19. eu-budget-scraper (Python, 3 stars): Scraper and ETL tools for the European Union Budget
  20. opentext (JavaScript, 3 stars): OpenText Project
  21. dpkg-eu-fts (Python, 3 stars): [REPLACED BY pudo/openinterests.eu]
  22. rapex (3 stars): Data project re EU RAPEX
  23. mqlparser (Python, 3 stars): Parser for queries according to the Metaweb Query Language (MQL)
  24. parlabla (JavaScript, 2 stars): Analyze speeches from the German parliament
  25. spon-scraper (Python, 2 stars): spiegel.de: Grabbing all the contents
  26. spon-api (JavaScript, 2 stars): SPIEGEL ONLINE Content API
  27. grano-elasticsearch (Python, 2 stars): ElasticSearch-based full-text indexing support for grano.
  28. legalese (Python, 2 stars): Handling legal documents
  29. osmine (Python, 2 stars): OpenSpending rudimentary datamining support.
  30. storypull (JavaScript, 2 stars): Pull requests for news.
  31. mitzeichner (Python, 1 star): Scraper for the E-Petitions system of the German parliament
  32. datameta (Python, 1 star): A meta search engine and archival service for open data.
  33. regenesis-node (CoffeeScript, 1 star): Node.js based ReGENESIS tool.
  34. grano-reconcile (Python, 1 star): DO NOT USE, THIS IS NOW IN CORE
  35. slbackup (Python, 1 star): Create backups of ScribbleLive feeds
  36. grano-neo4j (Python, 1 star): Neo4J as a secondary (denormalized) datastore for grano
  37. corpcanvas (JavaScript, 1 star): Corporate Networks Canvas
  38. spendb.prototype (Python, 1 star): Lightweight OpenSpending clone for experimentation.
  39. dpkg-eu-cap (JavaScript, 1 star): FarmSubsidy.org Data Importer
  40. eutr (Python, 1 star): EU Transparency Register