  • Stars: 315
  • Rank: 132,951 (Top 3%)
  • Language: Python
  • Created: over 9 years ago
  • Updated: about 7 years ago

Repository Details

Python web scraping framework

Cyborg

Cyborg is an asyncio-based Python 3 web scraping framework that helps you write programs to extract information from websites by reading and inspecting their HTML.

What?

Scraping websites for data can be fairly complex when you are dealing with data spread across multiple pages, request limits and error handling. Cyborg aims to handle all of this for you transparently, so that you can focus on the actual extraction of data rather than everything around it. It does this by helping you break the process down into smaller chunks, which can be combined into a Pipeline. For example, below is a Pipeline that scrapes takeaway reviews from Just-Eat (the complete example can be found in examples/just-eat):

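import string
import sys

# Job, unique, scrape_places and scrape_reviews are defined by Cyborg and the
# just-eat example (their imports are omitted here).
with open("output.json", "w") as output_fd:
    # Chain the stages together with "|"
    pipeline = Job("ReviewScraper") | scrape_places | unique("id") | scrape_reviews.parallel(5)
    pipeline < string.ascii_lowercase  # feed the letters a-z in as input
    pipeline > output_fd  # send the results to the output file

    pipeline.monitor() > sys.stdout  # send monitoring output to stdout

    pipeline.run_until_complete()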

The pipeline has several stages:

  1. scrape_places - This scrapes the list of takeaways for a particular area. The area is looked up by the first letter of the postcode, so we brute-force it by feeding in the letters a-z (pipeline < string.ascii_lowercase).

  2. unique('id') - Takeaways may serve more than one area, so this filters out duplicate takeaways based on their ID.

  3. scrape_reviews.parallel(5) - This starts 5 parallel tasks, each of which scrapes the reviews for a particular takeaway.
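
To make the last two stages more concrete, here is a minimal sketch of the ideas behind them using only the standard library. It is not Cyborg's implementation: the helper names (unique as written here, run_parallel, fake_scrape_reviews, demo) are hypothetical and exist only for illustration, and the "scraper" is a stand-in that performs no network requests.

```python
import asyncio


def unique(key):
    """Return a predicate that is True only the first time a `key` value is seen."""
    seen = set()

    def keep(item):
        value = item[key]
        if value in seen:
            return False
        seen.add(value)
        return True

    return keep


async def run_parallel(worker, items, limit):
    """Run `worker` over `items` with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(item) for item in items))


async def fake_scrape_reviews(place):
    # Hypothetical stand-in for a real HTTP request and HTML parse.
    await asyncio.sleep(0)
    return {"id": place["id"], "reviews": []}


def demo():
    # Takeaway 1 appears twice because it serves two areas.
    places = [{"id": 1}, {"id": 2}, {"id": 1}]
    keep = unique("id")
    deduplicated = [place for place in places if keep(place)]
    return asyncio.run(run_parallel(fake_scrape_reviews, deduplicated, limit=5))


if __name__ == "__main__":
    print(demo())
```

The semaphore caps concurrency in the same spirit as parallel(5), and the set-based predicate mirrors what unique('id') is described as doing. How Cyborg actually schedules tasks or tracks seen IDs may well differ.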
