• Stars
    star
    414
  • Rank 104,550 (Top 3 %)
  • Language
    Python
  • License
    GNU Affero Genera...
  • Created almost 3 years ago
  • Updated 4 months ago


Repository Details

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

dude uncomplicated data extraction

Dude is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.

Installation

To install, run the following from a terminal.

pip install pydude
playwright install  # Install playwright binaries for Chrome, Firefox and Webkit.

Minimal web scraper

The simplest web scraper will look like this:

from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

The example above gets all the hyperlink elements in a page and calls the handler function get_link() for each element.

How to run the scraper

You can run your scraper from the terminal by passing the URLs, the output filename of your choice and the paths to your Python scripts to the dude scrape command.

dude scrape --url "<url>" --output data.json path/to/script.py

The output in data.json should contain the extracted URLs together with metadata fields, which are prefixed with an underscore.

[
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 0,
    "url": "/url-1.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 1,
    "url": "/url-2.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 2,
    "url": "/url-3.html"
  }
]
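Because the extracted "url" values are relative and each record carries its "_page_url", a short post-processing step can resolve them to absolute URLs and drop the metadata. The sketch below uses only the standard library and a shortened copy of the sample output; the helper names `absolute_urls` and `strip_metadata` are hypothetical and not part of Dude.

```python
import json
from urllib.parse import urljoin

# Shortened copy of the sample output shown above.
raw = """
[
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/",
   "_group_id": 4502003824, "_group_index": 0, "_element_index": 0,
   "url": "/url-1.html"},
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/",
   "_group_id": 4502003824, "_group_index": 0, "_element_index": 1,
   "url": "/url-2.html"}
]
"""

records = json.loads(raw)


def absolute_urls(rows):
    """Resolve each extracted URL against the page it was found on."""
    return [urljoin(row["_page_url"], row["url"]) for row in rows]


def strip_metadata(rows):
    """Keep only the fields that are not prefixed with an underscore."""
    return [
        {key: value for key, value in row.items() if not key.startswith("_")}
        for row in rows
    ]


resolved = absolute_urls(records)
cleaned = strip_metadata(records)
print(resolved)
print(cleaned)
```

When working with a real run, replace the inline string with the contents of data.json.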

Changing the output to --output data.csv writes the same records in CSV format instead.

Features

  • Simple Flask-inspired design - build a scraper with decorators.
  • Uses Playwright API - run your scraper in Chrome, Firefox and Webkit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
  • Data grouping - group related results.
  • URL pattern matching - run functions on matched URLs.
  • Priority - reorder functions based on priority.
  • Setup function - enable setup steps (clicking dialogs or login).
  • Navigate function - enable navigation steps to move to other pages.
  • Custom storage - option to save data to other formats or database.
  • Async support - write async handlers.
  • Option to use other parser backends aside from Playwright.
  • Option to follow all links indefinitely (Crawler/Spider).
  • Events - attach functions to startup, pre-setup, post-setup and shutdown events.
  • Option to save data on every page.
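As an illustration of how URL pattern matching and priority could interact, the following sketch filters registered handlers by a regular expression and orders them by priority. It is a toy model, not Dude's API: the `register` decorator, its `url_pattern` and `priority` parameters, and the rule that a lower number runs first are all assumptions made for this example.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Rule(NamedTuple):
    priority: int
    url_pattern: Optional[str]
    handler: Callable


# Hypothetical registry; not Dude's internals.
_RULES: List[Rule] = []


def register(url_pattern: Optional[str] = None, priority: int = 100) -> Callable:
    """Register a handler, optionally restricted to URLs matching a regex."""

    def decorator(func: Callable) -> Callable:
        _RULES.append(Rule(priority, url_pattern, func))
        return func

    return decorator


def handlers_for(url: str) -> List[str]:
    """Return names of handlers that apply to url, lowest priority number first."""
    matching = [
        rule
        for rule in _RULES
        if rule.url_pattern is None or re.search(rule.url_pattern, url)
    ]
    # sorted() is stable, so equal priorities keep registration order.
    return [rule.handler.__name__ for rule in sorted(matching, key=lambda r: r.priority)]


@register(priority=2)
def generic(element):
    return {"text": element}


@register(url_pattern=r"/blog/", priority=1)
def blog_only(element):
    return {"post": element}


@register(url_pattern=r"/shop/", priority=0)
def shop_only(element):
    return {"item": element}


print(handlers_for("https://example.com/blog/post"))
print(handlers_for("https://example.com/shop/item"))
```

See the project documentation for the actual decorator arguments that Dude exposes for these features.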

Supported Parser Backends

By default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with. It is possible to use parser backends like BeautifulSoup4, Parsel, lxml, and Selenium.

Here is a summary of the features supported by each parser backend.

Parser Backend | Supports Sync? | Supports Async? | Selectors: CSS | Selectors: XPath | Selectors: Text | Selectors: Regex | Setup Handler | Navigate Handler | Comments
Playwright     | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
BeautifulSoup4 | ✅ | ✅ | ✅ | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
Parsel         | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 |
lxml           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 |
Pyppeteer      | 🚫 | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ | ✅ | Not supported from 0.23.0
Selenium       | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ | ✅ |
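When choosing a backend, the matrix above can be encoded as data and queried. The sketch below transcribes the table into a dictionary of feature sets; the feature labels ("sync", "regex", and so on) and the `backends_supporting` helper are names invented for this example. Pyppeteer is included as listed, although the table notes it is not supported from 0.23.0.

```python
# Feature matrix transcribed from the table above.
SUPPORT = {
    "Playwright": {"sync", "async", "css", "xpath", "text", "regex", "setup", "navigate"},
    "BeautifulSoup4": {"sync", "async", "css"},
    "Parsel": {"sync", "async", "css", "xpath", "text", "regex"},
    "lxml": {"sync", "async", "css", "xpath", "text", "regex"},
    "Pyppeteer": {"async", "css", "xpath", "text", "setup", "navigate"},
    "Selenium": {"sync", "async", "css", "xpath", "text", "setup", "navigate"},
}


def backends_supporting(*features: str) -> list:
    """Return the backends (in table order) that support every given feature."""
    wanted = set(features)
    return [name for name, supported in SUPPORT.items() if wanted <= supported]


regex_backends = backends_supporting("regex")
sync_navigators = backends_supporting("sync", "navigate")
print(regex_backends)
print(sync_navigators)
```

For example, a scraper that relies on regex selectors is limited to Playwright, Parsel or lxml, while one that needs a navigate handler with sync code is limited to Playwright or Selenium.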

Using the Docker image

Pull the docker image using the following command.

docker pull roniemartinez/dude

Assuming that script.py exists in the current directory, run Dude using the following command.

docker run -it --rm -v "$PWD":/code roniemartinez/dude dude scrape --url <url> script.py

Documentation

Read the complete documentation at https://roniemartinez.github.io/dude/. All the advanced and useful features are documented there.

Requirements

  • ✅ Any dude should know how to work with selectors (CSS or XPath).
  • ✅ Familiarity with any of the backends that you love (see Supported Parser Backends).
  • ✅ Python decorators... you'll live, dude!

Why name this project "dude"?

  • ✅ A recursive acronym looks nice.
  • ✅ Adding "uncomplicated" (like ufw) to the name says it is a very simple framework.
  • ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊

Author

Ronie Martinez

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Ronie Martinez

🚧 💻 📖 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

More Repositories

1

latex2mathml

Pure Python library for LaTeX to MathML conversion
Python
176
star
2

libqpsd

PSD (Photoshop Document) & PSB (Photoshop Big) Plugin for Qt/C++ (Qt4/Qt5)
C++
109
star
3

real-time-charts-with-flask

Sample application for the blog "Creating Real-Time Charts with Flask"
HTML
73
star
4

real-time-charts-with-fastapi

Sample application for the blog "Creating Real-Time Charts with FastAPI"
HTML
63
star
5

tauri-plugin-htmx

htmx plugin for Tauri
JavaScript
44
star
6

pulsars

A Tauri-based spreadsheet
Rust
33
star
7

browsers

Python library for detecting and launching browsers
Python
24
star
8

amortization

Python library for calculating amortizations and generating amortization schedules
Python
21
star
9

IPToCC

Get country code of IPv4/IPv6 address. Address lookup is done offline.
Python
17
star
10

HumanFramework

Human Framework: Test Automation Framework for Humans™
Python
8
star
11

docker-django-template

Template repository for a Docker+Django project
Python
6
star
12

qdsvtablemodel

Qt/C++ class for reading and writing CSV (Comma-Separated Values), TSV(Tab-Separated Values), and other DSV(Delimiter-Separated Values) files.
C++
5
star
13

DocCron

Schedule with Docstrings
Python
4
star
14

rate-limiting-with-python-and-redis

Sample application for the blog "Rate Limiting with Python and Redis"
Python
3
star
15

qpixmapwidget

QWidget-based class used for viewing QPixmaps
C++
2
star
16

rasam

Rasa Improved
Python
2
star
17

qpspplugin

PSP (PaintShop Pro) Plugin for Qt/C++ (Qt4/Qt5)
C++
2
star
18

2-domains-1-flask

Sample application for the blog Two Domains, One Flask
Python
2
star
19

flask-inliner

Flask-Inliner converts CSS <style> blocks to inline style attributes
Python
2
star
20

django_htmx_minesweeper

Django + HTMX Minesweeper
Python
1
star
21

qbox2d

An attempt to use Erin Catto's Box2D (http://www.box2d.org) API with Qt/C++ Graphics View Framework
C++
1
star
22

libqpcx

Qt C++ plugin for ZSoft's PCX (Personal Computer Exchange) and DCX (Multipage PCX) image file format
C++
1
star
23

wet

Web Extension Template
JavaScript
1
star
24

SiteSummarizerBot

I am a bot that summarizes content of a URL-only submission on Reddit
Python
1
star