• Stars: 614
• Rank: 70,964 (top 2%)
• Language: Rust
• License: Apache License 2.0
• Created: over 3 years ago
• Updated: about 1 year ago


Repository Details

crawl and scrape web pages in rust

voyager


With voyager you can easily extract structured data from websites.

Write your own crawler/scraper with voyager following a state machine model.

Example

The examples use tokio as their runtime, so your Cargo.toml could look like this:

[dependencies]
voyager = { version = "0.1" }
tokio = { version = "1", features = ["full"] }
# provides `StreamExt`, used to drive the collector stream in the example below
futures = "0.3"

Declare your own Scraper and model

// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
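
The collector example further down constructs the scraper with HackernewsScraper::default(), so the struct needs a Default impl that builds the selectors. A minimal sketch, assuming current Hacker News markup for the CSS selectors and an arbitrary page limit (both are assumptions, not part of voyager):

impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
            // the CSS selectors are assumptions about Hacker News's current markup
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse(".titleline > a").unwrap(),
            comment_selector: Selector::parse(".commtext").unwrap(),
            // only crawl the first five index pages
            max_page: 5,
        }
    }
}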

Implement the voyager::Scraper trait

A Scraper consists of two associated types:

  • Output, the type the scraper eventually produces
  • State, the type the scraper can attach to requests and carry along across several requests that eventually lead to an Output

and the scrape callback, which is invoked after each received response.

Based on the state attached to the response, you can supply the crawler with new urls to visit, with or without a state attached to them.

Scraping is done with causal-agent/scraper.

impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit a url for the post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }

                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry))
                }
            }
        }

        Ok(None)
    }
}
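
The Post arm above leaves the actual field extraction elided (// ...). One possible shape for it, as a helper method on the scraper; the selectors, the Html and Url parameters and the helper itself are illustrative assumptions, not voyager API:

use voyager::scraper::{Html, Selector};

impl HackernewsScraper {
    /// Sketch: pull the Entry fields out of the already parsed post page.
    /// `html` is what `response.html()` returned, `url` is the page's url.
    fn extract_entry(&self, html: &Html, url: Url) -> Entry {
        // first text match for a selector, or an empty string
        let text_of = |sel: &Selector| {
            html.select(sel)
                .next()
                .map(|el| el.text().collect::<String>().trim().to_string())
                .unwrap_or_default()
        };
        Entry {
            author: text_of(&self.author_selector),
            title: text_of(&self.title_selector),
            // the story's outbound link, if the title element carries an href
            link: html
                .select(&self.title_selector)
                .next()
                .and_then(|el| el.value().attr("href"))
                .map(|s| s.to_string()),
            url,
        }
    }
}

Inside the Post arm the call would then be something like self.extract_entry(&html, url), with the page's url taken from the response.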

Setup and collect all the output

Configure the crawler via CrawlerConfig:

  • Allow/Block list of Domains
  • Delays between requests
  • Whether to respect the Robots.txt rules

Feed your config and an instance of your scraper to the Collector that drives the Crawler and forwards the responses to your Scraper.

use voyager::scraper::Selector;
use voyager::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    
    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );
    
    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }
    
    Ok(())
}

See examples for more.

Inject async calls

Sometimes it might be helpful to execute some other calls first, e.g. to fetch a token. You can submit async closures to the crawler to manually get a response and inject a state, or to drive a state to completion.

fn scrape(
    &mut self,
    response: Response<Self::State>,
    crawler: &mut Crawler<Self>,
) -> Result<Option<Self::Output>> {

    // inject your custom crawl function that produces a `reqwest::Response` and `Self::State` which will get passed to `scrape` when resolved.
    crawler.crawl(move |client| async move {
        let state = response.state;
        let auth = client.post("some auth end point").send().await?.json().await?;
        // do other async tasks etc..
        let new_resp = client.get("the next html page").send().await?;
        Ok((new_resp, state))
    });
    
    // submit a crawling job that completes to `Self::Output` directly
    crawler.complete(move |client| async move {
        // do other async tasks to create a `Self::Output` instance
        let output = Self::Output{/*..*/};
        Ok(Some(output))
    });
    
    Ok(None)
}

Recover a state that got lost

If the crawler encounters an error, e.g. due to a failed or disallowed http request, the error is reported as a CrawlError, which carries the last valid state. The error can then be downcast.

let mut collector = Collector::new(HackernewsScraper::default(), config);

while let Some(output) = collector.next().await {
  match output {
    Ok(post) => {/**/}
    Err(err) => {
      // recover the state by downcasting the error
      if let Ok(err) = err.downcast::<CrawlError<<HackernewsScraper as Scraper>::State>>() {
        let last_state = err.state();
      }
    }
  }
}

Licensed under the Apache License 2.0.

More Repositories

1. chromiumoxide - Chrome Devtools Protocol rust API (Rust, 522 stars)
2. cargo-memex - compile rust code into memes (Rust, 245 stars)
3. defi-bindings - rust bindings for various defi projects (Rust, 63 stars)
4. diffeq - Basic Ordinary Differential Equation solvers (Rust, 49 stars)
5. extrablatt - Article scraping in rust (Rust, 46 stars)
6. cairo-lang-rs (Rust, 38 stars)
7. libp2p-bittorrent-protocol - Rust Implementation of the BitTorrent protocol for libp2p (Rust, 23 stars)
8. hyperswarm-dht - rust implementation of the DHT powering the HyperSwarm stack (Rust, 17 stars)
9. str-distance - String Distances in rust (Rust, 13 stars)
10. crossref-rs - A rust client for the Crossref-API (Rust, 13 stars)
11. hypertrie - Secure, distributed single writer key/value store (Rust, 11 stars)
12. hardhat-anvil (TypeScript, 9 stars)
13. plcopen-xml-xcore - The PlcOpen Xml Standard implemented as Xcore model and with full gradle and maven support (Java, 9 stars)
14. archiveis-rs - A simple rust wrapper for the archive.is capturing service (Rust, 6 stars)
15. kaggle-rs - unofficial rust bindings for the kaggle api (Rust, 6 stars)
16. aoc2021 (Python, 4 stars)
17. nom-sparql - Sparql parser written in rust using nom (Rust, 4 stars)
18. rustika - rust bindings to the Apache Tika™ REST services (Rust, 4 stars)
19. tia-openness-xcore - Xcore model of the TIA Openness documents (Java, 3 stars)
20. tokio-interval-repro (Rust, 3 stars)
21. forge-ethers-rs-template - a template for starting an ethers-rs project backed by forge (3 stars)
22. substrate-exchange-xcmp - Deposit, withdraw and swap (parachain) assets (Rust, 2 stars)
23. rust-plexapi - Rust bindings for the Plex API (Rust, 2 stars)
24. template - basic rust template (Rust, 2 stars)
25. rust-s7 (Rust, 2 stars)
26. plcopen-xml - java parser for the plcopen-xml standard (Java, 1 star)
27. callgraph-graphml (Java, 1 star)
28. rust-ads - Implementation of Beckhoff's ADS protocol in pure rust (Rust, 1 star)
29. windows-envs (Rust, 1 star)
30. smolgrad (Rust, 1 star)