  • Stars: 917
  • Rank: 48,497 (Top 1.0 %)
  • Language: Elixir
  • Created almost 8 years ago
  • Updated 8 months ago

Repository Details

A high performance web crawler / scraper in Elixir.

Crawler

A high performance web crawler / scraper in Elixir, with worker pooling and rate limiting via OPQ.

Features

  • Crawl assets (JavaScript, CSS and images).
  • Save to disk.
  • Hook for scraping content.
  • Restrict crawlable domains, paths or content types.
  • Limit concurrent crawlers.
  • Limit rate of crawling.
  • Set the maximum crawl depth.
  • Set timeouts.
  • Set the retry strategy.
  • Set the crawler's user agent.
  • Manually pause/resume/stop the crawler.

See Hex documentation.

Architecture

The architecture diagram (not reproduced here) gives a very high-level view of how Crawler works.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

  1. Use Crawler.Store.
  2. Tap into the registry, Crawler.Store.DB.
  3. Use your own scraper.
  4. If the :save_to option is set, pages are saved to disk in addition to the places mentioned above.
  5. Provide your own custom parser and manage how data is stored and accessed yourself.

Configurations

Option | Type | Default Value | Description
:assets | list | [] | Whether to fetch any asset files, available options: "css", "js", "images".
:save_to | string | nil | When provided, the path for saving crawled pages.
:workers | integer | 10 | Maximum number of concurrent workers for crawling.
:interval | integer | 0 | Rate limit control - number of milliseconds before crawling more pages, defaults to 0 which is effectively no rate limit.
:max_depths | integer | 3 | Maximum nested depth of pages to crawl.
:timeout | integer | 5000 | Timeout value for fetching a page, in ms. Can also be set to :infinity, useful when combined with Crawler.pause/1.
:user_agent | string | Crawler/x.x.x (...) | User-Agent value sent by the fetch requests.
:url_filter | module | Crawler.Fetcher.UrlFilter | Custom URL filter, useful for restricting crawlable domains, paths or content types.
:retrier | module | Crawler.Fetcher.Retrier | Custom fetch retrier, useful for retrying failed crawls.
:modifier | module | Crawler.Fetcher.Modifier | Custom modifier, useful for adding custom request headers or options.
:scraper | module | Crawler.Scraper | Custom scraper, useful for scraping content as soon as the parser parses it.
:parser | module | Crawler.Parser | Custom parser, useful for handling parsing differently or to add extra functionalities.
:encode_uri | boolean | false | When set to true apply the URI.encode to the URL to be crawled.
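
As a rough illustration, several of the options above can be combined into one keyword list. The values are arbitrary examples, the /tmp/crawled directory is hypothetical, and the call is guarded so that it only runs when the :crawler package is actually loaded.

```elixir
# Example values only; tune them for the site being crawled.
opts = [
  workers: 5,              # at most 5 concurrent workers
  interval: 1_000,         # rate limit control, in milliseconds
  max_depths: 2,           # maximum nested depth of pages to crawl
  timeout: 10_000,         # per-page fetch timeout, in ms
  assets: ["css", "js"],   # also fetch these asset types
  save_to: "/tmp/crawled"  # hypothetical directory for saved pages
]

# Guarded so the snippet also runs in a session without the :crawler package.
if Code.ensure_loaded?(Crawler) do
  Crawler.crawl("http://elixir-lang.org", opts)
end
```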

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
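
A minimal sketch of what a custom retrier might look like, assuming the behaviour expects a perform/2 callback that receives a zero-arity fetch function plus the options map, and that a failed fetch returns an {:error, reason} tuple. Check Crawler.Fetcher.Retrier for the exact contract. The module name, attempt count and delay are illustrative, and the @behaviour line is left as a comment so the snippet compiles without the :crawler package.

```elixir
defmodule FixedDelayRetrier do
  # With :crawler installed you would also declare:
  #   @behaviour Crawler.Fetcher.Retrier.Spec

  @max_attempts 3
  @delay_ms 50

  # Assumed callback shape: perform(fetch_url, opts), where fetch_url is a
  # zero-arity function that performs the actual request.
  def perform(fetch_url, _opts), do: attempt(fetch_url, @max_attempts)

  # Retry on {:error, _} while attempts remain; return any other result as-is.
  defp attempt(fetch_url, attempts_left) do
    case fetch_url.() do
      {:error, _reason} when attempts_left > 1 ->
        Process.sleep(@delay_ms)
        attempt(fetch_url, attempts_left - 1)

      result ->
        result
    end
  end
end
```

Such a module would then be passed through the :retrier option, for example Crawler.crawl(url, retrier: FixedDelayRetrier).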

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
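
For instance, restricting crawlable domains could be sketched as follows, assuming the behaviour expects a filter/2 callback that returns {:ok, boolean}. The module name and the host list are hypothetical, and the @behaviour line is left as a comment so the snippet compiles without the :crawler package.

```elixir
defmodule AllowedHostsFilter do
  # With :crawler installed you would also declare:
  #   @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Hypothetical allow-list of hosts that may be crawled.
  @allowed_hosts ["elixir-lang.org", "hexdocs.pm"]

  # Assumed callback shape: filter(url, opts) returning {:ok, boolean}.
  def filter(url, _opts) do
    host = URI.parse(url).host
    {:ok, host in @allowed_hosts}
  end
end
```

It would be enabled with the :url_filter option, for example Crawler.crawl(url, url_filter: AllowedHostsFilter).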

Scraper

See Crawler.Scraper.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
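
A scraper hook might look roughly like the sketch below. It assumes a scrape/1 callback that receives the crawled page (something with :url and :body fields) and must return {:ok, page}; the callback's exact arity and the page shape should be confirmed against Crawler.Scraper. The module name and the title-extraction logic are illustrative only.

```elixir
defmodule TitleScraper do
  # With :crawler installed you would also declare:
  #   @behaviour Crawler.Scraper.Spec

  # Assumed callback shape: scrape(page) returning {:ok, page}.
  # Prints the page title, if any, and passes the page through unchanged.
  def scrape(%{url: url, body: body} = page) do
    case extract_title(body) do
      nil -> :ok
      title -> IO.puts("#{url}: #{title}")
    end

    {:ok, page}
  end

  # Returns the trimmed contents of the first <title> element, or nil.
  def extract_title(body) do
    case Regex.run(~r/<title[^>]*>(.*?)<\/title>/is, body, capture: :all_but_first) do
      [title] -> String.trim(title)
      nil -> nil
    end
  end
end
```

A regular expression is enough for a sketch; a real scraper would more likely use an HTML parser.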

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end

Modifier

See Crawler.Fetcher.Modifier.

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
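
As a hedged sketch, a modifier that adds a request header could look like this. It assumes the behaviour defines two callbacks, headers/1 returning a list of {name, value} tuples and opts/1 returning a keyword list of extra request options; verify both against Crawler.Fetcher.Modifier. The module name and header value are hypothetical.

```elixir
defmodule LanguageHeaderModifier do
  # With :crawler installed you would also declare:
  #   @behaviour Crawler.Fetcher.Modifier.Spec

  # Assumed callback: extra headers to send with every fetch request.
  def headers(_opts), do: [{"Accept-Language", "en"}]

  # Assumed callback: extra request options; none are added here.
  def opts(_opts), do: []
end
```

It would be enabled with the :modifier option, for example Crawler.crawl(url, modifier: LanguageHeaderModifier).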

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1, each of which takes the options returned by Crawler.crawl:

{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)

Crawler.resume(opts)

Crawler.stop(opts)

Note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity); otherwise the parser will time out because of unprocessed links.

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see CHANGELOG.md.

Copyright and License

Copyright (c) 2016 Fred Wu

This work is free. You can redistribute it and/or modify it under the terms of the MIT License.

More Repositories

  1. jquery-endless-scroll (CoffeeScript, 838 stars): Endless/infinite scrolling/pagination.
  2. angel_nest (Ruby, 775 stars): Project code name: Angel Nest. :)
  3. api_taster (Ruby, 729 stars): A quick and easy way to visually test your Rails application's API.
  4. simple_bayes (Elixir, 392 stars): A Naive Bayes machine learning implementation in Elixir.
  5. datamappify (Ruby, 332 stars): Compose, decouple and manage domain logic and data persistence separately. Works particularly great for composing form objects!
  6. opq (Elixir, 255 stars): Elixir queue! A simple, in-memory queue with worker pooling and rate limiting in Elixir.
  7. stemmer (Elixir, 149 stars): An English (Porter2) stemming implementation in Elixir.
  8. bustle (Ruby, 93 stars): Activities recording and retrieving using a simple Pub/Sub-like interface.
  9. ruby_decorators (Ruby, 63 stars): Ruby method decorators inspired by Python.
  10. inherited_resources_views (Ruby, 60 stars): Share and DRY up views between resources. Use with Inherited Resources.
  11. jquery-inline-confirmation (JavaScript, 53 stars): Inline Confirmation plugin for jQuery. One of the less obtrusive ways of implementing confirmation dialogues.
  12. toy-robot-elixir (Elixir, 45 stars): The infamous Toy Robot code test done in Elixir.
  13. skinny-coffee-machine (JavaScript, 42 stars): A simple JavaScript state machine with observers, for browsers and Node.js.
  14. kohana-phamlp (PHP, 25 stars): This module is a bridge between the Kohana PHP framework (http://kohanaframework.org/) and the PHamlP library (http://code.google.com/p/phamlp/).
  15. authlite (PHP, 23 stars): Authlite, an auth module for Kohana PHP framework, it offers greater flexibility than the official Auth module.
  16. dotfiles (Shell, 18 stars): My dotfiles
  17. amaze_hands (Ruby, 15 stars): Amaze Hands is an amazing tool developed for analysing Kanban board cards.
  18. kthrottler (PHP, 14 stars): A Kohana port of Action Throtller (for Rails): http://github.com/fredwu/action_throttler
  19. jquery-slideshow-lite (JavaScript, 14 stars): An extremely lightweight slideshow plugin for jQuery.
  20. code-test-2016-cultureamp (Ruby, 13 stars)
  21. README-xplor (10 stars): Fred @ Xplor - how to work with me.
  22. code-test-2016-myob (Ruby, 8 stars)
  23. code-test-2016-trunkplatform (Ruby, 6 stars)
  24. action_throttler (Ruby, 6 stars): An easy to use Rails plugin to quickly throttle application actions based on configurable duration and limit.
  25. app_reset (Ruby, 6 stars): Resets (and if available, seeds) your databases.
  26. yield.rb (Ruby, 5 stars): Aggregated token amounts and values. Supports ApeBoard, YieldWatch, Binance, CoinGecko and more.
  27. layerful (4 stars): Layerful PHP framework.
  28. advent_of_code_2018 (Elixir, 4 stars): https://adventofcode.com/2018/about
  29. security_guard (Ruby, 3 stars): A collection of useful tools for auditing data and performing security checks.
  30. ruby-slim-tmbundle (3 stars): https://github.com/slim-template/ruby-slim.tmbundle
  31. fredwu.me-v3 (JavaScript, 3 stars)
  32. flower (Ruby, 2 stars): Playground to test out the Lotus framework.
  33. kata-poker-hands-elixir (Elixir, 2 stars): A coding kata for comparing poker hands - Elixir version.
  34. jqstub (JavaScript, 1 star): A simple stub library for jQuery / Zepto objects.
  35. project_retard (JavaScript, 1 star): One sale a day e-commerce platform built on Ruby on Rails.
  36. reacraft (Ruby, 1 star)
  37. toy-robot-lolz (Ruby, 1 star): It's art. And it's beautiful.
  38. kata-poker-hands-ruby (Ruby, 1 star): A coding kata for comparing poker hands - Ruby version.
  39. spiky_xml (JavaScript, 1 star): Just a spike on XML parsing in different environments.
  40. code-test-2016-adslot (CoffeeScript, 1 star)