# Crawler
A high-performance web crawler/scraper in Elixir, with worker pooling and rate limiting via OPQ.
## Features
- Crawl assets (JavaScript, CSS and images).
- Save to disk.
- Hook for scraping content.
- Restrict crawlable domains, paths or content types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set the maximum crawl depth.
- Set timeouts.
- Set retry strategy.
- Set the crawler's user agent.
- Manually pause/resume/stop the crawler.
See the [Hex documentation](https://hexdocs.pm/crawler).
## Architecture

*(Architecture diagram: a very high-level overview of how Crawler works.)*
## Usage

```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
```
There are several ways to access the crawled page data:

- Use `Crawler.Store` (see the sketch after this list)
- Tap into the registry via `Crawler.Store.DB`
- Use your own scraper
- If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places
- Provide your own custom parser and manage how data is stored and accessed yourself
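For example, a minimal sketch of reading crawled data back out of the store. This assumes `Crawler.Store.find/1` accepts the crawled URL and returns a `%Crawler.Store.Page{}` struct (or `nil`); check the Hex docs for the exact signature in your version.

```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 1)

# Crawling is asynchronous; in a real app you'd wait or poll before reading.
Process.sleep(5_000)

case Crawler.Store.find("http://elixir-lang.org") do
  nil  -> IO.puts("page not crawled yet")
  page -> IO.puts("fetched #{byte_size(page.body)} bytes from #{page.url}")
end
```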
## Configurations

| Option | Type | Default Value | Description |
| --- | --- | --- | --- |
| `:assets` | list | `[]` | Whether to fetch any asset files, available options: `"css"`, `"js"`, `"images"`. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages, defaults to `0` which is effectively no rate limit. |
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:timeout` | integer | `5000` | Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls. |
| `:modifier` | module | `Crawler.Fetcher.Modifier` | Custom modifier, useful for adding custom request headers or options. |
| `:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or adding extra functionality. |
| `:encode_uri` | boolean | `false` | When set to `true`, applies `URI.encode/1` to the URL to be crawled. |
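For example, several of these options combined in one call (the values here are illustrative):

```elixir
# Crawl up to 2 levels deep with 5 workers, throttled to one fetch per
# 500ms, fetching CSS and image assets and saving pages to disk.
Crawler.crawl("http://elixir-lang.org",
  max_depths: 2,
  workers: 5,
  interval: 500,
  timeout: 10_000,
  assets: ["css", "images"],
  save_to: "/tmp/crawled_pages"
)
```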
## Custom Modules
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
### Retrier

Crawler uses ElixirRetry's exponential backoff strategy by default.

```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```
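A minimal sketch of a retrier that performs no retries at all, assuming the spec's callback is `perform/2` and receives the fetch function plus the crawl opts (check `Crawler.Fetcher.Retrier.Spec` for the actual contract):

```elixir
defmodule NoRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Assumed callback shape: perform/2 receives the fetch function and the
  # crawl opts; here we simply invoke the fetch once, with no retries.
  def perform(fetch_url, _opts), do: fetch_url.()
end
```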
### URL Filter

See `Crawler.Fetcher.UrlFilter`.

```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```
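For instance, a filter restricting the crawl to a single domain might look like the following sketch, assuming the spec's callback is `filter/2` returning `{:ok, boolean}` (verify against `Crawler.Fetcher.UrlFilter.Spec`):

```elixir
defmodule SingleDomainFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Assumed callback shape: filter/2 receives the URL and the crawl opts,
  # returning {:ok, true} to crawl the URL or {:ok, false} to skip it.
  def filter(url, _opts) do
    {:ok, URI.parse(url).host == "elixir-lang.org"}
  end
end
```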
### Scraper

See `Crawler.Scraper`.

```elixir
defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
```
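A sketch of a scraper that inspects each page as soon as it is parsed, assuming `scrape/1` receives a `%Crawler.Store.Page{}` and must return `{:ok, page}` (see `Crawler.Scraper.Spec` for the real contract):

```elixir
defmodule InspectingScraper do
  @behaviour Crawler.Scraper.Spec

  # Assumed callback shape: scrape/1 receives a %Crawler.Store.Page{} with
  # the fetched body, and returns {:ok, page} for crawling to continue.
  def scrape(page) do
    IO.puts("scraped #{page.url} (#{byte_size(page.body)} bytes)")
    {:ok, page}
  end
end
```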
### Parser

See `Crawler.Parser`.

```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```
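A sketch of a parser that logs each page before handing off to the default parser, assuming `parse/1` receives the page and that delegating to `Crawler.Parser.parse/1` preserves the default link-discovery behaviour:

```elixir
defmodule LoggingParser do
  @behaviour Crawler.Parser.Spec

  # Assumed callback shape: parse/1 receives the page; delegating to
  # Crawler.Parser.parse/1 keeps the default link following intact.
  def parse(page) do
    IO.puts("parsing #{page.url}")
    Crawler.Parser.parse(page)
  end
end
```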
### Modifier

```elixir
defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
```
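The configuration table describes the modifier as useful for adding custom request headers or options. A hypothetical sketch follows; the callback names `headers/2` and `options/2` are assumptions, so consult `Crawler.Fetcher.Modifier.Spec` for the actual contract:

```elixir
defmodule HeaderModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec

  # Hypothetical callbacks - check Crawler.Fetcher.Modifier.Spec for the
  # real names and return shapes before relying on this sketch.
  def headers(_url, _opts), do: [{"X-Requested-By", "my-crawler"}]
  def options(_url, _opts), do: []
end
```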
## Pause / Resume / Stop Crawler

Crawler provides `pause/1`, `resume/1` and `stop/1`, see below.

```elixir
{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)
Crawler.resume(opts)
Crawler.stop(opts)
```
Please note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`), otherwise the parser will time out due to unprocessed links.
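For example, combining `:timeout` set to `:infinity` with `pause/1`, as the configuration table suggests, keeps pausing safe regardless of how long the crawler stays paused:

```elixir
# With :timeout set to :infinity, in-flight fetches never time out
# while the crawler sits paused.
{:ok, opts} = Crawler.crawl("http://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)
# ... later ...
Crawler.resume(opts)
```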
## API Reference
Please see https://hexdocs.pm/crawler.
## Changelog
Please see CHANGELOG.md.
## Copyright and License
Copyright (c) 2016 Fred Wu
This work is free. You can redistribute it and/or modify it under the terms of the MIT License.