• This repository has been archived on 27/Jun/2022
  • Stars: 78
  • Rank 398,723 (Top 9 %)
  • Language: Crystal
  • License: MIT License
  • Created almost 5 years ago
  • Updated about 1 year ago

Repository Details

Powerful web scraping framework for Crystal

Arachnid

This project is no longer maintained. Please see Mechanize for an alternative.

Arachnid is a fast web crawler for Crystal, with multi-threading support planned. It recently underwent a full rewrite for Crystal 0.35.1, so see the documentation below for updated usage instructions.

Installation

  1. Add the dependency to your shard.yml:

    dependencies:
      arachnid:
        github: watzon/arachnid
  2. Run shards install

Usage

First, of course, you need to require arachnid in your project:

require "arachnid"

The Agent

Agent is the class that does all the heavy lifting and will be the main one you interact with. To create a new Agent, use Agent.new.

agent = Arachnid::Agent.new

The initialize method takes a bunch of optional parameters:

:client

You can, if you wish, supply your own HTTP::Client instance to the Agent. This can be useful if you want to use a proxy, provided the proxy client extends HTTP::Client.

:user_agent

The user agent to be added to every request header. You can override it with :default_headers, or on a per-host basis with :host_headers.

:default_headers

The default headers to be used in every request.

:host_headers

Headers to be applied on a per-host basis. This is a hash mapping String (host name) => HTTP::Headers.

:queue

The Arachnid::Queue instance to use for storing links waiting to be processed. The default is a MemoryQueue (which is the only one for now), but you can easily implement your own Queue using whatever you want as a backend.

:stop_on_empty

Whether or not to stop running when the queue is empty. This is true by default. If it's made false, the loop will continue when the queue empties, so be sure you have a way to keep adding items to the queue.

:follow_redirects

Whether or not to follow redirects (add them to the queue).
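
As a rough illustration of the :host_headers shape described above, the hash can be built with the standard library alone. The host names and header values are placeholders, and the commented-out Agent.new call assumes the options are passed as named arguments, which this README does not spell out.

```crystal
require "http"

# Per-host headers: host name (String) => HTTP::Headers.
# Hosts and header values are illustrative placeholders.
host_headers = {
  "crystal-lang.org" => HTTP::Headers{"Accept-Language" => "en"},
  "kemalcr.com"      => HTTP::Headers{"Accept-Language" => "de"},
}

# Assumption: the options listed above are given as named arguments.
# agent = Arachnid::Agent.new(user_agent: "my-crawler/0.1", host_headers: host_headers)

# Look up the headers that would apply to one host.
puts host_headers["kemalcr.com"]["Accept-Language"]
```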

Starting the Agent

There are four ways to start your Agent once it's been created. Here are some examples:

#start_at

#start_at starts the Agent running on a particular URL. It adds a single URL to the queue and starts there.

agent.start_at("https://crystal-lang.org") do
  # ...
end

#site

#site starts the agent running at the given URL and adds a rule that keeps the agent restricted to the given site. This allows the agent to scan the given domain and any subdomains. For instance:

agent.site("https://crystal-lang.org") do
  # ...
end

The above will match crystal-lang.org and forum.crystal-lang.org, but not github.com/crystal-lang or any other site not within the *.crystal-lang.org space.

#host

#host is like #site, but with the added restriction of staying on the exact host given. Subdomains are not included.

agent.host("crystal-lang.org") do
  # ...
end
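
The difference between #site and #host can be sketched as two plain predicates over URI. This is only an illustration of the matching rules described above, not Arachnid's actual implementation; same_site? and same_host? are hypothetical helper names.

```crystal
require "uri"

# Hypothetical helper: exact host match only (the #host rule).
def same_host?(uri : URI, host : String) : Bool
  uri.host == host
end

# Hypothetical helper: the domain itself or any subdomain (the #site rule).
def same_site?(uri : URI, domain : String) : Bool
  host = uri.host
  return false if host.nil?
  host == domain || host.ends_with?(".#{domain}")
end

forum  = URI.parse("https://forum.crystal-lang.org/latest")
github = URI.parse("https://github.com/crystal-lang")

puts same_site?(forum, "crystal-lang.org")  # subdomain is inside the site
puts same_host?(forum, "crystal-lang.org")  # but it is a different host
puts same_site?(github, "crystal-lang.org") # unrelated domain
```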

#start

Provided you already have URIs in the queue ready to be scanned, you can also just use #start to start the Agent running.

agent.enqueue("https://crystal-lang.org")
agent.enqueue("https://kemalcr.com")
agent.start

Filters

URIs can be filtered before being enqueued. There are two kinds of filters, accept and reject. Accept filters can be used to ensure that a URI matches before being enqueued. Reject filters do the opposite, keeping URIs from being enqueued if they do match.

For instance:

# This will filter out all sites where the host is not "crystal-lang.org"
agent.accept_filter { |uri| uri.host == "crystal-lang.org" }

If you want to exclude some of the URIs that the above filter accepts:

# This will ignore paths starting with "/api"
agent.reject_filter { |uri| uri.path.to_s.starts_with?("/api") }

The #site and #host methods add a default accept filter in order to keep things in the given site or host.
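
To make the interaction of the two filter kinds concrete, here is a minimal standalone sketch that does not use Arachnid itself. It assumes a URI is enqueued only when every accept filter matches and no reject filter matches; the README does not state how multiple filters are combined, so treat that rule, the Filter alias, and the enqueue? helper as assumptions.

```crystal
require "uri"

# A filter is a proc from URI to Bool (hypothetical alias for this sketch).
alias Filter = Proc(URI, Bool)

accept_filters = [] of Filter
reject_filters = [] of Filter

# Same predicates as the README examples above.
accept_filters << ->(uri : URI) { uri.host == "crystal-lang.org" }
reject_filters << ->(uri : URI) { uri.path.to_s.starts_with?("/api") }

# Assumed rule: all accept filters must pass and no reject filter may match.
def enqueue?(uri : URI, accept : Array(Filter), reject : Array(Filter)) : Bool
  accept.all? { |f| f.call(uri) } && reject.none? { |f| f.call(uri) }
end

["https://crystal-lang.org/docs", "https://crystal-lang.org/api/1.0", "https://kemalcr.com"].each do |url|
  puts "#{url} enqueue? #{enqueue?(URI.parse(url), accept_filters, reject_filters)}"
end
```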

Resources

All the above is useless if you can't do anything with the scanned resources, which is why we have the Resource class. Every scanned resource is converted into a Resource (or subclass) based on the content type. For instance, text/html becomes a Resource::HTML which is parsed using kostya/myhtml for extra speed.

Each resource has an associated Agent#on_ method so you can do something when one of those resources is scanned:

agent.on_html do |page|
  puts typeof(page)
  # => Arachnid::Resource::HTML

  puts page.title
  # => The Title of the Page
end

Currently we have:

  • #on_html
  • #on_image
  • #on_script
  • #on_stylesheet
  • #on_xml

There is also #on_resource, which is called for every resource, including ones that don't match the above types. All resources include, at minimum, the URI at which the resource was found and the response (HTTP::Client::Response) instance.
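
The idea of picking a resource type from the content type can be sketched as a small dispatch function. Only the text/html case is confirmed by this README; the other content types, the returned symbols, and the resource_kind helper are assumptions made for illustration and may not match how Arachnid classifies responses.

```crystal
# Hypothetical helper: map a Content-Type header value to a handler kind.
def resource_kind(content_type : String) : Symbol
  # Drop parameters such as "; charset=utf-8" and normalise case.
  mime = content_type.split(';').first.strip.downcase

  case mime
  when "text/html"                                 then :html
  when .starts_with?("image/")                     then :image
  when "text/css"                                  then :stylesheet
  when "application/javascript", "text/javascript" then :script
  when "application/xml", "text/xml"               then :xml
  else
    # Anything unrecognised would only reach a catch-all handler.
    :resource
  end
end

puts resource_kind("text/html; charset=utf-8")
puts resource_kind("image/png")
puts resource_kind("application/pdf")
```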

Contributing

  1. Fork it (https://github.com/watzon/arachnid/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

More Repositories

  1. marionette (Crystal, 162 stars): Selenium alternative for Crystal. Browser manipulation without the Java overhead.
  2. fbmdob (Vue, 159 stars): Facebook image Metadata Obfuscation server
  3. wsl-proxy (Batchfile, 140 stars): WSL proxy files for editor/linux interop
  4. salesforce-angular2-boilerplate (TypeScript, 33 stars): Boilerplate for building an Angular 2 application on the Salesforce platform. Includes the ability to develop locally using the Salesforce REST API
  5. JackOS (Nim, 29 stars): i686 kernel written in Nim
  6. adonis-jsonable (JavaScript, 26 stars): AdonisJs Trait Provider that aims to solve the problems with using Postgres' JSON types
  7. cru (Crystal, 25 stars): LibUI based GUI framework for Crystal
  8. ngrok.cr (Crystal, 25 stars): Ngrok wrapper for Crystal
  9. ROT26 (Crystal, 23 stars): Pure Crystal implementation of the ROT26 encryption algorithm
  10. telethon-session-mongo (Python, 21 stars): MongoDB backend for Telethon session storage
  11. pixie (Crystal, 18 stars): Making magic with Crystal and images (using ImageMagick)
  12. arg_parser (Crystal, 18 stars): A powerful JSON::Serializable like argument parser for Crystal
  13. zhtml (Zig, 17 stars): HTML parser built in Zig
  14. subnet (Crystal, 15 stars): Crystal library for working with IPv4 and IPv6 addresses
  15. browser (Crystal, 13 stars): Browser detection library for Crystal
  16. xmler (Python, 12 stars): Python library to convert dictionaries into valid XML. Supports namespaces.
  17. nacl (Crystal, 12 stars): Crystal bindings to libsodium (WIP)
  18. octokit.cr (Crystal, 12 stars): Crystal toolkit for the GitHub API (in development)
  19. json-to-crystal (JavaScript, 12 stars): Convert JSON structures into Crystal classes with JSON mappings
  20. lucky_have_i_been_pwned_validator (Crystal, 10 stars): Have I Been Pwned password validator for LuckyFramework
  21. safety (V, 10 stars): Safe types for V
  22. spotlight (Crystal, 10 stars): Search engine parsing for Crystal
  23. github-api-nim (Nim, 10 stars): Nim wrapper for the GitHub API [WIP]
  24. apatite (Crystal, 9 stars): Apatite is a fundamental package for scientific computing with Crystal
  25. paste69 (Crystal, 9 stars): Simple CURL-able pastebin
  26. identicon (Crystal, 8 stars): Pure Crystal identicon generator
  27. inkify (Rust, 8 stars): API based replacement for carbon.now.sh
  28. diff (Crystal, 8 stars): Pure Crystal implementation of various diffing algorithms
  29. docker-api (Crystal, 8 stars): Crystal wrapper for the Docker API
  30. opengraph.cr (Crystal, 7 stars): Crystal wrapper for the Open Graph protocol
  31. lucky_inertia (Crystal, 7 stars): Inertia.js adapter for the Lucky web framework
  32. freetype.cr (Crystal, 7 stars): Crystal bindings to Freetype2
  33. vnntp (V, 7 stars): Implementation of RFC 3977 (Network News Transfer Protocol) for V
  34. cor (Crystal, 7 stars): Make working with colors in Crystal fun!
  35. crodoc (Crystal, 7 stars): Robust Crystal documentation generator inspired by YARD. Just an idea right now, don't get your hopes up.
  36. gravity (Crystal, 6 stars): WIP Annotation based ORM for Crystal
  37. robots.cr (Crystal, 6 stars): Simple robots.txt parser for Crystal
  38. fenneco (Crystal, 6 stars)
  39. remind_me_bot (Crystal, 5 stars)
  40. ffmpeg.cr (Crystal, 5 stars)
  41. wasm-crypto (TypeScript, 5 stars): Various Rust crypto libraries wrapped in WASM for use in the browser
  42. telegram.cr (Crystal, 5 stars): Telegram Bot API library written in Crystal and designed with Lucky integration in mind
  43. stringscan.js (JavaScript, 4 stars): StringScanner for JavaScript
  44. lucky_htmx (Crystal, 4 stars)
  45. tomi (Crystal, 4 stars): Unified crypto currency wallet generation API for Crystal
  46. massclone (Nim, 4 stars): Easily clone all of your GitHub repos
  47. kutt (Crystal, 3 stars): Simple Crystal API wrapper for kutt.it
  48. html_parser (TypeScript, 3 stars): Port of fb55/htmlparser2 for Deno
  49. fckeverywordbot (Crystal, 3 stars): Inspired by the Fuck Every Word project on Twitter, this is a Telegram bot that does the same
  50. nimtgbot-starter (C, 3 stars)
  51. easy_oauth (Crystal, 3 stars): Crystal OAuth headers made simple
  52. cracker.js (TypeScript, 3 stars): Easy to use bot/userbot framework built on top of GramJS; Gram + Cracker = 🧡
  53. sterile (Crystal, 3 stars)
  54. extensions (Crystal, 3 stars): Helpful extensions to existing Crystal classes and modules
  55. esto.cr (Crystal, 3 stars): E621 API client for Crystal
  56. pluscode (Crystal, 3 stars): Crystal implementation of Open Location Codes
  57. paste69-old (Crystal, 3 stars)
  58. coinpayments.cr (Crystal, 3 stars): CoinPayments API wrapper for Crystal
  59. sysinfo.cr (Crystal, 3 stars): Psutil for Crystal
  60. mint-tabler (Mint, 3 stars): Tabler icon components for Mint Lang
  61. crego (Crystal, 3 stars): Steganography library for Crystal
  62. qrv (Coq, 3 stars): Generate qr codes with vlang using libqrencode
  63. kantek (Python, 3 stars): Pluggable Telegram userbot
  64. vento (V, 2 stars): Telegram Bot API library for V
  65. BigInteger.ts (TypeScript, 2 stars): Deno first BigInt wrapper based on peterolson/BigInteger.js
  66. paste69-fresh (TypeScript, 2 stars)
  67. utilibot (Crystal, 2 stars)
  68. hikoki (Python, 2 stars): Hikōki is a telegram userbot to help me with group and spam management
  69. crystalexbot (Crystal, 2 stars)
  70. manycoin.js (JavaScript, 2 stars): An expressive, promise driven wrapper around the JSON-RPC APIs for various crypto currencies
  71. waves.cr (Crystal, 2 stars): WIP WAVE format reader in Crystal
  72. tweeter (Crystal, 2 stars): Twitter API wrapper for Crystal
  73. tidal.cr (Crystal, 1 star): Unofficial Tidal API client for Crystal
  74. synaptic.cr (Crystal, 1 star): Architecture-free neural network library for Crystal
  75. watzon (1 star)
  76. tgp (Python, 1 star): WIP Python implementation of kitty's terminal graphics protocol
  77. dogefaucetbot (JavaScript, 1 star)
  78. sirbansabot (Crystal, 1 star)
  79. zedcoinbot (JavaScript, 1 star)
  80. watzon.tech-contents (1 star)
  81. twitter.cr (Crystal, 1 star): Twitter API library
  82. sysutil_gem (Ruby, 1 star)
  83. crystal-posh (Crystal, 1 star): Poshmark API client for Crystal
  84. oxford-dictionary.cr (Crystal, 1 star): Oxford Dictionary API wrapper for Crystal
  85. protokit (1 star): MTProto + Typescript = ❤️
  86. monocypherjs (Zig, 1 star)
  87. salesforce_client_tracker (JavaScript, 1 star): Chrome extension for tracking current clients on salesforce
  88. belljs (TypeScript, 1 star): GramJS for Deno
  89. crystal-razer (Crystal, 1 star): Facilitate control of your Razer chroma devices. Based on python razer-drivers (but much faster).
  90. ohshit (Crystal, 1 star): Oh Shit! is the command fixer you didn't know you needed. Inspired by thefuck.
  91. structure (Crystal, 1 star): Python struct like library for Crystal
  92. fenneco-js (TypeScript, 1 star)
  93. wrapi (Crystal, 1 star): REST API wrapping framework written in Crystal
  94. wallet-watcher (Crystal, 1 star): Watches your garlicoin wallet and notifies you if there's a change in your balance
  95. tgcmndr (1 star): Telegram group management and stats
  96. dolly (Crystal, 1 star): WIP port of Playwright for Crystal
  97. reader (1 star): A set of methods for processing keyboard input in character, line and multiline modes.
  98. crystal-emoji-regex (Crystal, 1 star): 💎 A set of Crystal regular expressions for matching Unicode Emoji symbols.
  99. aoc2023 (CSS, 1 star)