  • Stars: 277
  • Rank: 148,875 (Top 3 %)
  • Language: Go
  • License: MIT License
  • Created about 4 years ago
  • Updated over 1 year ago

Repository Details

A web crawler for Go



ant (alpha) is a web crawler for Go.








Declarative

The package includes functions that scan data from the page into your structs or slices of structs. This reduces the noise and complexity in your source code.

You can also use a jQuery-like API that allows you to scrape complex HTML pages if needed.

var data struct { Title string `css:"title"` }
page, _ := ant.Fetch(ctx, "https://apple.com")
page.Scan(&data)
data.Title // => Apple
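
To give a feel for how tag-driven scanning can work, here is a minimal sketch that reads `css` struct tags with the standard reflect package. It is not ant's implementation: scanInto, scanTitle and the lookup callback are hypothetical names, and the "page" is just a map from selector to text instead of real HTML.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// scanInto fills the string fields of the struct pointed to by dst.
// For every field carrying a `css:"..."` tag it calls lookup with the
// selector and stores the result. lookup is a hypothetical stand-in for
// a real HTML query; it is not part of ant.
func scanInto(dst interface{}, lookup func(selector string) string) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Ptr || v.IsNil() || v.Elem().Kind() != reflect.Struct {
		return errors.New("scanInto: dst must be a non-nil pointer to a struct")
	}
	v = v.Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		selector, ok := t.Field(i).Tag.Lookup("css")
		if !ok || v.Field(i).Kind() != reflect.String || !v.Field(i).CanSet() {
			continue // skip untagged, non-string, or unexported fields
		}
		v.Field(i).SetString(lookup(selector))
	}
	return nil
}

// scanTitle scans a struct with one tagged field against a fake page,
// represented here as a selector-to-text map.
func scanTitle(page map[string]string) (string, error) {
	var data struct {
		Title string `css:"title"`
	}
	err := scanInto(&data, func(sel string) string { return page[sel] })
	return data.Title, err
}

func main() {
	title, err := scanTitle(map[string]string{"title": "Apple"})
	fmt.Println(title, err)
}
```

A real scanner would also have to handle nested structs and slices, which this sketch leaves out.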

Headless

By default the crawler uses http.Client. However, if you're crawling SPAs you can use the antcdp.Client implementation, which lets you crawl pages with headless Chrome.

eng, err := ant.NewEngine(ant.EngineConfig{
  Fetcher: &ant.Fetcher{
    Client: antcdp.Client{},
  },
})

Polite

The crawler automatically fetches and caches robots.txt, making sure that it never causes issues for small website owners. Of course, you can disable this behavior.

eng, err := ant.NewEngine(ant.EngineConfig{
  Impolite: true,
})
eng.Run(ctx)
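
For readers curious what a robots.txt check involves, the sketch below parses the rules for all user agents and tests a path against them. It is a deliberately simplified illustration, not ant's code: it only understands `User-agent: *` groups and plain `Disallow` prefixes (no `Allow`, no wildcards), and the names disallowed and allowed are made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowed returns the Disallow prefixes that apply to every user agent
// ("User-agent: *") in the given robots.txt body. Simplified on purpose:
// Allow rules, wildcards and multi-agent groups are ignored.
func disallowed(robots string) []string {
	var rules []string
	applies := false
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue // blank or malformed line
		}
		key := strings.ToLower(strings.TrimSpace(parts[0]))
		value := strings.TrimSpace(parts[1])
		switch key {
		case "user-agent":
			applies = value == "*"
		case "disallow":
			if applies && value != "" {
				rules = append(rules, value)
			}
		}
	}
	return rules
}

// allowed reports whether path may be crawled under the parsed rules.
func allowed(robots, path string) bool {
	for _, prefix := range disallowed(robots) {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /private\n"
	fmt.Println(allowed(robots, "/private/notes"), allowed(robots, "/blog"))
}
```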

Concurrent

The crawler maintains a configurable number of "worker" goroutines that read URLs off the queue and spawn a goroutine for each URL.

Depending on your configuration, you may want to increase the number of workers to speed up URL reads. Of course, if you don't have enough resources you can reduce the number of workers too.

eng, err := ant.NewEngine(ant.EngineConfig{
  // Spawn 5 worker goroutines that dequeue
  // URLs and spawn a new goroutine for each URL.
  Workers: 5,
})
eng.Run(ctx)
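
The worker model described above can be sketched with nothing but channels and sync.WaitGroup. This is an illustration under assumptions, not the engine's actual code: crawl and the visit callback are hypothetical names, and the queue is a plain unbuffered channel.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl starts `workers` goroutines that dequeue URLs from a channel and
// spawn one goroutine per URL. visit is a hypothetical per-URL callback;
// its results are collected under a mutex.
func crawl(urls []string, workers int, visit func(url string) string) []string {
	if workers < 1 {
		workers = 1 // at least one worker must drain the queue
	}
	queue := make(chan string)
	var (
		mu      sync.Mutex
		results []string
		dequeue sync.WaitGroup // tracks the worker goroutines
		visits  sync.WaitGroup // tracks the per-URL goroutines
	)

	for i := 0; i < workers; i++ {
		dequeue.Add(1)
		go func() {
			defer dequeue.Done()
			for u := range queue {
				visits.Add(1)
				go func(u string) {
					defer visits.Done()
					r := visit(u)
					mu.Lock()
					results = append(results, r)
					mu.Unlock()
				}(u)
			}
		}()
	}

	for _, u := range urls {
		queue <- u
	}
	close(queue)
	dequeue.Wait() // every URL has been handed to a goroutine
	visits.Wait()  // every per-URL goroutine has finished

	sort.Strings(results) // goroutine completion order is not deterministic
	return results
}

func main() {
	got := crawl([]string{"b.example", "a.example"}, 5, func(u string) string {
		return "visited " + u
	})
	fmt.Println(got)
}
```

Because each URL gets its own goroutine, the worker count in this sketch only bounds how fast URLs are dequeued, not how many pages are in flight at once.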

Rate limits

The package includes a powerful ant.Limiter interface that allows you to define rate limits per URL. There are some built-in limiters as well.

ant.Limit(1) // 1 rps on all URLs.
ant.LimitHostname(5, "amazon.com") // 5 rps on amazon.com hostname.
ant.LimitPattern(5, "amazon.com.*") // 5 rps on URLs starting with `amazon.com.`.
ant.LimitRegexp(5, `^apple\.com/iphone/.*`) // 5 rps on URLs that match the regex.

Note that LimitPattern and LimitRegexp only match on the host and path of the URL.
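
To illustrate what a per-hostname limit does, here is a small sketch that spaces requests evenly and reports how long each one would have to wait. It does not use ant.Limiter (whose exact signature is not shown here); hostLimiter, reserve and waitsMillis are hypothetical names, and the clock is passed in so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostLimiter spaces out requests to one hostname so that at most rps
// requests per second start. Hypothetical type, not ant's Limiter.
type hostLimiter struct {
	hostname string
	interval time.Duration // minimum gap between two requests

	mu   sync.Mutex
	next time.Time // earliest time the next request may start
}

func newHostLimiter(rps int, hostname string) *hostLimiter {
	if rps < 1 {
		rps = 1
	}
	return &hostLimiter{hostname: hostname, interval: time.Second / time.Duration(rps)}
}

// reserve returns how long a request to host arriving at now has to wait.
// Requests to other hostnames are never delayed.
func (l *hostLimiter) reserve(host string, now time.Time) time.Duration {
	if host != l.hostname {
		return 0
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	if now.After(l.next) {
		l.next = now
	}
	wait := l.next.Sub(now)
	l.next = l.next.Add(l.interval)
	return wait
}

// waitsMillis reserves n requests that all arrive at the same instant and
// returns each wait in milliseconds.
func waitsMillis(l *hostLimiter, host string, n int) []int64 {
	now := time.Unix(0, 0)
	waits := make([]int64, 0, n)
	for i := 0; i < n; i++ {
		waits = append(waits, l.reserve(host, now).Milliseconds())
	}
	return waits
}

func main() {
	l := newHostLimiter(5, "amazon.com")
	fmt.Println(waitsMillis(l, "amazon.com", 3)) // [0 200 400]
}
```

A caller would sleep for the returned duration (or select on it together with ctx.Done()) before issuing the request.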


Matchers

Another powerful interface is ant.Matcher, which lets you define URL matchers. Matchers are called before URLs are queued.

ant.MatchHostname("amazon.com") // scrape amazon.com URLs only.
ant.MatchPattern("amazon.com/help/*")
ant.MatchRegexp(`amazon\.com/help/.+`)
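
The idea of filtering URLs before they are queued can be sketched as follows. The matcher type and the functions below are hypothetical and only mimic the built-ins; the sketch also assumes that regexp matchers look at the host and path only, the way the limiters are described above.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// matcher reports whether a URL should be queued.
// Hypothetical function type, not ant.Matcher.
type matcher func(u *url.URL) bool

// matchHostname accepts URLs whose hostname equals host.
func matchHostname(host string) matcher {
	return func(u *url.URL) bool { return u.Hostname() == host }
}

// matchRegexp accepts URLs whose host + path match expr. The query string
// and scheme are ignored (an assumption made for this sketch).
func matchRegexp(expr string) matcher {
	re := regexp.MustCompile(expr)
	return func(u *url.URL) bool { return re.MatchString(u.Host + u.Path) }
}

// shouldQueue parses raw and applies every matcher; all of them must accept.
func shouldQueue(raw string, ms ...matcher) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	for _, m := range ms {
		if !m(u) {
			return false
		}
	}
	return true
}

func main() {
	help := matchRegexp(`amazon\.com/help/.+`)
	fmt.Println(shouldQueue("https://amazon.com/help/returns", help))
	fmt.Println(shouldQueue("https://amazon.com/cart", help))
}
```

Note the raw string literal for the regular expression: `\.` is not a valid escape inside a double-quoted Go string.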

Robust

The crawl engine automatically retries any error that implements a Temporary() bool method returning true.

Because the standard library returns errors that implement that interface, the engine will retry most temporary network and HTTP errors.

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},
  MaxAttempts: 5,
})

// Blocks until one of the following is true:
//
// 1. No more URLs to crawl (the scraper stops returning URLs)
// 2. A non-temporary error occurred.
// 3. MaxAttempts was reached.
//
err = eng.Run(ctx)
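
The retry rule can be illustrated with a short sketch built on errors.As. It is not the engine's code: retry, flaky, tempErr and errFatal are hypothetical names, and the temporary interface simply has the same shape as the Temporary method found on net.Error.

```go
package main

import (
	"errors"
	"fmt"
)

// temporary is the interface checked for retryable errors.
type temporary interface{ Temporary() bool }

// tempErr is a hypothetical error type that always reports itself as temporary.
type tempErr struct{ msg string }

func (e tempErr) Error() string   { return e.msg }
func (e tempErr) Temporary() bool { return true }

// errFatal is a plain error without a Temporary method.
var errFatal = errors.New("not found")

// retry calls fn until it succeeds, returns a non-temporary error, or
// maxAttempts is reached. It returns the number of attempts made.
func retry(maxAttempts int, fn func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil {
			return attempt, nil
		}
		var t temporary
		if !errors.As(err, &t) || !t.Temporary() {
			return attempt, err // not retryable
		}
	}
	return maxAttempts, err
}

// flaky returns a function that fails with a temporary error the given
// number of times and then succeeds.
func flaky(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return tempErr{"connection reset"}
		}
		return nil
	}
}

func main() {
	attempts, err := retry(5, flaky(2))
	fmt.Println(attempts, err) // 3 <nil>
}
```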

Built-in Scrapers

The whole point of scraping is to extract data from websites into a machine-readable format such as CSV or JSON. ant comes with built-in scrapers to make this ridiculously easy. Here's a full crawler that writes quotes to stdout.

package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/yields/ant"
)

func main() {
	var url = "http://quotes.toscrape.com"
	var ctx = context.Background()
	var start = time.Now()

	type quote struct {
		Text string   `css:".text"   json:"text"`
		By   string   `css:".author" json:"by"`
		Tags []string `css:".tag"    json:"tags"`
	}

	type page struct {
		Quotes []quote `css:".quote" json:"quotes"`
	}

	eng, err := ant.NewEngine(ant.EngineConfig{
		Scraper: ant.JSON(os.Stdout, page{}, `li.next > a`),
		Matcher: ant.MatchHostname("quotes.toscrape.com"),
	})
	if err != nil {
		log.Fatalf("new engine: %s", err)
	}

	if err := eng.Run(ctx, url); err != nil {
		log.Fatal(err)
	}

	log.Printf("scraped in %s :)", time.Since(start))
}

Testing

The anttest package makes it easy to test your scraper implementation. It fetches a page by URL, caches it in the OS's temporary directory, and reuses it.

The cache depends on the file's modtime and expires daily. You can adjust the TTL by setting anttest.FetchTTL.

// Fetch calls `t.Fatal` on errors.
page := anttest.Fetch(t, "https://apple.com")
_, err := myscraper.Scrape(ctx, page)
assert.NoError(err)
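
A modtime-based cache like the one described above might look roughly like this. The sketch is an assumption about the general technique, not anttest's implementation: cachedFetch, fetchCount and the file naming scheme are invented for illustration, and the network fetch is replaced by a fake function.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// cachedFetch returns the cached body for url if the cache file in dir is
// younger than ttl, judged by its modification time. Otherwise it calls
// fetch and rewrites the file. Hypothetical helper, not part of anttest.
func cachedFetch(dir, url string, ttl time.Duration, fetch func(string) ([]byte, error)) ([]byte, error) {
	sum := sha256.Sum256([]byte(url))
	path := filepath.Join(dir, hex.EncodeToString(sum[:])+".html")

	if info, err := os.Stat(path); err == nil && time.Since(info.ModTime()) < ttl {
		return os.ReadFile(path) // cache hit
	}
	body, err := fetch(url)
	if err != nil {
		return nil, err
	}
	return body, os.WriteFile(path, body, 0o644)
}

// fetchCount fetches the same URL n times through a fresh cache directory
// and reports how often the fake network fetch actually ran.
func fetchCount(n int, ttl time.Duration) (int, error) {
	dir, err := os.MkdirTemp("", "fetchcache")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	calls := 0
	fake := func(string) ([]byte, error) {
		calls++
		return []byte("<title>Apple</title>"), nil
	}
	for i := 0; i < n; i++ {
		if _, err := cachedFetch(dir, "https://apple.com", ttl, fake); err != nil {
			return calls, err
		}
	}
	return calls, nil
}

func main() {
	calls, err := fetchCount(3, 24*time.Hour)
	fmt.Println(calls, err)
}
```

With a 24 hour TTL, only the first of the three calls reaches the fake fetcher; the other two are served from the cache file.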


More Repositories

1. phony (Go, 733 stars): Tiny command line fake data generator.
2. k (JavaScript, 185 stars): keyboard event dispatcher.
3. editable (JavaScript, 178 stars): Fixing contenteditable.
4. select (JavaScript, 102 stars): modern <select>
5. store (JavaScript, 68 stars): local store, unserializes and serializes values automagically :)
6. shortcuts (JavaScript, 53 stars): keyboard shortcuts, similiar to component/events.
7. gravy (JavaScript, 44 stars): saucelabs
8. sortable (JavaScript, 42 stars): UI Sortable.
9. fmt (JavaScript, 39 stars): tiny fmt utility
10. coverage (CSS, 39 stars): code coverage
11. sublime-go (Python, 38 stars): An opinionated plugin for Go.
12. component-bundle (JavaScript, 37 stars): component-bundle(1)
13. uniq-selector (JavaScript, 37 stars): get a uniq css selector from element.
14. component-graph (JavaScript, 29 stars): component-graph
15. serialize (JavaScript, 27 stars): serialize a form to urlencoded string.
16. colorpicker (JavaScript, 24 stars): minimal colorpicker.
17. pick (Go, 24 stars): "pick" stuff from html source.
18. redact-popover (JavaScript, 18 stars): medium inspired editor popover
19. dex (JavaScript, 18 stars): Lightweight IndexedDB wrapper
20. paper-stack (17 stars): paper stack effect with css.
21. select-reflect (JavaScript, 16 stars): reflect native <select> to yields/select instance.
22. lru-cache (JavaScript, 15 stars): LRU Cache
23. css-ease (JavaScript, 14 stars): CSS Easing functions
24. ago (JavaScript, 13 stars): Date(now - 1e3) => "a second ago"
25. on-select (JavaScript, 12 stars): Invoke a callback when a user selects some text.
26. instrument (JavaScript, 11 stars): in-browser code coverage instrumentation
27. cycle (JavaScript, 10 stars): modern selectbox.
28. k-sequence (JavaScript, 10 stars): keyboard sequences
29. measure-string (JavaScript, 9 stars): Measure a string width.
30. stream-log (JavaScript, 9 stars): stream logger
31. is-touch (JavaScript, 9 stars): Check if touch is supported.
32. to-element (JavaScript, 9 stars): get a node from value.
33. clear-timeouts (JavaScript, 9 stars): clear all timeouts
34. mongoose-time (JavaScript, 9 stars): timestamps for mongoose schemas
35. traverse (JavaScript, 8 stars): low level traverse function, inspired by $.dir
36. scrolltop (JavaScript, 8 stars): get the window's scrolltop value, cross-browser.
37. zip (JavaScript, 8 stars): zip stuff.
38. path-lookup (JavaScript, 8 stars): lookup path within `object`.
39. grow-width (JavaScript, 7 stars): grow input's width.
40. keycode (JavaScript, 7 stars): name to keycode
41. clear-intervals (JavaScript, 7 stars): clear all intervals
42. status (JavaScript, 7 stars): user activity emitter, "idle" "active" etc..
43. mongoose-slug (JavaScript, 7 stars): mongoose slug plugin
44. co-timeout (JavaScript, 7 stars): co timeout.
45. carry (JavaScript, 6 stars): Carry over attrs and classes from one element to another.
46. apool (JavaScript, 6 stars): generic pool
47. xhr (JavaScript, 6 stars): Cross-browser XMLHttpRequest
48. editable-placeholder (JavaScript, 6 stars): Editable placeholder a la medium.com
49. approximate-time (JavaScript, 6 stars): approximate human readable time
50. send-json (JavaScript, 6 stars): send json across domains and browsers.
51. delegate-events (JavaScript, 6 stars): delegate events from one emitter to another
52. sublime-reload (Python, 5 stars): refresh the browser on sublime post save.
53. set-active (JavaScript, 5 stars): Set document.activeElement.
54. download (JavaScript, 5 stars): download files with `xhr`, report progress and send the file.
55. slug (JavaScript, 5 stars): slug component
56. extensible (JavaScript, 5 stars): extensible constructors
57. rework-pseudos (JavaScript, 5 stars): rework pseudo elements support.
58. crop (JavaScript, 5 stars): Image cropper.
59. emitter-mixin (JavaScript, 5 stars): EventEmitter mixin
60. uniq (JavaScript, 4 stars): array unique component
61. sortable-table (JavaScript, 4 stars): Sortable table.
62. indexof (JavaScript, 4 stars): indexof element.
63. data (JavaScript, 4 stars): attach data to elements. think $.data()
64. atkinson (JavaScript, 4 stars): Atkinson can remember form input data across requests.
65. prevent (JavaScript, 4 stars): Cross browser prevent default, because microsoft is awesome.
66. placeholder (JavaScript, 4 stars): Placeholder for older browsers.
67. visibility (JavaScript, 3 stars): Sane page visibility API.
68. sidebar (JavaScript, 3 stars): sidebar implementation, inspired by OSX notification center.
69. wd-browser (JavaScript, 3 stars): parse browser names
70. empty (JavaScript, 3 stars): empty an element.
71. svg-create (JavaScript, 3 stars): Create svg elements
72. load-image (JavaScript, 3 stars): load an image
73. idb-request (JavaScript, 3 stars): Tiny IDBRequest wrapper that allows node style callbacks
74. unserialize (JavaScript, 3 stars): Unserializes stringified json correctly.
75. scan-html (JavaScript, 3 stars): tiny html lexer.
76. fuzzy-object (JavaScript, 3 stars): fuzzy object.
77. eql (JavaScript, 3 stars): eql utility
78. function-source (JavaScript, 3 stars): get inner function source
79. cover-map (JavaScript, 3 stars): map coverage data from yields/instrument
80. parse-attrs (JavaScript, 2 stars): html attribute parser.
81. editable-shortcuts (JavaScript, 2 stars): add shortcuts to Editable instance.
82. buffer-events (JavaScript, 2 stars): Buffer Event Emitter events.
83. currency (JavaScript, 2 stars): format currency
84. mixtur (JavaScript, 2 stars): inline css with html
85. svg-attributes (JavaScript, 2 stars): SVG Attributes
86. capitalize (JavaScript, 2 stars): if i only had a nickel for every time i wrote this...
87. hasflash (JavaScript, 2 stars): Wether or not the browser has flash plugin enabled.
88. merge (JavaScript, 2 stars): merge two objects
89. get-selected-text (JavaScript, 2 stars): get user selected text
90. skeleton (CSS, 2 stars): skeleton's css
91. isArray (JavaScript, 2 stars): es5 isArray
92. rework-ignore-selectors (JavaScript, 2 stars): Ignore the given selectors.
93. before (JavaScript, 1 star)
94. dos-time (JavaScript, 1 star): get / convert a date to dos-timestamp
95. normalize-case (JavaScript, 1 star): tiny utility to normalize case recursively
96. dos-date (JavaScript, 1 star): get / convert a date to DOS date.
97. progress (JavaScript, 1 star): generic progress emitter.
98. hms (JavaScript, 1 star): get hours, minutes and seconds from milliseconds
99. within (JavaScript, 1 star): Date(now + 1e3) => "in a second"
100. wrap (JavaScript, 1 star): wrap an element