Collyzar

A distributed redis-based framework for colly.

Collyzar provides a very simple configuration and tools to implement distributed crawling/scraping.

Features

Simple configuration and clean API
Distributed crawling/scraping
Built-in global bloom filter
Built-in spider cache
Support redis command
Multi-machine load balancing
Support to pause or stop all crawling machines
Pass additional information to the crawler and get it inside the crawler and store it in the database

Installation

Add collyzar to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/Zartenc/collyzar/v2 latest
)

Example Usage

See examples folder for more detailed examples.

Crawler cluster machine

SpiderName must be unique.

After running, it will always monitor the redis crawler queue for crawling until it receives a pause or stop signal.

func main(){
    cs := &collyzar.CollyzarSettings{
    		SpiderName: "zarten",
    		Domain:     "www.amazon.com",
    		RedisIp:    "127.0.0.1",
    	}
	collyzar.Run(myResponse, cs, nil)
}

func myResponse(response *collyzar.ZarResponse){
	fmt.Println(response.StatusCode)
}

Control machine

Push url to redis queue

func main(){
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	url := "https://www.amazon.com"
	pushInfo := collyzar.PushInfo{Url:url}

	err := ts.PushToQueue(pushInfo)
	if err != nil{
		fmt.Println(err)
	}
}

Tools

Provide tools including stop crawlers and pause crawlers.

Stop all crawlers

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	err := ts.StopSpiders()
	if err != nil{
		fmt.Println(err)
	}
}

Pause all crawlers

For all crawlers, the crawler process is idle after pausing the crawler.
Then you can use the WakeupSpiders method to wake up the crawlers.

func main() {
	ts := collyzar.NewToolSpider("127.0.0.1", 6379, "", "zarten")

	err := ts.PauseSpiders()
	if err != nil{
		fmt.Println(err)
	}
}

Bugs

Bugs or suggestions? Visit the issue tracker

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow").

Zartenc/collyzar

Zartenc

Reviews

Repository Details