  • Stars
    11,370
  • Rank 2,882 (Top 0.06 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created over 11 years ago
  • Updated 20 days ago

Repository Details

A scalable web crawler framework for Java.

WebMagic is a scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It simplifies the development of a site-specific crawler.
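To make the four lifecycle stages concrete, here is a toy sketch in plain JDK Java. It is not WebMagic code and none of the names in it (`TinyCrawler`, `crawl`) are WebMagic APIs; a map lookup stands in for HTTP downloading and an in-memory list stands in for persistence, purely as assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of the four stages; none of these names are WebMagic APIs.
class TinyCrawler {

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    // Crawls a fake in-memory "web" (URL -> HTML) from the seed and returns the stored records.
    static List<String> crawl(Map<String, String> web, String seed) {
        Queue<String> queue = new ArrayDeque<>();   // URL management: pending URLs
        Set<String> seen = new HashSet<>();         // URL management: de-duplication
        List<String> store = new ArrayList<>();     // persistence: an in-memory sink
        queue.add(seed);
        seen.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = web.get(url);             // downloading: a map lookup stands in for HTTP
            if (html == null) {
                continue;                           // unknown page, nothing to process
            }
            Matcher title = TITLE.matcher(html);    // content extraction
            if (title.find()) {
                store.add(url + " -> " + title.group(1));
            }
            Matcher link = HREF.matcher(html);      // discover follow-up URLs
            while (link.find()) {
                if (seen.add(link.group(1))) {      // add() returns false for URLs already seen
                    queue.add(link.group(1));
                }
            }
        }
        return store;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
                "/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/a\">self</a>",
                "/b", "<title>B</title><a href=\"/a\">back</a>");
        System.out.println(crawl(web, "/a"));
    }
}
```

A framework earns its keep by letting each of these stages be swapped independently (a real HTTP client, a shared queue, a database sink) while the loop itself stays the same.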

Features:

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Annotated POJOs to customize a crawler, with no configuration files.
  • Multi-threading and distributed crawling support.
  • Easy to integrate.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>

WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use a different slf4j implementation, exclude slf4j-log4j12 from the dependency:

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>

Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler that collects GitHub repository information.

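import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Retry a failed download up to 3 times and sleep 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like a repository URL.
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // The author is the first path segment of the page URL.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // No repository name found, so skip this page.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Start from one profile page and crawl with 5 threads.
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}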
  • page.addTargetRequests(links)

    Adds the given URLs to the crawl queue.
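The two regular expressions in the processor decide which links are followed and what counts as the author. The sketch below runs the same expressions with plain `java.util.regex` to show what they capture. It is not WebMagic's selector implementation: the class name `GithubUrlPatterns`, the helper names and the sample URLs are made up for illustration, and find-style matching for the link pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK illustration; this is not WebMagic's selector implementation.
class GithubUrlPatterns {

    // The same expressions used in the processor above.
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Returns capture group 1 of every repository-shaped URL found in the text.
    static List<String> repoLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = REPO_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the author segment of a repository URL, or null if the URL does not match.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "see https://github.com/code4craft/webmagic and https://github.com/code4craft";
        System.out.println(repoLinks(sample));
        System.out.println(author("https://github.com/code4craft/webmagic"));
        System.out.println(author("https://github.com/code4craft"));
    }
}
```

One thing this makes visible: `\w` does not match hyphens or dots, so a URL such as https://github.com/code4craft/tiny-spring is captured only up to `tiny`. A character class like `[\w.-]+` is one way to widen the pattern if such names matter.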

You can also define the same crawler with annotations:

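import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// Pages matching @TargetUrl are extracted; pages matching @HelpUrl are only used to find more links.
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    // notNull = true: pages without a repository name are discarded.
    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}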
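For readers curious how an annotated POJO can drive extraction at all, the following is a minimal reflection sketch using only the JDK. It is not how WebMagic is implemented: the `@FromUrl` annotation, `RepoModel` and `UrlBinder` are hypothetical names invented for this example, and it only handles URL regexes, not XPath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of annotation-driven binding; not WebMagic's implementation.
class UrlBinder {

    // Hypothetical annotation for this sketch only: a regex whose group 1 fills the field.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface FromUrl {
        String value();
    }

    // Example model; the field rules mirror the URL patterns used in this README.
    static class RepoModel {
        @FromUrl("https://github\\.com/(\\w+)/.*")
        String author;

        @FromUrl("https://github\\.com/\\w+/(\\w+)")
        String name;
    }

    // Creates an instance of the type and fills each @FromUrl field from the URL.
    static <T> T bind(Class<T> type, String url) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                FromUrl rule = field.getAnnotation(FromUrl.class);
                if (rule == null) {
                    continue;                       // field carries no extraction rule
                }
                Matcher m = Pattern.compile(rule.value()).matcher(url);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));   // fields that do not match stay null
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot bind " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        RepoModel repo = bind(RepoModel.class, "https://github.com/code4craft/webmagic");
        System.out.println(repo.author + " / " + repo.name);
    }
}
```

The appeal of this style is that the extraction rules sit next to the fields they fill, so the model class doubles as the crawler's configuration.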

Docs and samples:

Documents: http://webmagic.io/docs/

The architecture of WebMagic is modeled on Scrapy.

[Figure: WebMagic architecture]

There are more examples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Thanks:

To write WebMagic, I referred to the projects below:

Mailing list:

https://groups.google.com/forum/#!forum/webmagic-java

http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642 542327088

Related Project

  • Gather Platform

    A web console based on WebMagic for Spider configuration and management.

More Repositories

1

tiny-spring

A tiny IoC container modeled on Spring.
Java
4,037
star
2

netty-learning

Netty learning.
Java
3,546
star
3

jsoup-learning

Jsoup study notes, with some learning code and comments added.
Java
636
star
4

xsoup

When jsoup meets XPath.
Java
464
star
5

hello-design-pattern

Hello world using all 23 kinds of GoF design patterns.
Java
390
star
6

blackhole

A simple non-recursive DNS server. It can easily be configured to intercept certain kinds of requests and point them to one address.
Java
242
star
7

os-learning

A Java programmer's study of the Linux kernel.
C
23
star
8

hostd

Tools to customize your domain resolution rules. Uses BlackHole as the DNS server.
JavaScript
19
star
9

lucene-learning

Lucene learning.
14
star
10

pdnsd

fork of pdnsd https://gitorious.org/pdnsd
C
14
star
11

netty-servlet

A tiny servlet container using netty.
Java
13
star
12

spring-learning

Notes from reading the Spring source code, targeting version 2.5.6.
Java
12
star
13

mocksocks

A socks proxy for network monitor.
Java
12
star
14

express.java

A tiny RESTful web framework with an embedded server. Used instead of JMX for cross-language communication.
Java
12
star
15

moonlink

A short url service based on OpenResty and redis.
Lua
10
star
16

jsocks

SOCKS server in Java. Mirror of jsocks on Google Code, with the build switched from Ant to Maven.
Java
10
star
17

labpages

Pages hooks for gitlab.
Ruby
8
star
18

blackhole-bin

Binary distribution backup of blackhole
Shell
7
star
19

FizzBuzzWhizz

Practice in OOP for thoughtworks quiz FizzBuzzWhizz.
Java
5
star
20

termblog

My oschina blog with jsterm.
Java
5
star
21

classic-algorithms

Classic algorithms implemented in Java. Just for practice.
Java
5
star
22

tavern

A tool for modularizing and integrating web projects based on JAR packages.
Java
5
star
23

monkeysocks

A SOCKS proxy in Java. It can be used to record network traffic and replay it for tests.
Java
5
star
24

freemarker-learning

Freemarker study notes.
Java
5
star
25

imgcrawler

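imgcrawler is a tool that crawls search results from e-commerce sites and displays them together on one web page. Its purpose? No idea. It was actually a training assignment, and since the implementation was fairly complete, I uploaded it.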
5
star
26

code4craft.github.com

Life is to explore.
HTML
4
star
27

wifesays

Wifesays is a socket listener for Java programs. It listens to what the wife says and notifies all the workers!
Java
4
star
28

abc

'A'nother 'B'ean 'C'opier.
Java
4
star
29

reviewbot

A small foolproofing tool for GitLab that automatically fixes your dumb code.
JavaScript
4
star
30

soa-research

Research on service governance in SOA environments.
3
star
31

xpathmagic

A chrome plugin to get XPath of elements.
3
star
32

groovy-learning

Practice code in Groovy.
Groovy
3
star
33

tinycat

A tiny web container modeled on Tomcat.
3
star
34

leetcode

Solutions for https://oj.leetcode.com/
Java
3
star
35

bigdata-learning

3
star
36

forger

Dynamic Java object generator with template class and configuration.
Java
3
star
37

dp-idea

Idea plugin for dianping.
Java
3
star
38

coursera

Just coursera notes.
2
star
39

exciting

A chrome plugin to watch your new stars! Exciting!
JavaScript
2
star
40

java-facilities

Examples of Java facilities, such as JVM serializers and template engines.
Java
2
star
41

daogen

Dao generator for java.
JavaScript
2
star
42

MemoriesOn

A place to record things seen and learned, similar to http://see.sl088.com/
2
star
43

jdk-learning

An introduction to learning Java concurrency.
Java
2
star
44

codecraft

codecraft repo
Java
2
star
45

mockmoon

A simple Lua extension based on OpenResty. It can serve a specific file as a mock for a specific URL.
Lua
2
star
46

ibatis-plugin

iBATIS plugin is aimed to accelerate iBATIS development in IntelliJ IDEA. Mirror of https://code.google.com/p/ibatis-plugin .
Java
2
star
47

js-learning

1
star
48

gugugua-dianconvertor

gugugua-dianconvertor is a simple tool that converts a Diandian backup XML file to a WordPress XML file. Currently only text-type files are supported.
Groovy
1
star
49

phantomJava

A headless WebKit scriptable with a Java API.
1
star
50

dp-alfred-workflow

Alfred workflow for dianping.
1
star
51

mocksocks-html

Web panel for mocksocks, built with fashionable front-end technologies.
JavaScript
1
star
52

sqlparser

A simple sqlparser.
Java
1
star
53

my-tech-radar

My new technology radar.
1
star
54

csapp-learning

Reading notes on Computer Systems: A Programmer's Perspective.
1
star
55

war4e

@deprecated, see jetty-runner http://www.eclipse.org/jetty/documentation/current/jetty-runner.html
Java
1
star
56

textmagic

Textmagic is a text extractor with a powerful expression language to config.
Java
1
star
57

hessian-blacklist

Some classes that Hessian2 cannot serialize/deserialize correctly.
Java
1
star
58

hello-ai

1
star
59

spring-practice

My spring best practice.
Java
1
star
60

imcaptcha

Captcha by image distortion.
Java
1
star