  • Stars
    317
  • Rank 132,216 (Top 3 %)
  • Language
    Java
  • License
    MIT License
  • Created almost 7 years ago
  • Updated almost 7 years ago

Repository Details

🎊 Design and implementation of a lightweight crawler framework.

Elves

The design and implementation of a lightweight crawler framework, with an accompanying blog post analysis.

@biezhi on zhihu

Features

  • Event-driven
  • Easy to customize
  • Multithreaded execution
  • CSS selector and XPath support

Maven coordinates

<dependency>
    <groupId>io.github.biezhi</groupId>
    <artifactId>elves</artifactId>
    <version>0.0.2</version>
</dependency>

If you want to run the project source locally, make sure you are on Java 8 and have the lombok plugin installed.

Architecture diagram

Call flow diagram

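Quick start

Building a crawler takes the following steps:

  1. Write a crawler class that extends Spider
  2. Set the list of URLs to crawl
  3. Implement the Spider#parse method
  4. Add a Pipeline to process the data filtered by parse

For example: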

public class DoubanSpider extends Spider {

    public DoubanSpider(String name) {
        super(name);
        this.startUrls(
            "https://movie.douban.com/tag/爱情",
            "https://movie.douban.com/tag/喜剧",
            "https://movie.douban.com/tag/动画",
            "https://movie.douban.com/tag/动作",
            "https://movie.douban.com/tag/史诗",
            "https://movie.douban.com/tag/犯罪");
    }

    @Override
    public void onStart(Config config) {
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("Save to file: {}", item));
    }

    public Result parse(Response response) {
        Result<List<String>> result   = new Result<>();
        Elements             elements = response.body().css("#content table .pl2 a");

        List<String> titles = elements.stream().map(Element::text).collect(Collectors.toList());
        result.setItem(titles);

        // Get the URL of the next page
        Elements nextEl = response.body().css("#content > div > div.article > div.paginator > span.next > a");
        if (null != nextEl && nextEl.size() > 0) {
            String  nextPageUrl = nextEl.get(0).attr("href");
            Request nextReq     = this.makeRequest(nextPageUrl, this::parse);
            result.addRequest(nextReq);
        }
        return result;
    }

}

public static void main(String[] args) {
    DoubanSpider doubanSpider = new DoubanSpider("豆瓣电影");
    Elves.me(doubanSpider, Config.me()).start();
}
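
The example above leaves the crawl loop to the framework: parse returns a Result carrying both an item and any follow-up requests, and the pipeline receives each item. The sketch below illustrates that flow with the JDK only. It is not the Elves implementation: CrawlFlowSketch, ParseResult, PAGES, and the in-memory "site" are hypothetical stand-ins, the loop is single-threaded for clarity (the README lists multithreaded execution as a feature), and the `seen` set that skips already-queued URLs is an assumption of this sketch rather than documented framework behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a crawl loop; none of these types are the Elves classes.
class CrawlFlowSketch {

    /** What one parse call returns: the scraped items plus any follow-up URLs. */
    static final class ParseResult {
        final List<String> items;
        final List<String> nextUrls;

        ParseResult(List<String> items, List<String> nextUrls) {
            this.items = items;
            this.nextUrls = nextUrls;
        }
    }

    /** Fake "site" used instead of real HTTP: URL -> titles on the page and next-page links. */
    static final Map<String, ParseResult> PAGES = new HashMap<>();
    static {
        PAGES.put("/tag/a?page=1", new ParseResult(
                Arrays.asList("Movie 1", "Movie 2"),
                Collections.singletonList("/tag/a?page=2")));
        PAGES.put("/tag/a?page=2", new ParseResult(
                Collections.singletonList("Movie 3"),
                Collections.<String>emptyList()));
    }

    /** Stands in for download + parse: looks the page up instead of fetching it. */
    static ParseResult parse(String url) {
        ParseResult page = PAGES.get(url);
        return page != null
                ? page
                : new ParseResult(Collections.<String>emptyList(), Collections.<String>emptyList());
    }

    /** Drains the request queue, handing every non-empty item list to the pipeline. */
    static void crawl(List<String> startUrls, Consumer<List<String>> pipeline) {
        Queue<String> pending = new ArrayDeque<>(startUrls);
        Set<String> seen = new HashSet<>(startUrls);
        while (!pending.isEmpty()) {
            ParseResult result = parse(pending.poll());
            if (!result.items.isEmpty()) {
                pipeline.accept(result.items);
            }
            for (String next : result.nextUrls) {
                if (seen.add(next)) { // skip URLs that were already queued (assumption of this sketch)
                    pending.add(next);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> saved = new ArrayList<>();
        crawl(Collections.singletonList("/tag/a?page=1"), saved::addAll);
        System.out.println(saved); // prints [Movie 1, Movie 2, Movie 3]
    }
}
```

Here the pipeline is a plain Consumer that collects titles into a list; in the DoubanSpider example the same role is played by the Pipeline lambda that logs each item.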

Crawler examples

License

MIT

More Repositories

  1. java-bible (HTML, 2,984 stars): 🍌 My technical notes
  2. 30-seconds-of-java8 (Java, 2,368 stars): 30 seconds to collect useful Java 8 snippets.
  3. wechat-api (Java, 1,814 stars): 🗯 wechat-api by java7.
  4. learn-java8 (Java, 1,382 stars): 💖 Source code for the video course 《跟上 Java 8》 (Keeping Up with Java 8)
  5. oh-my-email (Java, 711 stars): 📪 Possibly the smallest Java email-sending library, with support for CC, attachments, templates, and more.
  6. profit (Java, 404 stars): 🤔 biezhi's online tipping system; start your career in panhandling.
  7. write-readable-code (Java, 401 stars): 🗾 Code repository for The Art of Readable Code
  8. anima (Java, 227 stars): Minimal database operation library.
  9. java-library-examples (Java, 215 stars): 💪 example of common used libraries and frameworks, programming required, don't fork man.
  10. excel-plus (Java, 191 stars): ❇️ Improve the productivity of the Excel operation library. https://hellokaton.github.io/excel-plus/#/
  11. gorm-paginator (Go, 154 stars): gorm pagination extension
  12. java-tips (Java, 149 stars): 🍓 Java programming tips and best practices
  13. gitmoji-plugin (Kotlin, 112 stars): Choose the right emoji emoticon for git commit, make git log commit more interesting.
  14. java11-examples (Java, 98 stars): java 11 example code
  15. telegram-bot-api (Java, 85 stars): 🤖 telegram bot api by java, help you quickly create a little robot.
  16. code-fonts (PHP, 63 stars): 8 popular coding fonts for programming
  17. lets-golang (Go, 62 stars): Source code repository for 《给 Java 程序员的 Go 私房菜》 (Go Home Cooking for Java Programmers)
  18. oh-mybatis (Java, 60 stars): 🎈 A simple web app to generate mybatis code.
  19. terse (CSS, 59 stars): 🍋 my typecho blog theme, Concise UI
  20. hellokaton.com (HTML, 46 stars): ✍️ https://hellokaton.com source code
  21. webp-io (Java, 44 stars): 🌚 general format images and webp transform each other
  22. redis-dqueue (Java, 44 stars): redis base delay queue
  23. oh-my-request (Java, 43 stars): 🔮 simple request library by java8
  24. oh-my-session (Java, 42 stars): 🍖 distributed session storage scheme, using redis to store data.
  25. swing-generate (Java, 36 stars): 🙊 Swing development code generator
  26. ss-panel (HTML, 34 stars): 🐒 java shadowsocks panel
  27. industry-glossary (33 stars): A collection of industry terms in English, so naming is no longer hard.
  28. freechat (CSS, 29 stars): 🐶 online anonymous chat application.
  29. grice (CSS, 29 stars): 🐜 A simple web document engine written by java.
  30. goinx (Go, 28 stars): 💞 Multi-domain proxy by golang, fuck gfw proxy
  31. springmvc-plupload (JavaScript, 27 stars): 🍃 Springmvc and servlet shard to upload demo
  32. lowb (JavaScript, 27 stars): 🤦🏻‍♂️ teach you to develop command-line source code using NodeJS
  33. geekbb (JavaScript, 26 stars): 😎 Geek dev club.
  34. keeper (Java, 25 stars)
  35. mini-jq (JavaScript, 21 stars): 🍣 minimal jquery, you don't need jquery!
  36. spring-boot-examples (Java, 20 stars): 🕷 spring boot and spring cloud examples
  37. bye-2017 (CSS, 19 stars): 👋 Bye! my 2017
  38. weather-cli (Go, 18 stars): 💊 weather command-line programs written in Golang.
  39. go-examples (Go, 16 stars): 🍄 learning golang code
  40. oh-my-jvm (Go, 16 stars): ☕️ using golang write jvm
  41. nice (Java, 16 stars): 🐹 A clean, simple photo-sharing social app built with blade
  42. blog (SCSS, 16 stars): blog source code
  43. learn-cute-netty (Java, 15 stars): Source code repository for 《可爱的Netty》 (Lovely Netty)
  44. eve (Go, 13 stars): 👻 everyday explore, Github / HackNews / V2EX / Medium / Product Hunt.
  45. blade-cli (Go, 13 stars): 🐳 blade mvc cli application
  46. agon (Go, 12 stars): 🦉 my golang utilities, log json config and other
  47. wechat-api-examples (Java, 12 stars): wechat-api examples, used java8.
  48. telegram-lottery (Go, 11 stars): 🤖 A telegram lottery bot that helps you achieve random rewards for group activities.
  49. java8-best-practices (Java, 11 stars): java8 best practices source code.
  50. runcat_pyside (Python, 10 stars): Python PySide6 of RunCat_for_windows
  51. dbkit (Go, 10 stars): 🚧 convenient implement of Database timing backup tool.
  52. lets-python (Python, 9 stars): 🍭 learning Python code
  53. java-2048 (Java, 9 stars): 🎲 Java swing of 2048
  54. mapper-spring-boot-starter (Java, 9 stars): Spring Boot starter for the general-purpose Mapper (通用 mapper) plus pagehelper
  55. blade-lattice (Java, 9 stars): 🔐 lightweight authentication and authorization, born for Blade.
  56. writty (Java, 9 stars): 🌝 A garbage writing platform, has given up.
  57. java-clojure-syntax-comparison (9 stars): ⚔️ Comparison of some code snippets in the Java and Clojure.
  58. emojis (8 stars): this is emoji images
  59. ppocr-api (Python, 7 stars): Docker image for the PaddlePaddle (飞桨) OCR API
  60. show-code (JavaScript, 7 stars): 🌝 show github repositories card with html
  61. probe (Java, 7 stars): exploring the world beyond the wall, netty4 based proxy service.
  62. oh-my-monitor (Java, 6 stars): 🌈 C/S mode server monitoring (discontinuation of maintenance)
  63. moe (Go, 5 stars): 😛 simple cli spinner by golang.
  64. witty (Go, 5 stars): 🐝 witty is a smart golang robot
  65. gow (Go, 5 stars): 🙄 gow!!! a micro go web framework.
  66. primer-series (CSS, 5 stars): Shirt generator for the "From Beginner to Giving Up" series
  67. spring-boot-starter-jetbrick (5 stars): 🐥 spring-boot starter for jetbrick-template
  68. fastapi-tutorial (Python, 5 stars): Here are some sample programs for me to learn fastapi.
  69. helidon-examples (Java, 5 stars): 💣 helidon framework example
  70. hellokaton (3 stars)
  71. bar (Go, 3 stars): 🍺 static website builder written by golang
  72. tomato-clock (JavaScript, 2 stars)
  73. findor (2 stars): 🌟 let the world find you !
  74. vite-react-mpa-example (TypeScript, 2 stars): React + Vite multi page application example
  75. typecho-theme-modernist (PHP, 2 stars): a typecho theme modernist
  76. homebrew-tap (Ruby, 1 star): Homebrew tap for biezhi/eve
  77. clojure-todo-list (Clojure, 1 star): The todo-list application that used clojure + ring
  78. juejun-alfred-workflow (Python, 1 star): a simple juejin alfred workflow example