  • Stars
    636
  • Rank 70,723 (Top 2 %)
  • Language
    Java
  • License
    MIT License
  • Created about 11 years ago
  • Updated 11 months ago


Repository Details

Jsoup study notes, with some study code and comments added.

Jsoup Study Notes

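Jsoup is an HTML parsing tool for Java. It supports selecting DOM elements with CSS selectors, and it can also filter HTML text to prevent XSS attacks.

I studied Jsoup in order to better develop my other project, the crawler framework webmagic. To learn it thoroughly, I made myself write these articles in a rigorous, well-structured way.

The code comes from https://github.com/jhy/jsoup, with some Chinese comments and example code added.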


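Outline

  1. Overview

  2. DOM-related objects

  3. Document output

  4. The HTML parser

    1. Parsing and state machine basics
    2. Lexical analysis: the Tokenizer
    3. Syntax checking and DOM tree construction
  5. CSS Selector

  6. Defending against XSS attacks

  7. Adding XPath selection to Jsoup

    Jsoup has no XPath support by default, so I wrote a project, Xsoup, which can select from HTML text using XPath. The XPath extractor most commonly used in Java is HtmlCleaner; Xsoup is about twice as fast.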
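
Item 4 of the outline covers the state machine behind the parser and its Tokenizer. As a minimal sketch of that idea, assuming a drastically simplified HTML with no attributes, comments, or entities, the toy tokenizer below walks the input one character at a time and switches between three states. The class `TinyTokenizer`, its state names, and the `START:`/`TEXT:`/`END:` token labels are hypothetical and chosen for illustration; they are not taken from jsoup, whose real tokenizer has many more states.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy HTML tokenizer driven by a three-state machine.
 * Illustrative only: this is NOT jsoup's Tokenizer, and it ignores
 * attributes, comments, entities, and most of the HTML specification.
 */
class TinyTokenizer {

    /** States of the machine; the names are hypothetical, chosen for this sketch. */
    enum State { DATA, TAG_OPEN, TAG_NAME }

    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        State state = State.DATA;
        boolean closing = false;

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case DATA:
                    if (c == '<') {
                        // Emit any pending text, then start reading a tag.
                        if (buf.length() > 0) {
                            tokens.add("TEXT:" + buf);
                            buf.setLength(0);
                        }
                        closing = false;
                        state = State.TAG_OPEN;
                    } else {
                        buf.append(c);
                    }
                    break;
                case TAG_OPEN:
                    // The first character after '<' decides start tag vs. end tag.
                    if (c == '/') {
                        closing = true;
                    } else {
                        buf.append(c);
                    }
                    state = State.TAG_NAME;
                    break;
                case TAG_NAME:
                    if (c == '>') {
                        // The tag is complete: emit it and return to plain text.
                        tokens.add((closing ? "END:" : "START:") + buf);
                        buf.setLength(0);
                        state = State.DATA;
                    } else {
                        buf.append(c);
                    }
                    break;
            }
        }
        // Flush trailing text that was never followed by a tag.
        if (state == State.DATA && buf.length() > 0) {
            tokens.add("TEXT:" + buf);
        }
        return tokens;
    }
}
```

For example, `TinyTokenizer.tokenize("<p>Hi</p>")` returns the list `[START:p, TEXT:Hi, END:p]`. Keeping the current state in a single variable and dispatching on it per character is the design choice worth noticing: each new construct becomes one more state and its transitions.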


License:

The code is released under the MIT License.

The documentation is released under the CC BY-NC license.


More Repositories

  1. webmagic (Java, 11,370 stars): A scalable web crawler framework for Java.
  2. tiny-spring (Java, 4,037 stars): A tiny IoC container modeled on Spring.
  3. netty-learning (Java, 3,546 stars): Netty learning.
  4. xsoup (Java, 464 stars): When jsoup meets XPath.
  5. hello-design-pattern (Java, 390 stars): Hello world using all 23 kinds of GoF design patterns.
  6. blackhole (Java, 242 stars): A simple non-recursive DNS server. It can easily be configured to intercept certain kinds of requests to one address.
  7. os-learning (C, 23 stars): A Java coder's study of the Linux kernel.
  8. hostd (JavaScript, 19 stars): Tools to customize your domain resolution rules. Uses BlackHole as the DNS server.
  9. lucene-learning (14 stars): Lucene learning.
  10. pdnsd (C, 14 stars): Fork of pdnsd https://gitorious.org/pdnsd
  11. netty-servlet (Java, 13 stars): A tiny servlet container using netty.
  12. spring-learning (Java, 12 stars): Notes on reading the Spring source code, targeting version 2.5.6.
  13. mocksocks (Java, 12 stars): A SOCKS proxy for network monitoring.
  14. express.java (Java, 12 stars): A tiny RESTful web framework with an embedded server. Used instead of JMX for cross-language communication.
  15. moonlink (Lua, 10 stars): A short URL service based on OpenResty and Redis.
  16. jsocks (Java, 10 stars): SOCKS server in Java. Mirror of jsocks on Google Code, with the build changed from Ant to Maven.
  17. labpages (Ruby, 8 stars): Pages hooks for GitLab.
  18. blackhole-bin (Shell, 7 stars): Binary distribution backup of blackhole.
  19. FizzBuzzWhizz (Java, 5 stars): Practice in OOP for the ThoughtWorks quiz FizzBuzzWhizz.
  20. termblog (Java, 5 stars): My oschina blog with jsterm.
  21. classic-algorithms (Java, 5 stars): Classic algorithms implemented in Java. Just for practice.
  22. tavern (Java, 5 stars): A tool for modularizing and integrating web projects based on jar packages.
  23. monkeysocks (Java, 5 stars): A SOCKS proxy in Java. It can be used to record network traffic and replay it for tests.
  24. freemarker-learning (Java, 5 stars): Freemarker study notes.
  25. imgcrawler (5 stars): imgcrawler is a tool that crawls search results from e-commerce sites and displays them together on one web page. Its purpose? No idea; it was actually a training assignment, uploaded because the implementation turned out fairly complete.
  26. code4craft.github.com (HTML, 4 stars): Life is to explore.
  27. wifesays (Java, 4 stars): Wifesays is a socket listener in a Java program. It listens to what the wife says and notifies all the workers!
  28. abc (Java, 4 stars): 'A'nother 'B'ean 'C'opier.
  29. reviewbot (JavaScript, 4 stars): A foolproofing tool for GitLab that automatically fixes your silly code.
  30. soa-research (3 stars): Research on service governance in SOA environments.
  31. xpathmagic (3 stars): A Chrome plugin to get the XPath of elements.
  32. groovy-learning (Groovy, 3 stars): Practice code in Groovy.
  33. tinycat (3 stars): A tiny web container modeled on Tomcat.
  34. leetcode (Java, 3 stars): Solutions for https://oj.leetcode.com/
  35. bigdata-learning (3 stars)
  36. forger (Java, 3 stars): Dynamic Java object generator with template class and configuration.
  37. dp-idea (Java, 3 stars): IDEA plugin for dianping.
  38. coursera (2 stars): Just coursera notes.
  39. exciting (JavaScript, 2 stars): A Chrome plugin to watch your new stars! Exciting!
  40. java-facilities (Java, 2 stars): Examples of Java facilities, such as JVM serializers and template engines.
  41. MemoriesOn (2 stars): A place to record things seen and learned, similar to http://see.sl088.com/
  42. jdk-learning (Java, 2 stars): An introduction to learning Java concurrency.
  43. daogen (JavaScript, 2 stars): DAO generator for Java.
  44. codecraft (Java, 2 stars): codecraft repo
  45. mockmoon (Lua, 2 stars): A simple Lua extension based on OpenResty. It can mock a specific file to a specific URL.
  46. ibatis-plugin (Java, 2 stars): iBATIS plugin aims to accelerate iBATIS development in IntelliJ IDEA. Mirror of https://code.google.com/p/ibatis-plugin .
  47. js-learning (1 star)
  48. gugugua-dianconvertor (Groovy, 1 star): gugugua-dianconvertor is a simple tool that converts a diandian backup XML file to a WordPress XML file. Currently only text-type files are supported.
  49. phantomJava (1 star): A headless WebKit scriptable with a Java API.
  50. dp-alfred-workflow (1 star): Alfred workflow for dianping.
  51. sqlparser (Java, 1 star): A simple sqlparser.
  52. mocksocks-html (JavaScript, 1 star): Web panel of mocksocks with fashionable front-end tech.
  53. csapp-learning (1 star): Reading notes on Computer Systems: A Programmer's Perspective.
  54. war4e (Java, 1 star): @deprecated, see jetty-runner http://www.eclipse.org/jetty/documentation/current/jetty-runner.html
  55. my-tech-radar (1 star): My new technology radar.
  56. textmagic (Java, 1 star): Textmagic is a text extractor with a powerful expression language for configuration.
  57. hessian-blacklist (Java, 1 star): Some classes that cannot be serialized or deserialized correctly in Hessian2.
  58. hello-ai (1 star)
  59. spring-practice (Java, 1 star): My Spring best practices.
  60. imcaptcha (Java, 1 star): Captcha by image distortion.