A WHATWG-compliant HTML5 tokenizer and tag soup parser

html5gum

html5gum is a WHATWG-compliant HTML tokenizer.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

// Reading from an in-memory string cannot fail, so `infallible()` yields
// tokens directly instead of `Result`s.
for token in Tokenizer::new(html).infallible() {
    match token {
        // Tag names and text are byte strings; decode them lossily as UTF-8.
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

// The extra whitespace inside `<title   >` is not part of the tag name.
assert_eq!(new_html, "<title>hello world</title>");

What a tokenizer does and what it does not do

html5gum fully implements section 13.2.5 ("Tokenization") of the WHATWG HTML spec: it can tokenize HTML documents and passes html5lib's tokenizer test suite. Since it is just a tokenizer, this means:

  • html5gum does not implement charset detection. This implementation takes and returns bytes, but assumes UTF-8. It recovers gracefully from invalid UTF-8.
  • html5gum does not correct mis-nested tags.
  • html5gum does not recognize implicitly self-closing elements like <img>. As a tokenizer, it simply emits a start tag token. It does, however, emit a self-closing tag for <img .. />.
  • html5gum doesn't implement the DOM, and unfortunately, in the HTML spec, constructing the DOM ("tree construction") influences how tokenization is done. For an example of the problems this causes, see this example code.
  • html5gum does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future, see issue 21.

With those caveats in mind, html5gum can tokenize pretty much anything that browsers can.
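Because <img> arrives as an ordinary start tag, a consumer that tracks open elements has to apply the void-element rule itself. The following is a minimal sketch of that idea, not part of html5gum: `SketchToken`, `is_void_element` and `open_elements` are hypothetical names invented for illustration, and the end-tag handling is a naive simplification rather than the spec's tree construction algorithm. The void-element list is taken from the HTML spec.

```rust
/// Void elements per the HTML spec: they never have an end tag.
const VOID_ELEMENTS: &[&[u8]] = &[
    b"area", b"base", b"br", b"col", b"embed", b"hr", b"img",
    b"input", b"link", b"meta", b"source", b"track", b"wbr",
];

/// Case-insensitive check, since HTML tag names are ASCII case-insensitive.
fn is_void_element(name: &[u8]) -> bool {
    VOID_ELEMENTS.iter().any(|v| v.eq_ignore_ascii_case(name))
}

/// Hypothetical token type for this sketch (NOT html5gum's `Token`).
enum SketchToken<'a> {
    Start(&'a [u8]),
    End(&'a [u8]),
}

/// Returns the names of the elements still open once the stream ends.
fn open_elements<'a>(tokens: &[SketchToken<'a>]) -> Vec<&'a [u8]> {
    let mut stack: Vec<&'a [u8]> = Vec::new();
    for token in tokens {
        match token {
            SketchToken::Start(name) => {
                // Void elements never stay open, so they are not pushed.
                if !is_void_element(name) {
                    stack.push(*name);
                }
            }
            SketchToken::End(name) => {
                // Naive recovery (an assumption of this sketch): close the
                // nearest matching open element and everything inside it.
                // Unmatched end tags are ignored.
                if let Some(pos) = stack
                    .iter()
                    .rposition(|open| open.eq_ignore_ascii_case(name))
                {
                    stack.truncate(pos);
                }
            }
        }
    }
    stack
}
```

For the token sequence corresponding to `<p><img><br></p><div>`, this leaves only `div` open. A real tree builder applies many more rules, which is exactly the work html5gum leaves to its caller.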

The Emitter trait

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

  • Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

  • Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is not of interest to you, you can implement the respective trait methods as no-ops and avoid any overhead from creating plaintext tokens.

See the custom_emitter example for what this looks like in practice.
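To make the idea concrete without reproducing the real API, here is a self-contained sketch of the callback pattern. Everything in it is an assumption made for illustration: `ToyEmitter` is a hypothetical, heavily simplified stand-in (html5gum's actual Emitter trait has more methods and different signatures), and `drive` is a toy scanner, not a spec-compliant tokenizer.

```rust
/// Hypothetical, simplified callback interface (NOT html5gum's `Emitter`).
trait ToyEmitter {
    fn start_tag(&mut self, name: &[u8]);
    fn end_tag(&mut self, name: &[u8]);
    fn text(&mut self, bytes: &[u8]);
}

/// Counts `<a>` start tags and ignores everything else.
struct AnchorCounter {
    anchors: usize,
}

impl ToyEmitter for AnchorCounter {
    fn start_tag(&mut self, name: &[u8]) {
        if name.eq_ignore_ascii_case(b"a") {
            self.anchors += 1;
        }
    }
    // No-ops: nothing is stored or allocated for end tags or text.
    fn end_tag(&mut self, _name: &[u8]) {}
    fn text(&mut self, _bytes: &[u8]) {}
}

/// Toy driver that hands borrowed slices to the emitter. It only recognizes
/// `<name ...>` and `</name>` and is far from a real HTML tokenizer.
fn drive<E: ToyEmitter>(mut rest: &[u8], emitter: &mut E) {
    while !rest.is_empty() {
        if rest[0] == b'<' {
            if let Some(end) = rest.iter().position(|&b| b == b'>') {
                let inner = &rest[1..end];
                let (is_end, body) = match inner.strip_prefix(b"/") {
                    Some(stripped) => (true, stripped),
                    None => (false, inner),
                };
                // The tag name ends at the first whitespace byte.
                let name_len = body
                    .iter()
                    .position(|b| b.is_ascii_whitespace())
                    .unwrap_or(body.len());
                let name = &body[..name_len];
                if is_end {
                    emitter.end_tag(name);
                } else {
                    emitter.start_tag(name);
                }
                rest = &rest[end + 1..];
                continue;
            }
        }
        // Text runs up to the next '<'; always consume at least one byte.
        let len = rest[1..]
            .iter()
            .position(|&b| b == b'<')
            .map_or(rest.len(), |p| p + 1);
        emitter.text(&rest[..len]);
        rest = &rest[len..];
    }
}
```

Driving an `AnchorCounter` over `<p>See <a href="x">this</a> and <A>that</A>.</p>` leaves `anchors` at 2 without building a single token value. The design point is that the driver passes borrowed byte slices into the callbacks, so the implementor alone decides what, if anything, gets copied or stored.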

Other features

  • No unsafe Rust
  • The only dependency is jetscii, which can be disabled via crate features (see Cargo.toml)

Alternative HTML parsers

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

  • use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For some (rather large) set of HTML input this works well (particularly quick-xml can be configured to be very lenient about parsing errors) and parsing speed is stellar. But neither can parse all HTML.

    For my own use case, html5gum is about 2x slower than quick-xml.

  • use html5ever's own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own use case (10-15x slower than quick-xml).

  • use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn't manage to get working for my use case.

Etymology

Why is this library called html5gum?

  • G.U.M: Giant Unreadable Match-statement

  • <insert "how it feels to ~~chew 5 gum~~ parse HTML" meme here>

License

Licensed under the MIT license, see ./LICENSE.
