  • Stars
    428
  • Rank 101,481 (Top 2 %)
  • Language
    Rust
  • License
    Apache License 2.0
  • Created about 6 years ago
  • Updated 6 months ago

roxmltree

Rust 1.36+

Represents an XML 1.0 document as a read-only tree.

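// Find an element by id.
// `find` returns an Option, so it is unwrapped separately from the parse Result.
let doc = roxmltree::Document::parse("<rect id='rect1'/>").unwrap();
let elem = doc.descendants().find(|n| n.attribute("id") == Some("rect1")).unwrap();
assert!(elem.has_tag_name("rect"));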

Why read-only?

Because in some cases all you need is to retrieve some data from an XML document, and a read-only tree allows a lot of optimizations for such cases.

As for roxmltree, it's fast not only because it's read-only, but also because it uses xmlparser, which is many times faster than xml-rs. See the Performance section for details.

Parsing behavior

Sadly, XML can be parsed in many different ways. roxmltree tries to mimic the behavior of Python's lxml. But unlike lxml, roxmltree does support comments outside the root element.

For more details see docs/parsing.md.

Alternatives

Feature/Crate roxmltree libxml2 xmltree sxd-document
Element namespace resolving ✓ ✓ ✓ ~ [1]
Attribute namespace resolving ✓ ✓ ✓
Entity references ✓ ✓ × ×
Character references ✓ ✓ ✓ ✓
Attribute-Value normalization ✓ ✓
Comments ✓ ✓ ✓
Processing instructions ✓ ✓ ✓ ✓
UTF-8 BOM ✓ ✓ × ×
Non UTF-8 input ✓
Complete DTD support ✓
Position preserving [2] ✓ ✓
HTML support ✓
Tree modification ✓ ✓ ✓
Writing ✓ ✓ ✓
No unsafe ✓ ✓
Language Rust C Rust Rust
Size overhead [4] ~55KiB ~1.4MiB [5] ~78KiB ~102KiB
Dependencies 1 ? [5] 2 2
Tested version 0.18.0 2.9.8 0.10.2 0.3.2
License MIT / Apache-2.0 MIT MIT MIT

Legend:

  • ✓ - supported
  • × - parsing error
  • ~ - partial
  • nothing - not supported

Notes:

  1. No default namespace propagation.
  2. roxmltree keeps all node and attribute positions in the original document, so you can easily retrieve them if you need them. See examples/print_pos.rs for details.
  3. In the memchr crate.
  4. Binary size overhead according to cargo-bloat.
  5. Depends on build flags.

There are also the elementtree and treexml crates, but they have been abandoned for a long time.

Performance

Parsing

test huge_roxmltree      ... bench:   3,152,020 ns/iter (+/- 38,556)
test huge_libxml         ... bench:   6,779,906 ns/iter (+/- 184,744)
test huge_sdx_document   ... bench:   8,289,337 ns/iter (+/- 378,131)
test huge_xmltree        ... bench:  45,309,549 ns/iter (+/- 1,591,562)

test large_roxmltree     ... bench:   1,568,688 ns/iter (+/- 9,956)
test large_libxml        ... bench:   3,199,587 ns/iter (+/- 139,486)
test large_sdx_document  ... bench:   3,731,708 ns/iter (+/- 92,787)
test large_xmltree       ... bench:  15,605,566 ns/iter (+/- 331,504)

test medium_roxmltree    ... bench:     430,778 ns/iter (+/- 18,070)
test medium_libxml       ... bench:     932,408 ns/iter (+/- 8,763)
test medium_sdx_document ... bench:   1,452,152 ns/iter (+/- 54,983)
test medium_xmltree      ... bench:   4,903,558 ns/iter (+/- 116,875)

test tiny_roxmltree      ... bench:       2,630 ns/iter (+/- 41)
test tiny_libxml         ... bench:       9,113 ns/iter (+/- 183)
test tiny_sdx_document   ... bench:      10,388 ns/iter (+/- 116)
test tiny_xmltree        ... bench:      22,067 ns/iter (+/- 228)

roxmltree uses xmlparser internally, sxd-document uses its own implementation, and xmltree uses xml-rs. Here is a comparison between xmlparser, xml-rs and quick-xml:

test huge_xmlparser      ... bench:   1,744,585 ns/iter (+/- 28,509)
test huge_quick_xml      ... bench:   2,818,954 ns/iter (+/- 66,923)
test huge_xmlrs          ... bench:  41,072,412 ns/iter (+/- 519,803)

test large_xmlparser     ... bench:     756,125 ns/iter (+/- 13,995)
test large_quick_xml     ... bench:   1,401,189 ns/iter (+/- 28,295)
test large_xmlrs         ... bench:  12,920,333 ns/iter (+/- 143,508)

test medium_quick_xml    ... bench:     216,080 ns/iter (+/- 5,479)
test medium_xmlparser    ... bench:     258,587 ns/iter (+/- 3,684)
test medium_xmlrs        ... bench:   4,629,016 ns/iter (+/- 109,023)

test tiny_xmlparser      ... bench:       1,087 ns/iter (+/- 16)
test tiny_quick_xml      ... bench:       2,420 ns/iter (+/- 51)
test tiny_xmlrs          ... bench:      18,974 ns/iter (+/- 162)
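
To put the figures above in relative terms, the following minimal sketch divides one ns/iter value by another. The helper names `slowdown` and `huge_ratios` are made up for this illustration, and only the "huge" rows from the listings above are used.

```rust
/// How many times slower `other_ns` is than `baseline_ns` (both in ns/iter).
fn slowdown(baseline_ns: u64, other_ns: u64) -> f64 {
    other_ns as f64 / baseline_ns as f64
}

/// Ratios for the "huge" benchmarks listed above, relative to roxmltree
/// (tree crates) and to xmlparser (low-level parsers).
fn huge_ratios() -> [(&'static str, f64); 4] {
    [
        ("libxml2 vs roxmltree", slowdown(3_152_020, 6_779_906)),
        ("sxd-document vs roxmltree", slowdown(3_152_020, 8_289_337)),
        ("xmltree vs roxmltree", slowdown(3_152_020, 45_309_549)),
        ("xml-rs vs xmlparser", slowdown(1_744_585, 41_072_412)),
    ]
}
```

For the huge input this works out to roughly 2.2x, 2.6x, 14.4x and 23.5x respectively. The ratios differ for the other input sizes, so treat them as a rough guide only.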

Iteration


test roxmltree_iter_descendants_expensive   ... bench:     255,261 ns/iter (+/- 1,424)
test xmltree_iter_descendants_expensive     ... bench:     354,316 ns/iter (+/- 3,383)

test roxmltree_iter_descendants_inexpensive ... bench:      20,736 ns/iter (+/- 218)
test xmltree_iter_descendants_inexpensive   ... bench:     125,849 ns/iter (+/- 1,200)

test roxmltree_iter_children                ... bench:       1,409 ns/iter (+/- 54)

Here, expensive and inexpensive refer to the matching done on each element. The expensive benchmarks search for any node in the document that contains a given string. The inexpensive benchmarks search for any element with a particular name.

Notes

The benchmarks were taken on an Apple M1 Pro. You can run the benchmarks yourself with cargo bench in the benches dir.

  • Since each library supports a different subset of XML, benchmarking is a bit pointless.
  • Tree crates may use different xml-rs crate versions.
  • We bench libxml2 using the rust-libxml wrapper crate.
  • quick-xml is faster than xmlparser because it's more forgiving of its input, while xmlparser is very strict and does a lot of checks, which are expensive. So the performance difference is mainly due to validation.

Memory Overhead

roxmltree tries to use as little memory as possible to allow parsing very large (multi-GB) XML files.

The peak memory usage doesn't directly correlate with the file size but rather with the number of nodes and attributes a file has, how many attributes had to be normalized (i.e. allocated), and how many text nodes had to be preprocessed (i.e. allocated).

roxmltree never allocates element and attribute names, processing instructions and comments.

By disabling the positions feature, you can shave 8 bytes off each node and attribute.

On average, the overhead is around 6-8x the file size. For example, our 1.1GB sample XML will peak at 7.6GB RAM with default features enabled and at 6.8GB RAM when positions is disabled.
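
The numbers above can be checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes 1 GB = 10^9 bytes and that the whole 0.8GB difference comes from the 8 bytes saved per node and attribute. The function names are made up for this example.

```rust
/// Ratio of peak memory to input file size.
fn overhead_ratio(peak_gb: f64, file_gb: f64) -> f64 {
    peak_gb / file_gb
}

/// Rough number of nodes plus attributes implied by the memory saved
/// when per-item positions are dropped (assumes a fixed saving per item).
fn implied_item_count(saved_bytes: u64, bytes_per_item: u64) -> u64 {
    saved_bytes / bytes_per_item
}
```

For the 1.1GB sample, overhead_ratio(7.6, 1.1) is about 6.9 and overhead_ratio(6.8, 1.1) is about 6.2, both within the quoted 6-8x range. Under the stated assumptions, implied_item_count(800_000_000, 8) suggests on the order of 100 million nodes and attributes in that file.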

Safety

  • This library must not panic. Any panic should be considered a critical bug and reported.
  • This library forbids unsafe code.

Non-goals

  • Complete XML support.
  • Tree modification and writing.
  • XPath/XQuery.

API

This library uses an idiomatic Rust API based on iterators. If you are more familiar with browser/JS DOM APIs, you can check out tests/dom-api.rs to see how DOM calls can be converted into their Rust equivalents.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
