  • Language
    Erlang
  • License
    MIT License
  • Created over 7 years ago
  • Updated almost 3 years ago

erlxml - Erlang XML parsing library based on pugixml

Implementation notes

According to published benchmarks, pugixml is the fastest DOM parser available for C++. Streaming parsing is implemented by splitting the stream into independent stanzas, each of which is parsed using pugixml. The splitting algorithm is fast, but to keep it as simple as possible it currently imposes some limitations on streaming mode:

  • CDATA is not supported
  • comments containing special XML characters are not supported
  • DOCTYPE is not supported

All of the above limitations apply only to streaming mode, not to DOM parsing mode.

Getting started:

DOM parsing

erlxml:parse(<<"<foo attr1='bar'>Some Value</foo>">>).

Which results in

{ok,{xmlel,<<"foo">>,
           [{<<"attr1">>,<<"bar">>}],
           [{xmlcdata,<<"Some Value">>}]}}
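
Because the result is an ordinary Erlang term, it can be inspected with plain pattern matching. The following is a minimal sketch that assumes the {xmlel, Name, Attributes, Children} shape shown above; the module name xmlel_util and both functions are hypothetical helpers written for illustration, not part of erlxml.

```erlang
-module(xmlel_util).
-export([attr/2, text/1]).

%% Look up an attribute value by name.
%% Returns undefined when the attribute is absent.
attr(Name, {xmlel, _Tag, Attrs, _Children}) ->
    proplists:get_value(Name, Attrs).

%% Concatenate the character data found directly under an element.
%% Nested child elements are ignored in this sketch.
text({xmlel, _Tag, _Attrs, Children}) ->
    iolist_to_binary([Data || {xmlcdata, Data} <- Children]).
```

With the term returned by the parse call above bound to El, xmlel_util:attr(<<"attr1">>, El) yields <<"bar">> and xmlel_util:text(El) yields <<"Some Value">>.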

Generate an XML document from Erlang terms

Xml = {xmlel,<<"foo">>,
    [{<<"attr1">>,<<"bar">>}],  % Attributes
    [{xmlcdata,<<"Some Value">>}]   % Children (elements and character data)
},
erlxml:to_binary(Xml).

Which results in

<<"<foo attr1=\"bar\">Some Value</foo>">>

Streaming parsing

Chunk1 = <<"<stream><foo attr1=\"bar">>,
Chunk2 = <<"\">Some Value</foo></stream>">>,
{ok, Parser} = erlxml:new_stream(),
{ok,[{xmlstreamstart,<<"stream">>,[]}]} = erlxml:parse_stream(Parser, Chunk1),
Rs = erlxml:parse_stream(Parser, Chunk2),
{ok,[{xmlel,<<"foo">>,
        [{<<"attr1">>,<<"bar">>}],
        [{xmlcdata,<<"Some Value">>}]},
     {xmlstreamend,<<"stream">>}]} = Rs.

Options

When you create a stream using new_stream/1 you can specify the following options:

  • stanza_limit - Specifies the maximum size a stanza can have. If the library parses more than this number of bytes without finding a stanza, it returns the error {error, {max_stanza_limit_hit, binary()}}. Example: {stanza_limit, 65000}. The default is 0, which means unlimited.

  • strip_non_utf8 - Strips all invalid UTF-8 characters from attribute values and node values. These values are considered user input and may contain malformed characters. The default is false.
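
The options above could be used roughly as follows. This is a sketch under stated assumptions: it assumes new_stream/1 takes the options as a list of tuples, and that parse_stream/2 returns either {ok, Events} or an {error, Reason} tuple as described above. The module stream_example and its functions are hypothetical and not part of erlxml.

```erlang
-module(stream_example).
-export([parse_chunks/1, describe/1]).

%% Feed a list of binary chunks through a single stream parser.
%% Assumption: options are passed to new_stream/1 as a list of tuples.
parse_chunks(Chunks) ->
    {ok, Parser} = erlxml:new_stream([{stanza_limit, 65000}]),
    feed(Parser, Chunks, []).

%% Collect the events from every chunk and stop at the first error.
feed(_Parser, [], Acc) ->
    {ok, lists:append(lists:reverse(Acc))};
feed(Parser, [Chunk | Rest], Acc) ->
    case erlxml:parse_stream(Parser, Chunk) of
        {ok, Events} -> feed(Parser, Rest, [Events | Acc]);
        {error, _} = Error -> describe(Error)
    end.

%% Turn an error result into a simple tagged value.
describe({error, {max_stanza_limit_hit, Data}}) ->
    %% Data is the binary reported together with the limit error.
    {too_large, byte_size(Data)};
describe({error, Reason}) ->
    {failed, Reason}.
```

Stopping at the first error is a design choice of this sketch; whether a parser can be reused after an error is not stated here, so the example does not try.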

Benchmarks

The benchmark code is inside the benchmark folder. You need to get exml from Erlang Solutions and fast_xml from ProcessOne as dependencies because all measurements are made against these libraries.

All tests are run at 3 concurrency levels (the number of Erlang processes spawned):

  • C1 (concurrency level 1)
  • C5 (concurrency level 5)
  • C10 (concurrency level 10)

DOM parsing

Parse the same stanza defined in benchmark/benchmark.erl 600000 times:

benchmark:bench_parsing(erlxml|exml, 600000, 1|5|10).

| Library  | C1 (ms)   | C5 (ms)  | C10 (ms) |
|----------|-----------|----------|----------|
| erlxml   | 1942.053  | 511.619  | 522.938  |
| exml     | 1847.879  | 523.957  | 567.417  |
| fast_xml | 21153.454 | 5584.026 | 5812.703 |

Note:

  • Starting with version 3.0.0, exml improved substantially by replacing Expat with RapidXML. Before this switch, the results were 26704.861 ms for C1, 7094.698 ms for C5 and 5812.703 ms for C10. The results are now comparable with erlxml.
  • The difference between erlxml and exml is so small that the two can be considered equivalent in speed.

Generate an XML document from Erlang terms

Encode the same Erlang term defined in benchmark/benchmark.erl 600000 times:

benchmark:bench_encoding(erlxml|exml, 600000, 1|5|10).

| Library  | C1 (ms)  | C5 (ms) | C10 (ms) |
|----------|----------|---------|----------|
| erlxml   | 2285.361 | 635.57  | 687.78   |
| exml     | 2035.966 | 571.194 | 603.406  |
| fast_xml | 1282.113 | 361.599 | 392.007  |

Streaming parsing

Loads all stanzas from a file and parses them 30000 times (the total processed in my test is around 1.38 GB):

benchmark_stream:bench(exml, "/Users/silviu/Desktop/example.txt", 30000, 1).
### 32933.493 ms 42.90 MB/sec total bytes processed: 1.38 GB
benchmark_stream:bench(exml, "/Users/silviu/Desktop/example.txt", 30000, 5).
### 9231.714 ms 153.05 MB/sec total bytes processed: 1.38 GB
benchmark_stream:bench(exml, "/Users/silviu/Desktop/example.txt", 30000, 10).
### 9693.043 ms 145.77 MB/sec total bytes processed: 1.38 GB

benchmark_stream:bench(erlxml, "/Users/silviu/Desktop/example.txt", 30000, 1). 
### 10580.888 ms 133.53 MB/sec total bytes processed: 1.38 GB
benchmark_stream:bench(erlxml, "/Users/silviu/Desktop/example.txt", 30000, 5).
### 2662.581 ms 530.66 MB/sec total bytes processed: 1.38 GB
benchmark_stream:bench(erlxml, "/Users/silviu/Desktop/example.txt", 30000, 10).
### 2579.559 ms 547.74 MB/sec total bytes processed: 1.38 GB

| Library | C1 (MB/s) | C5 (MB/s) | C10 (MB/s) |
|---------|-----------|-----------|------------|
| erlxml  | 133.53    | 530.66    | 547.74     |
| exml    | 42.90     | 153.05    | 145.77     |

Notes:

  • Starting with version 3.0.0, exml improved by around 27% after replacing Expat with RapidXML, but it is still well behind erlxml.
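
The MB/sec figures in the streaming runs follow from the elapsed time and the total bytes processed. The sketch below reproduces that arithmetic; it assumes 1 GB = 1024 MB, which is consistent with the reported numbers, and the module bench_math is a hypothetical helper that is not part of the benchmark code.

```erlang
-module(bench_math).
-export([mb_per_sec/2, speedup/2]).

%% Throughput in MB/s from a total size in GB and an elapsed time in ms.
%% Assumption: 1 GB = 1024 MB.
mb_per_sec(TotalGB, ElapsedMs) ->
    (TotalGB * 1024) / (ElapsedMs / 1000).

%% How many times faster run A is than run B, given elapsed times in ms.
speedup(ElapsedMsA, ElapsedMsB) ->
    ElapsedMsB / ElapsedMsA.
```

For the C1 runs, bench_math:mb_per_sec(1.38, 32933.493) gives roughly 42.9 MB/s for exml, and bench_math:speedup(10580.888, 32933.493) shows erlxml finishing about 3.1 times faster. The small gap to the reported figures comes from 1.38 GB being a rounded total.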
