  • Stars: 736
  • Rank: 59,169 (Top 2 %)
  • Language: Rust
  • License: MIT License
  • Created almost 2 years ago
  • Updated 3 months ago

markdown-rs

👉 Note: this is a new crate that reuses an old name. The old crate (0.3.0 and lower) has a bunch of problems. Make sure to use the new crate, currently in alpha at 1.0.0-alpha.11.

CommonMark compliant markdown parser in Rust with ASTs and extensions.

Feature highlights

  • compliant (100% to CommonMark)
  • extensions (100% GFM, 100% MDX, frontmatter, math)
  • safe (100% safe Rust, also 100% safe HTML by default)
  • robust (2300+ tests, 100% coverage, fuzz testing)
  • ast (mdast)

When should I use this?

  • If you just want to turn markdown into HTML (with maybe a few extensions)
  • If you want to do really complex things with markdown

What is this?

markdown-rs is an open source markdown parser written in Rust. It’s implemented as a state machine (#![no_std] + alloc) that emits concrete tokens, so that every byte is accounted for, with positional info. The API then exposes this information as an AST, which is easier to work with, or it compiles directly to HTML.
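
To make the idea of a state machine that accounts for every byte more concrete, here is a minimal sketch. It is not the markdown-rs tokenizer: the names `Kind`, `Token`, `classify`, and `tokenize`, and the three-state toy grammar, are assumptions made up for illustration. It only shows the general technique of walking bytes and emitting a token with a byte range whenever the state changes.

```rust
/// Kinds of tokens in this toy grammar (hypothetical, not markdown-rs names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Marker,     // a run of `*` bytes
    Whitespace, // a run of spaces, tabs, or line endings
    Data,       // anything else
}

/// A concrete token: a kind plus the byte range it covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token {
    kind: Kind,
    start: usize, // inclusive byte offset
    end: usize,   // exclusive byte offset
}

/// Classify a single byte; the class doubles as the state of the machine.
fn classify(byte: u8) -> Kind {
    match byte {
        b'*' => Kind::Marker,
        b' ' | b'\t' | b'\n' | b'\r' => Kind::Whitespace,
        _ => Kind::Data,
    }
}

/// Walk the input one byte at a time and emit a token on every state change.
fn tokenize(input: &str) -> Vec<Token> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut start = 0;

    for index in 1..=bytes.len() {
        // Close the current token at the end of input or when the class changes.
        if index == bytes.len() || classify(bytes[index]) != classify(bytes[start]) {
            tokens.push(Token { kind: classify(bytes[start]), start, end: index });
            start = index;
        }
    }

    tokens
}
```

Because each token starts where the previous one ended, concatenating `&input[token.start..token.end]` over all tokens gives back the original input. That is the "every byte is accounted for" property in miniature.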

While most markdown parsers work towards compliancy with CommonMark (or GFM), this project goes further by following how the reference parsers (cmark, cmark-gfm) work, which is confirmed with thousands of extra tests.

Other than CommonMark and GFM, this project also supports common extensions to markdown such as MDX, math, and frontmatter.

This Rust crate has a sibling project in JavaScript: micromark (and mdast-util-from-markdown for the AST).

P.S. if you want to compile MDX, use mdxjs-rs.

Install

With Rust (Rust edition 2018+, ±version 1.56+), install with cargo:

cargo add markdown@1.0.0-alpha.11

👉 Note: this is a new crate that reuses an old name. The old crate (0.3.0 and lower) has a bunch of problems. Make sure to use the new crate, currently in alpha at 1.0.0-alpha.11.

Use

fn main() {
    println!("{}", markdown::to_html("## Hello, *world*!"));
}

Yields:

<h2>Hello, <em>world</em>!</h2>

Extensions (in this case GFM):

fn main() -> Result<(), String> {
    println!(
        "{}",
        markdown::to_html_with_options(
            "* [x] contact@example.com ~~strikethrough~~",
            &markdown::Options::gfm()
        )?
    );

    Ok(())
}

Yields:

<ul>
  <li>
    <input checked="" disabled="" type="checkbox" />
    <a href="mailto:contact@example.com">contact@example.com</a>
    <del>strikethrough</del>
  </li>
</ul>

Syntax tree (mdast):

fn main() -> Result<(), String> {
    println!(
        "{:?}",
        markdown::to_mdast("# Hey, *you*!", &markdown::ParseOptions::default())?
    );

    Ok(())
}

Yields:

Root { children: [Heading { children: [Text { value: "Hey, ", position: Some(1:3-1:8 (2-7)) }, Emphasis { children: [Text { value: "you", position: Some(1:9-1:12 (8-11)) }], position: Some(1:8-1:13 (7-12)) }, Text { value: "!", position: Some(1:13-1:14 (12-13)) }], position: Some(1:1-1:14 (0-13)), depth: 1 }], position: Some(1:1-1:14 (0-13)) }
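
The positions in the output above read as start line:column, end line:column, and then the start and end offsets in parentheses. Lines and columns are 1-indexed and offsets are 0-indexed. The following sketch shows how such a position can be derived from a byte offset. The `Point` struct here is a stand-in written for this example (its field names follow unist), and `point_at` and `format_position` are hypothetical helpers, not functions of the crate. The sketch also assumes columns are counted in bytes.

```rust
/// One place in a source file (a stand-in; field names follow unist).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Point {
    line: usize,   // 1-indexed
    column: usize, // 1-indexed
    offset: usize, // 0-indexed byte offset
}

/// Resolve a byte offset into a line and column by looking at the line feeds before it.
/// Panics if `offset` is larger than the length of `source`.
fn point_at(source: &str, offset: usize) -> Point {
    let before = &source.as_bytes()[..offset];
    let line = 1 + before.iter().filter(|&&byte| byte == b'\n').count();
    // The current line starts right after the last line feed, or at the start of the source.
    let line_start = before
        .iter()
        .rposition(|&byte| byte == b'\n')
        .map_or(0, |index| index + 1);

    Point { line, column: offset - line_start + 1, offset }
}

/// Format a start and end point in the same shape as the debug output above.
fn format_position(start: Point, end: Point) -> String {
    format!(
        "{}:{}-{}:{} ({}-{})",
        start.line, start.column, end.line, end.column, start.offset, end.offset
    )
}
```

For the source `# Hey, *you*!`, the text `Hey, ` spans offsets 2 to 7, so `format_position(point_at(source, 2), point_at(source, 7))` gives `1:3-1:8 (2-7)`, matching the first `Text` node above.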

API

markdown-rs exposes to_html, to_html_with_options, to_mdast, Options, and a few other structs and enums.

See the crate docs for more info.

Extensions

markdown-rs supports extensions to CommonMark. These extensions are maintained in this project. They are not enabled by default but can be turned on with options.

  • frontmatter
  • GFM
    • autolink literal
    • footnote
    • strikethrough
    • table
    • tagfilter
    • task list item
  • math
  • MDX
    • ESM
    • expressions
    • JSX

It is not a goal of this project to support lots of different extensions. It’s instead a goal to support very common and mostly standardized extensions.

Project

markdown-rs is maintained as a single monolithic crate.

Overview

The process to parse markdown looks like this:

                    markdown-rs
+-------------------------------------------------+
|            +-------+         +---------+--html- |
| -markdown->+ parse +-events->+ compile +        |
|            +-------+         +---------+-mdast- |
+-------------------------------------------------+
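
The two steps in the diagram can be sketched in a few lines. This is a toy, not the code in parser.rs or to_html.rs: the `Name` and `Event` types and the `parse` and `compile` functions are assumptions invented for illustration. The toy grammar treats every `*` as an emphasis toggle, which is far simpler than CommonMark, and the compile step does not encode HTML.

```rust
/// Names of things that can start and end (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Name {
    Paragraph,
    Emphasis,
}

/// What the parse step hands to the compile step.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    Enter(Name),
    Exit(Name),
    Data(String),
}

/// Parse step: turn one line of text into events, toggling emphasis on each `*`.
fn parse(input: &str) -> Vec<Event> {
    let mut events = vec![Event::Enter(Name::Paragraph)];
    let mut in_emphasis = false;

    for (index, chunk) in input.split('*').enumerate() {
        if index > 0 {
            // Every `*` between two chunks opens or closes emphasis.
            events.push(if in_emphasis {
                Event::Exit(Name::Emphasis)
            } else {
                Event::Enter(Name::Emphasis)
            });
            in_emphasis = !in_emphasis;
        }
        if !chunk.is_empty() {
            events.push(Event::Data(chunk.to_string()));
        }
    }

    if in_emphasis {
        // Keep enter and exit events balanced when a marker is never closed.
        events.push(Event::Exit(Name::Emphasis));
    }
    events.push(Event::Exit(Name::Paragraph));
    events
}

/// Compile step: turn events into a string of HTML.
fn compile(events: &[Event]) -> String {
    let mut html = String::new();
    for event in events {
        match event {
            Event::Enter(Name::Paragraph) => html.push_str("<p>"),
            Event::Exit(Name::Paragraph) => html.push_str("</p>"),
            Event::Enter(Name::Emphasis) => html.push_str("<em>"),
            Event::Exit(Name::Emphasis) => html.push_str("</em>"),
            Event::Data(text) => html.push_str(text),
        }
    }
    html
}
```

With these definitions, `compile(&parse("Hello, *world*!"))` produces `<p>Hello, <em>world</em>!</p>`. A second compile function could walk the same events and build a tree instead, which is the split between the html and mdast outputs in the diagram.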

File structure

The files in src/ are as follows:

  • construct/*.rs: CommonMark, GFM, and other extension constructs used in markdown
  • util/*.rs: helpers often needed when parsing markdown
  • event.rs: things with meaning happening somewhere
  • lib.rs: public API
  • mdast.rs: syntax tree
  • parser.rs: turn a string of markdown into events
  • resolve.rs: steps to process events
  • state.rs: steps of the state machine
  • subtokenize.rs: handle content in other content
  • to_html.rs: turns events into a string of HTML
  • to_mdast.rs: turns events into a syntax tree
  • tokenizer.rs: glue the states of the state machine together
  • unist.rs: point and position, used in mdast

Test

markdown-rs is tested with the ~650 CommonMark tests and more than 1k extra tests confirmed with the CommonMark reference parsers. There are even more tests for GFM and other extensions. These tests reach all branches in the code, which means that this project has 100% code coverage. Fuzz testing is used to check for things that might fall through coverage.

The following bash scripts are useful when working on this project:

  • generate code (latest CM tests and Unicode info):
    cargo run --manifest-path generate/Cargo.toml
  • run examples:
    RUST_BACKTRACE=1 RUST_LOG=debug cargo run --features log --example lib
  • format:
    cargo fmt
  • lint:
    cargo fmt --check && cargo clippy --examples --tests --benches
  • test:
    RUST_BACKTRACE=1 cargo test
  • docs:
    cargo doc --document-private-items
  • fuzz:
    cargo install cargo-fuzz
    cargo install honggfuzz
    cargo +nightly fuzz run markdown_libfuzz
    cargo hfuzz run markdown_honggfuzz

Version

markdown-rs follows SemVer.

Security

The typical security aspect discussed for markdown is cross-site scripting (XSS) attacks. Markdown itself is safe if it does not include embedded HTML or dangerous protocols in links/images (such as javascript: or data:). markdown-rs makes any markdown safe by default, even if HTML is embedded or dangerous protocols are used, as it encodes or drops them. Turning on the allow_dangerous_html or allow_dangerous_protocol options for user-provided markdown opens you up to XSS attacks.
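
As a rough illustration of what "encodes or drops" can mean, the sketch below encodes characters that are unsafe in HTML and drops URLs whose protocol is not on an allow list. It is an assumption-laden stand-in, not the crate's implementation: `encode` and `sanitize_url` are hypothetical names, and the allow list passed in is chosen by the caller.

```rust
/// Encode the characters that are unsafe in HTML text and attribute values.
fn encode(value: &str) -> String {
    let mut result = String::with_capacity(value.len());
    for character in value.chars() {
        match character {
            '&' => result.push_str("&amp;"),
            '<' => result.push_str("&lt;"),
            '>' => result.push_str("&gt;"),
            '"' => result.push_str("&quot;"),
            _ => result.push(character),
        }
    }
    result
}

/// Return the URL if it is relative or its protocol is allowed, or an empty string otherwise.
fn sanitize_url<'a>(url: &'a str, allowed: &[&str]) -> &'a str {
    // Only a colon before the first `/`, `?`, or `#` marks a protocol.
    let end = url
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(url.len());

    match url[..end].find(':') {
        // No protocol: a relative URL, which is kept.
        None => url,
        Some(colon) => {
            // Protocols are case-insensitive, so compare in lowercase.
            let protocol = url[..colon].to_ascii_lowercase();
            if allowed.iter().any(|&candidate| candidate == protocol) {
                url
            } else {
                ""
            }
        }
    }
}
```

With an allow list of `["http", "https", "mailto"]`, `sanitize_url("JavaScript:alert(1)", &allowed)` returns an empty string while `sanitize_url("/a/b:c", &allowed)` is returned unchanged. Checking against an allow list rather than a deny list means unknown or obfuscated protocols are dropped too.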

An aspect related to XSS for security is syntax errors: markdown itself has no syntax errors. Some syntax extensions (specifically, only MDX) do include syntax errors. For that reason, to_html_with_options returns Result<String, String>, of which the error is a simple string indicating where the problem happened, what occurred, and what was expected instead. Make sure to handle your errors when using MDX.

Another security aspect is DDoS attacks. For example, an attacker could throw a 100 MB file at markdown-rs, in which case it’s going to take a long while to finish. It is also possible to crash markdown-rs with smaller payloads, notably when thousands of links, images, emphasis, or strong are opened but not closed. It is wise to cap the accepted size of input (500 KB can hold a big book) and to process content in a different thread so that it can be stopped when needed.
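
One possible way to apply that advice with only the standard library is sketched below. The `MAX_INPUT_BYTES` constant and the `render_with_limits` helper are hypothetical names for this example, and the rendering function is passed in by the caller rather than hard-coded. Note one limitation: a standard library thread cannot be killed, so on timeout this sketch only stops waiting while the worker keeps running in the background. A hard stop would need a separate process.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Largest input this sketch accepts, following the size suggested above.
const MAX_INPUT_BYTES: usize = 500 * 1024;

/// Run `render` on `input` in a worker thread and give up waiting after `timeout`.
fn render_with_limits<F>(input: String, timeout: Duration, render: F) -> Result<String, String>
where
    F: FnOnce(&str) -> Result<String, String> + Send + 'static,
{
    // Reject oversized input before doing any work.
    if input.len() > MAX_INPUT_BYTES {
        return Err(format!(
            "input is {} bytes, limit is {}",
            input.len(),
            MAX_INPUT_BYTES
        ));
    }

    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may be gone if the caller timed out; ignore that error.
        let _ = sender.send(render(&input));
    });

    match receiver.recv_timeout(timeout) {
        Ok(result) => result,
        // Covers both a timeout and a worker that panicked before sending.
        Err(_) => Err("rendering did not finish".to_string()),
    }
}
```

To use it with this crate, the closure could be something like `|text| markdown::to_html_with_options(text, &markdown::Options::gfm())`, which matches the `Result<String, String>` return type described above.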

For more information on markdown sanitation, see improper-markup-sanitization.md by @chalker.

Contribute

See contributing.md for ways to help. See support.md for ways to get help. See code-of-conduct.md for how to communicate in and around this project.

Sponsor

Support this effort and give back by sponsoring:

Thanks

Special thanks go out to:

Related

  • micromark: same as markdown-rs but in JavaScript
  • mdxjs-rs: wraps markdown-rs to compile MDX to JavaScript

License

MIT © Titus Wormer
