• This repository has been archived on 28/Jul/2022
  • Stars
    star
    105
  • Rank 328,196 (Top 7 %)
  • Language
    HTML
  • License
    BSD 3-Clause "New...
  • Created almost 11 years ago
  • Updated about 4 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Experimental markdown processor in Haskell

Cheapskate

Note: This library is unmaintained (by me anyway). I recommend using commonmark.

This is an experimental Markdown processor in pure Haskell. (A cheapskate is always in search of the best markdown.) It aims to process Markdown efficiently and in the most forgiving possible way. It is about seven times faster than pandoc and uses a fifth the memory. It is also faster, and considerably more accurate, than the markdown package on Hackage.

There is no such thing as an invalid Markdown document. Any string of characters is valid Markdown. So the processor should finish efficiently no matter what input it gets. Garbage in should not cause an error or exponential slowdowns. This processor has been tested on many large inputs consisting of random strings of characters, with performance that is consistently linear with the input size. (Try make fuzztest.)

Installing

To build, get the Haskell Platform, then:

cabal update && cabal install

This will install both the cheapskate executable and the Haskell library. A man page can be found in man/man1 in the source.

Usage

As an executable:

cheapskate [FILE*]

As a library:

import Cheapskate
import Text.Blaze.Html

toMarkdown :: Text -> Html
toMarkdown = toHtml . markdown def

If the markdown input you are converting comes from an untrusted source (e.g. a web form), you should always set sanitize to True. This causes the generated HTML to be filtered through xss-sanitize's sanitizeBalance function. Otherwise you risk a XSS attack from raw HTML or a markdown link or image attribute attribute.

You may also wish to disallow users from entering raw HTML for aesthetic, rather than security reasons. In that case, set allowRawHtml to False, but let sanitize stay True, since it still affects attributes coming from markdown links and images.

Manipulating the parsed document

You can manipulate the parsed document before rendering using the walk and walkM functions. For example, you might want to highlight code blocks using highlighting-kate:

import Data.Text as T
import Data.Text.Lazy as TL
import Cheapskate
import Text.Blaze.Html
import Text.Blaze.Html.Renderer.Text
import Text.Highlighting.Kate

markdownWithHighlighting :: Text -> Html
markdownWithHighlighting = toHtml . walk addHighlighting . markdown def

addHighlighting :: Block -> Block
addHighlighting (CodeBlock (CodeAttr lang _) t) =
  HtmlBlock (T.concat $ TL.toChunks
             $ renderHtml $ toHtml
             $ formatHtmlBlock defaultFormatOpts
             $ highlightAs (T.unpack lang) (T.unpack t))
addHighlighting x = x

Extensions

This processor adds the following Markdown extensions:

Hyperlinked URLs

All absolute URLs are automatically made into hyperlinks, where inside <> or not.

Fenced code blocks

Fenced code blocks with attributes are allowed. These begin with a line of three or more backticks or tildes, followed by an optional language name and possibly other metadata. They end with a line of backticks or tildes (the same character as started the code block) of at least the length of the starting line.

Explicit hard line breaks

A hard line break can be indicated with a backslash before a newline. The standard method of two spaces before a newline also works, but this gives a more "visible" alternative.

Backslash escapes

All ASCII symbols and punctuation marks can be backslash-escaped, not just those with a use in Markdown.

Revisions

In departs from the markdown syntax document in the following ways:

Intraword emphasis

Underscores cannot be used for word-internal emphasis. This prevents common mistakes with filenames, usernames, and indentifiers. Asterisks can still be used if word internal emphasis is needed.

The exact rule is this: an underscore that appears directly after an alphanumeric character does not begin an emphasized span. (However, an underscore directly before an alphanumeric can end an emphasized span.)

Ordered lists

The starting number of an ordered list is now significant. Other numbers are ignored, so you can still use 1. for each list item.

In addition to the 1. form, you can use 1) in your ordered lists. A new list starts if you change the form of the delimiter. So, the following is two lists:

1. one
2. two
1) one
2) two

Bullet lists

A new bullet lists starts if you change the bullet marker. So, the following is two consecutive bullet lists:

+ one
+ two
- one
- two

List separation

Two blank lines breaks out of a list. This allows you to have consecutive lists:

- one

- two


- one (new list)

The blank lines break out of a list no matter how deeply it is nested:

- one
  - two
    - three


  - new top-level list

Indentation of list continuations

Block elements inside list items need not be indented four spaces. If they are indented beyond the bullet or numerical list marker, they will be considered additional blocks inside the list item. So, the following is a list item with two paragraphs:

- one

 two

The amount of indentation required for an indented code block inside a list item depends on the first line of the list item. Generally speaking, code must be indented four spaces past the first non-space character after the list marker. Thus:

 -   My code

         {code here}

 - My code

       {code here}

The following diagram shows how the first line of a list item divides the following lines into three regions:

 -   My code
  |     |
  +-----+

Content to the left of the marked region will not be part of the list item. Content to the right of the marked region will be indented code under the list item. Regular blocks that belong under the list item should start inside the marked region.

When the first line itself contains indented code, this code and subsequent indented code blocks should be indented five spaces past the list marker:

 -     { code }

       { more code }

Raw HTML blocks

Raw HTML blocks work a bit differently than in Markdown.pl. A raw HTML block starts with a block-level HTML tag (opening or closing), or a comment start <!-- or end -->, and goes until the next blank line. The whole block is included as raw HTML. No attempt is made to parse balanced tags. This means that in the following, the asterisks are literal asterisks:

<div>
*hello*
</div>

while in the following, the asterisks are interpreted as markdown emphasis:

<div>

*hello*

</div>

In the first example, we have a single raw HTML block; in the second, we have two raw HTML blocks with an intervening paragraph. This system provides flexibility to authors to use enclose markdown sections in html block-level tags if they wish, while also allowing them to include verbatim HTML blocks (taking care that the don't include any blank lines).

As a consequence of this rule, HTML blocks may not contain blank lines.

Clarifications

This implementation resolves the following issues left vague in the markdown syntax document:

Tight vs. loose lists

A list is considered "tight" if (a) it has only one item or there is no blank space between any two consecutive items, and (b) no item has blank lines as its immediate children. If a list is "tight," then list items consisting of a single paragraph or a paragraph followed by a sublist will be rendered without <p> tags.

Sublists

Sublists work like other block elements inside list items; they must be indented past the bullet or numerical list marker (but no more than three spaces past, or they will be interpreted as indented code).

ATX headers

ATX headers must have a space after the initial ###s.

Separation of block quotes

A blank line will end a blockquote. So, the following is a single blockquote:

> hi
>
> there

But this is two blockquotes:

> hi

> there

Blank lines are not required before horizontal rules, blockquotes, lists, code blocks, or headers. They are not required after, either, though in many cases "laziness" will effectively require a blank line after. For example, in

Hello there.
> A quote.
Still a quote.

the "Still a quote." is part of the block quote, because of laziness (the ability to leave off the > from the beginning of subsequent lines). Laziness also affects lists. However, we can have a code block, ATX header, or horizontal rule between two paragraphs without any blank lines.

Link references

Link references may occur anywhere in the document, even in nested list contexts. They need not be at the outer level.

Tests

The tests subdirectory contains an extensive suite of tests, including all of John Gruber's original Markdown tests, plus many of the tests from Michel Fortin's mdtest suite. Each test consists in two files with the same basename, a markdown source and an expected HTML output.

To run the test suite, do

make test

To run only tests that match a regex pattern, do

PATT=Orig make test

Setting the environment variable TIDY=1 will run the expected and actual output through tidy before comparing them. You can run this test suite on another markdown processor by doing

PROG=myothermarkdown make test

Benchmarks

To run a crude benchmark comparing cheapskate to pandoc, do make bench. Set the BENCHPROGS environment variable to compare to other implementations.

License

Copyright © 2012, 2013, 2014 John MacFarlane.

The library is released under the BSD license; see LICENSE for terms.

Some of the test cases are borrowed from Michel Fortin's mdtest suite and John Gruber's original markdown test suite.

More Repositories

1

pandoc

Universal markup converter
Haskell
34,313
star
2

gitit

A wiki using HAppS, pandoc, and git
Haskell
2,126
star
3

djot

A light markup language
HTML
1,557
star
4

peg-markdown

An implementation of markdown in C, using a PEG grammar
C
686
star
5

pandocfilters

A python module for writing pandoc filters, with a collection of examples
Python
511
star
6

pandoc-templates

Templates for pandoc, tagged to release
HTML
418
star
7

yst

create static websites from YAML data and string templates
Haskell
373
star
8

texmath

A Haskell library for converting LaTeX math to MathML.
Haskell
291
star
9

pandoc-citeproc

Library and executable for using citeproc with pandoc
Haskell
288
star
10

lunamark

Lua library for conversion between markup formats
C
192
star
11

skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions
Haskell
185
star
12

citeproc

CSL citation processing library in Haskell
Haskell
138
star
13

commonmark-hs

Pure Haskell commonmark parsing library, designed to be flexible and extensible
Haskell
130
star
14

djot.js

JavaScript implementation of djot
TypeScript
120
star
15

highlighting-kate

A syntax highlighting library in Haskell, based on Kate syntax definitions
HTML
109
star
16

pandoc-types

types for representing structured documents
Haskell
105
star
17

gitit2

A reimplementation of gitit in Yesod
Haskell
94
star
18

lcmark

Flexible CommonMark converter
Lua
54
star
19

doctemplates

Pandoc-compatible templating system
Haskell
49
star
20

zip-archive

Native Haskell library for working with zip archives
Haskell
44
star
21

cmark-hs

Haskell bindings to libcmark commonmark parser
C
43
star
22

djot.lua

Lua parser for the djot light markup language
Lua
39
star
23

typst-hs

Haskell library for parsing and evaluating typst
Haskell
32
star
24

dotvim

My vim configuration
Vim Script
30
star
25

scripts

A collection of small scripts to do various things
Shell
28
star
26

filestore

A versioning file store backed by git, darcs, or mercurial
Haskell
28
star
27

pandoc-website

Source files for pandoc's website
Lua
28
star
28

illuminate

An efficient syntax highlighting library in Haskell, using alex-generated lexers
Haskell
26
star
29

emojis

Haskell library for emojis
Haskell
25
star
30

markdown-peg

A Haskell implementation of markdown using a PEG grammar
Haskell
24
star
31

pandoc-server

Simple server app for pandoc conversions.
Haskell
20
star
32

doclayout

A prettyprinting library designed for laying out plain text documents
Haskell
20
star
33

standalone-html

Incorporates external dependencies into HTML file using data: URI scheme
Haskell
19
star
34

pandoc-tex2svg

Pandoc filter to convert math to SVG using MathJax-node's tex2svg
HTML
19
star
35

cloudlib

tools for keeping a library of books and articles on Amazon's S3 and SimpleDB
Ruby
19
star
36

cmark-lua

Lua bindings to libcmark CommonMark parser
C
17
star
37

HeX

a flexible text macro system
Haskell
17
star
38

djoths

Haskell parser for the djot light markup language
Haskell
17
star
39

unicode-collation

Haskell implementation of the Unicode Collation Algorithm
Haskell
16
star
40

sep-offprint

Creates formatted "offprints" of Stanford Encyclopedia of Philosophy entries.
15
star
41

BayHac2014

Slides for my presentation on pandoc at BayHac2014
TeX
14
star
42

cmarkpdf

Steps towards a PDF renderer for cmark using libharu
C
14
star
43

lunamark-standalone

Standalone version of lunamark (compiled with no library dependencies)
C
12
star
44

commonmarker

Ruby wrapper for libcmark (CommonMark parser)
Ruby
12
star
45

ipynb

Data structures and JSON serializer/deserializer for Jupyter notebooks (.ipynb) format.
Jupyter Notebook
11
star
46

hsb2hs

Preprocessor for inserting literals with binary blobs into Haskell programs.
Haskell
11
star
47

gogar

Computer implementation of Robert Brandom's "game of giving and asking for reasons," from Making It Explicit, chapter 3.
Ruby
10
star
48

emacsd

emacs configuration
Emacs Lisp
9
star
49

hscommonmark

pure Haskell CommonMark parser
Haskell
9
star
50

recaptcha

Haskell library for using the reCAPTCHA service
Haskell
8
star
51

select-meta

Pandoc lua filter for constructing metadata from YAML data sources using queries
Lua
8
star
52

html2cmark

Lua library to convert HTML5 to commonmark
Lua
8
star
53

citeproc-hs-bin

Command-line interface to the citeproc-hs CSL citation processing library
Haskell
8
star
54

grammata

Well-typed system for generating documents in multiple formats
Haskell
7
star
55

hw2gitit

Script to convert haskellwiki pages to a gitit wiki
Haskell
7
star
56

ecstatic

Static website management using tenjin templates and YAML data files
Ruby
7
star
57

hsgit

A higher-level interface to libgit2 functions than hlibgit2
Haskell
6
star
58

trypandoc

Live demo of pandoc
JavaScript
6
star
59

pandoc-highlight

Filter and library for using pandoc with highlighting-kate
Haskell
6
star
60

commonmark-lua

Lua binding to libcmark commonmark parser
Lua
5
star
61

rfc5051

Haskell implementation of RFC5051, simple unicode collation.
Haskell
5
star
62

jgm.github.com

jgm's web pages on github
4
star
63

rocks

luarocks repository
4
star
64

cmark-fuzz-data

A minimal fuzz test suite for cmark created by american fuzzy lop and afl-cmin
3
star
65

GHCUnicodeAlt

Improved version of GHC.Unicode, with benchmarks
Haskell
3
star
66

luacmark

Lua binding to CommonMark
C
2
star
67

typst-symbols

Defines symbols and emoji used in typst
Haskell
2
star