A tree-sitter parser for the typst file format
This language is soooo hard to parse… whitespace, parentheses for everything, and Unicode :(
DONE:
- [O] Code mode:
- # to enter code mode
- any literal:
1, "hi", true, false, none, auto
- raw and labels are literals
- code block:
{ x = 1 }
- content block:
[ hello ]
- parenthesized expression:
(1 + 2)
- array:
(1, 2, 3)
- dictionary:
(a: "hi", b: 2)
- unary operator:
-x
- binary operator:
x + y
- assignment:
x = 1
- variable access:
x
- field access:
x.y
- method call:
x.flatten()
- named function:
let f(x) = 2 * x
- unnamed function:
(x, y) => x + y
- function call:
min(x, y)
- let binding:
let x = 1
- set rule:
set text(14pt)
- set-if rule:
set text(..) if ..
- show-set rule:
show par: set block(..)
- show rule with function:
show par: it => {..}
- show-everything rule:
show: set block(..)
- conditional:
if x < 0 {0} else {x}
- for loop:
for x in (1, 2, 3) {..}
- while loop:
while x < 10 {}
- loop control flow:
break, continue
- return from function:
return x
- include module:
include "bar.typ"
- import module:
import "bar.typ"
- import items from module:
import "bar.typ": a, b, c
- comment:
// hi or /* hi */
- Math mode
- Everything :)
- Markup mode
- Whitespace (Unicode)
- paragraph break
- text (Unicode)
- emphasis
- strong
- italic
- label
- reference
- raw text
- inline
- block
- link
- heading
- bullet list
- numbered list
- term list
- math
- line break
- smart quote
- single quote
- double quote
- symbol shorthand
- code expression
- character escape
- comment.
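The block-level items in the Markup mode list (heading, bullet list, numbered list, term list) are all decided by what a line starts with. Here is a rough Python sketch of that classification, assuming the current Typst markers (`=`, `-`, `+` or `1.`, `/ Term:`). The names `LINE_RULES` and `classify_line` are made up for illustration and are not part of this parser.

```python
import re

# (kind, pattern) pairs, tried in order against the start of a line.
# Assumption: every marker must be followed by at least one space.
LINE_RULES = [
    ("heading", re.compile(r"(=+) +(.*)")),
    ("bullet list", re.compile(r"- +(.*)")),
    ("numbered list", re.compile(r"(?:\+|\d+\.) +(.*)")),
    ("term list", re.compile(r"/ +([^:]*): *(.*)")),
]

def classify_line(line):
    """Return the kind of block-level construct a markup line starts, or 'text'."""
    stripped = line.lstrip()
    for kind, pattern in LINE_RULES:
        if pattern.match(stripped):
            return kind
    return "text"
```

A line like `-x` falls through to `text` because the marker is not followed by a space, which is the kind of whitespace sensitivity that makes this language awkward to parse.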
The (outdated) specification comes from: https://www.user.tu-berlin.de/laurmaedje/programmable-markup-language-for-typesetting.pdf
I'll be using the TextMate grammar as inspiration: https://github.com/typst/typst/blob/main/tools/support/typst.tmLanguage.json
For my own reference, I'll paste the grammar from the specification here:
Typst Grammar
Below is an approximate EBNF grammar for the Typst language that is based on our handwritten recursive descent parser. We follow these conventions:
- Production names are all lowercase.
- Text enclosed in single (') or double quotes (") defines a terminal.
- * for an arbitrary number of repetitions.
- + for at least one repetition.
- ? for zero or one repetitions.
- ! to negate a simple (character-class-like) production.
- . to match an arbitrary character.
- a - b to match anything that matches a but not b.
- unicode(Property) to match any character that has the given unicode property.
Note that comments and spaces are allowed almost everywhere within code constructs. For readability, this is omitted in the grammar. Moreover, the grammar omits the indentation rules for lists, as EBNF cannot handle context-sensitive constructs.
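To make the indentation remark concrete: whether a `-` item is nested depends on how far it is indented relative to the items before it, so the parser has to carry state that a plain grammar rule cannot express. Here is a minimal Python sketch of that idea, assuming a simplified rule where depth is tracked with a stack of indentation columns. The function name `list_depths` is hypothetical, and this is not necessarily how Typst or this parser handles it.

```python
def list_depths(lines):
    """Return (depth, text) for each '- ' item, tracking indentation with a stack."""
    stack = []   # indentation columns of the currently open lists
    result = []
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped.startswith("- "):
            continue  # this sketch only looks at bullet items
        indent = len(line) - len(stripped)
        # Close every list that is indented deeper than this item.
        while stack and stack[-1] > indent:
            stack.pop()
        # Open a nested list if this item is indented further than the current one.
        if not stack or stack[-1] < indent:
            stack.append(indent)
        result.append((len(stack) - 1, stripped[2:]))
    return result
```

In a tree-sitter grammar, this kind of state would typically live in an external scanner rather than in the grammar rules themselves.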
// Markup.
markup ::= markup-node*
markup-node ::=
space | nbsp | shy | endash | emdash | ellipsis | quote |
strong | emph | raw | link | math | heading | list | enum | desc
// Markup nodes.
nbsp ::= '~'
shy ::= '-?'
endash ::= '--'
emdash ::= '---'
ellipsis ::= '...'
quote ::= "'" | '"'
strong ::= '*' markup '*'
raw ::= '`' (raw | .*) '`'
link ::= 'http' 's'? '://' (!space)*
math ::= ('$' .* '$') | ('$[' .* ']$')
heading ::= '='+ space markup
list ::= '-' space markup
enum ::= digit* '.' space markup
desc ::= '/' space markup ':' space markup
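The shorthand productions overlap: '--' is a prefix of '---', and '-' also starts `shy` and `list`. A tokenizer therefore has to try the longest alternative first. Here is a minimal Python sketch of that longest-match rule, assuming the productions exactly as pasted above. The names `SHORTHANDS`, `LINK` and `tokenize` are hypothetical, and headings, lists and nested markup are deliberately ignored.

```python
import re

# Longest alternatives first so that '---' wins over '--'.
SHORTHANDS = [
    ("emdash", "---"),
    ("ellipsis", "..."),
    ("endash", "--"),
    ("shy", "-?"),
    ("nbsp", "~"),
]

# link ::= 'http' 's'? '://' (!space)*
LINK = re.compile(r"https?://\S*")

def tokenize(text):
    """Split text into (kind, lexeme) pairs; anything unrecognized is 'text'."""
    tokens = []
    buf = []  # pending plain-text characters
    i = 0

    def flush():
        # Emit the buffered plain text as a single token.
        if buf:
            tokens.append(("text", "".join(buf)))
            buf.clear()

    while i < len(text):
        m = LINK.match(text, i)
        if m:
            flush()
            tokens.append(("link", m.group()))
            i = m.end()
            continue
        for kind, lexeme in SHORTHANDS:
            if text.startswith(lexeme, i):
                flush()
                tokens.append((kind, lexeme))
                i += len(lexeme)
                break
        else:
            # No shorthand matched here: treat the character as plain text.
            buf.append(text[i])
            i += 1
    flush()
    return tokens
```

Ordering the table by length is the hand-written equivalent of what a lexer generator does automatically when it prefers the longest match.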