• Stars
    star
    449
  • Rank 97,328 (Top 2 %)
  • Language
    Java
  • License
    MIT License
  • Created almost 7 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A tool that translates augmented markdown into HTML or latex

Bookish

Bookish is an xml-ish + some markdown format for books and articles that it can convert to HTML and latex. I used it to generate this article: The Matrix Calculus You Need For Deep Learning.

You can use python directly in the doc like a notebook to compute and print stuff:

and display data frames:

and even show matplotlib graphs:

As see below, it also does some really fancy magic to convert full latex equations (or even latex chunks) to SVG images for display inline (tricky to get vertical alignment correct.)

Meta-language

Bookish is mostly XML-like but uses markdown for the more common things like italics and code fonts. (Note that the xml tags do not always have an end tag or even the trailing /' as in <.../> .)

Bookish requires a root document that is kind of like a metadata file:

<book title="A simple book" author="T. Parr">

<include file=chap1.xml>
<include file=chap2.xml>

Then the chapter files look like:

<chapter label="intro" title="An intro">

Some text *foo* and `this` is code. Ref [summary], which is forward ref in
another file. Links are [cnn](http://www.cnn.com).

Cheatsheet

Here are the tags that contain attributes, not all of which are required:

<book label="..." author="..." title="..." version="...">
<chapter label="..." author="..." title="...">
<data dir="...">
<notebook-support file="..."> ;
<include file="...">
<section label="..." title="...">
<subsection label="..." title="...">
<subsubsection label="..." title="...">
<site label="..." url="...">
<citation label="..." title="..." author="...">
<chapquote quote="..." author="...">
<sidequote label="..." quote="..." author="...">
<sidenote label="..."> ... </sidenote>
<sidefig label="..." caption="..."> ... </sidefig>
<figure label="..." caption="..."> ... </figure>
<aside title="..."> ...
<pyeval label="..." output="..." hide="..."> ... </pyeval>
<pyfig label="..." side="..." hide="..."> ... </pyfig>
<py label="..."> ... or <py>...</py>
<th width="...">
<img src="..." width="...">

Origins in math-infested markup

Jeremy Howard and I wrote up a nice mathy latex document called ``The Matrix Calculus You Need For Deep Learning'' that has over 600 equations. We wanted to post it to the web in HTML or markdown but quickly ran into a problem trying to get equations rendered.

In the end we converted the source document to markdown and build a translator that generated HTML using SVG for equations and PDF from native latex equations. It does a pretty good job with html as you can see:

All of those equations, even the ones inline in the text paragraph, are <img> references.

Here is the raw matrix-calculus.md that bookish processed to generate those documents.

What's so hard about rendering equations?

If you're doing markdown or HTML, people tend to use MathJax or its faster cousin Katex. MathJax is just too slow when you have 600 equations. Katex is much better but it (and MathJax) requires every &, _, etc... be escaped as \&, \_ to avoid getting processed as markdown. That's no problem because I built a translator that escaped everything for me. Then I found out that the JavaScript parser that extracted the latex equation strings was extremely finicky. I had to randomly insert spaces in my equations trying to get them recognized as equations.

There's another problem. Is all of that JavaScript gonna work in epub formats? What about the Kindle? Because I'm hoping to write a book on machine learning, I'm leary of relying on full-blown JavaScript to render equations.

I tried pandoc and a few other tools like multimarkdown but not everything came through correctly to the translated output and I got tired of chasing all of this down.

As the ANTLR guy, I ain't afeared of building a language translator and so, following my motto ``Why program by hand in five days what you can spend five years of your life automating'', I decided to simply solve this problem by building my own markdown translator.

How to typeset and display math via SVG

If you can't use JavaScript, you have to use images. If you have to use images, you want scalable graphics, which means SVG files. So, the translator must extract equations and replace them with <img> tags referencing SVG files. That part is not too hard; take a look at Tex2SVG and you'll see that I'm just running three programs in sequence to process the equation into an SVG file: xelatex then pdfcrop then pdf2svg.

The really tricky bit is the vertical alignment of equations within a line of HTML text. Check out this sentence with embedded equations:

(I had to take a snapshot and show that instead of giving raw HTML plus equations; github's markdown processor didn't handle it properly. haha.)

What does it mean to properly align an equation's image? It's painful. We need to convince latex to give us metrics on how far the typeset image drops below the baseline. (Latex calls this the depth.) It took a while, but I figured out how to not only compute the depth below baseline but also how to get it back into this Java program via the latex log file. You can see how all of this is done here: Translator.visitEqn(). Here is the latex incantation to extract height and depth of the rendered equation:

\begin{document}
\thispagestyle{empty}
<body>
\setbox0=\vbox{<body>}
\typeout{// bookish metrics: \the\ht0, \the\dp0}
\end{document}

where <body> is the hole where the equation goes.

Oh, and to get the font to look less anemic, you need to set the math fonts:

\DeclareSymbolFont{operators}   {OT1}{ztmcm}{m}{n}
\DeclareSymbolFont{letters}     {OML}{ztmcm}{m}{it}
\DeclareSymbolFont{symbols}     {OMS}{ztmcm}{m}{n}
\DeclareSymbolFont{largesymbols}{OMX}{ztmcm}{m}{n}
\DeclareSymbolFont{bold}        {OT1}{ptm}{bx}{n}
\DeclareSymbolFont{italic}      {OT1}{ptm}{m}{it}

One last little tidbit. Image file names are based upon the MD5 digest hash of the equation. There are two benefits: (1) repeated equations share the same file and (2) latex is slow, like 1 second per equation, but the hashed filename lets us cache all of the images and know when we must refresh an image because the equation changed.

It's safe to stop reading here. You can learn everything you need to know about doing this yourself from this description and the source code. This repository is just getting started and is in progress so don't expect a tool you can use yourself, at least at the moment.

Implementation

You will also notice that I have built this program as if it were a programming language translator. The strategy I use is to construct a model of the document from the parse tree using a visitor. Then I use a fiendishly clever bit of code to automatically convert that representation of the document into a tree of string templates. Of course the set of templates you use determines what output you get. Change the templates and you change the target language. For example here are the HTML templates.

More Repositories

1

dtreeviz

A python library for decision tree visualization and model interpretation.
Jupyter Notebook
2,921
star
2

lolviz

A simple Python data-structure visualization tool for lists of lists, lists, dictionaries; primarily for use in Jupyter notebooks / presentations
Jupyter Notebook
823
star
3

tensor-sensor

The goal of this library is to generate more helpful exception messages for matrix algebra expressions for numpy, pytorch, jax, tensorflow, keras, fastai.
Jupyter Notebook
746
star
4

random-forest-importances

Code to compute permutation and drop-column importances in Python scikit-learn models
Jupyter Notebook
596
star
5

msds621

Course notes for MSDS621 at Univ of San Francisco, introduction to machine learning
Jupyter Notebook
346
star
6

simple-virtual-machine

A simple VM for a talk on building VMs
Java
207
star
7

simple-virtual-machine-C

Same as simple-virtual-machine but in C
C
136
star
8

msds692

MSAN692 Data Acquisition
HTML
125
star
9

msds501

Course notes for MSDS501, computational boot camp, at the University of San Francisco
Jupyter Notebook
123
star
10

cs652

University of San Francisco CS652 -- Programming Languages
Java
112
star
11

fundamentals-of-deep-learning

Course notes and notebooks to teach the fundamentals of how deep learning works; uses PyTorch.
Jupyter Notebook
73
star
12

msds689

Course syllabus, notes, projects for USF's MSDS689
Jupyter Notebook
64
star
13

stratx

stratx is a library for A Stratification Approach to Partial Dependence for Codependent Variables
TeX
62
star
14

ml-articles

Articles on machine learning
Jupyter Notebook
61
star
15

cs601

USF CS601 lecture notes and sample code
Java
54
star
16

msds593

MSDS593 -- Exploratory data analysis (EDA) at the University of San Francisco
Jupyter Notebook
25
star
17

website-explained.ai

The website content for explained.ai
Jupyter Notebook
23
star
18

msan501-old

USF MSAN501 lecture notes and sample code
TeX
21
star
19

mini-markdown

Parser for small subset of markdown
Java
20
star
20

cs345

CS345 Programming Languages at University of San Francisco
19
star
21

AniML-java

A Java implementation of random forest machine learning algorithm / classifier
Java
9
star
22

website-mlbook

Public repo to host website for public releases of mlbook html
HTML
8
star
23

bash-git-prompt

My own variation on the bash git prompt
Python
8
star
24

autodx

Simple automatic differentiation via operator overloading for educational purposes
TeX
7
star
25

data-acquisition

Data acquisition certificate (part of http://www.sfdatainstitute.org Course number CAS-DI-DAPY-001.
HTML
7
star
26

parrtlib

Parrt's Java library with useful functions
Java
6
star
27

gmdh

Experiment with GMDH polynomial computation-graph nodes
Python
5
star
28

msan501-starterkit

A starter kit with tests and skeleton code for the computational analytics boot camp, MSAN501, at the University of San Francisco.
Python
5
star
29

bild

A simple build utility written in Python, though I'll use to build java projects.
Python
5
star
30

c_unit

A C unit testing rig in the spirit of junit.
C
4
star
31

sample-jetbrains-plugin

A sample jetbrains plugin that uses ANTLR for lexing/parsing.
Java
4
star
32

java-neural-net

A simple neural network in java using particle swarm optimization.
Java
4
star
33

playdl

Playing with deep learning
Jupyter Notebook
3
star
34

antlr4-demo-simple-lang

Simple language grammar and listener for talk demos
Java
3
star
35

hash-duo

Explore building a hash table with two different hash functions that balances chain length
C++
3
star
36

selfnet

Playing with self-organizing deep learning neural networks
Jupyter Notebook
2
star
37

pltvid

A simple library to capture multiple matplotlib plots as a movie.
Jupyter Notebook
2
star
38

gpu-test

A test of OpenCL use on OS X, XCode. Simple vector squaring.
C
2
star
39

learn-git

1
star
40

gradle-antlr-plugin

The Official Gradle ANTLR plugin
1
star
41

cs601-webmail-skeleton

Some goodies to help start the CS601 webmail project
Java
1
star
42

cs601-webmail-st-skeleton

StringTemplate-based version of webmail skeleon
Java
1
star
43

inclass

1
star
44

foobar

1
star
45

website-book.explained.ai

HTML
1
star
46

demo

test for class
Java
1
star
47

website-faculty-parrt

My faculty web page
HTML
1
star
48

annotation-processor

Java
1
star