im2markup

A general-purpose, deep learning-based system to decompile an image into presentational markup. For example, we can infer the LaTeX or HTML source from a rendered image.

An example input is a rendered LaTeX formula (rendered image omitted here). The goal is to infer the LaTeX source that can render such an image:

 d s _ { 1 1 } ^ { 2 } = d x ^ { + } d x ^ { - } + l _ { p } ^ { 9 } \frac { p _ { - } } { r ^ { 7 } } \delta ( x ^ { - } ) d x ^ { - } d x ^ { - } + d x _ { 1 } ^ { 2 } + \; \cdots \; + d x _ { 9 } ^ { 2 } 

Our model employs a convolutional network for text and layout recognition in tandem with an attention-based neural machine translation system. The use of attention additionally provides an alignment from the generated markup to the original source image (alignment visualization omitted here).

See our website for a complete interactive version of this visualization over the test set. Our paper (http://arxiv.org/pdf/1609.04938v1.pdf) provides more technical details of this model.

What You Get Is What You See: A Visual Markup Decompiler  
Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush
http://arxiv.org/pdf/1609.04938v1.pdf

Prerequisites

Most of the code is written in Torch, with Python for preprocessing.

Torch

Model

The following Lua libraries are required for the main model.

  • tds
  • class
  • nn
  • nngraph
  • cunn
  • cudnn
  • cutorch

Note that currently only GPUs are supported, since we use cudnn in the CNN part.

Preprocess

Python

  • Pillow
  • numpy

Optional: We use Node.js and KaTeX for preprocessing.

pdflatex Installation

pdflatex is used for rendering LaTeX during evaluation.

ImageMagick convert Installation

convert is used for rendering LaTeX during evaluation.

Webkit2png Installation

Webkit2png is used for rendering HTML during evaluation.

Evaluate

Python image-based evaluation

  • python-Levenshtein
  • matplotlib
  • Distance

wget https://im2markup.yuntiandeng.com/third_party/Distance-0.1.3.tar.gz
tar zxf Distance-0.1.3.tar.gz
cd Distance-0.1.3; sudo python setup.py install

Perl Installation

Perl is used for evaluating the BLEU score.

Usage

We assume that the working directory is im2markup throughout this document. The task is to convert an image into its presentational markup, so we need to specify a data_base_dir storing the images and a label_path storing all labels (e.g., LaTeX formulas). We also need to specify a data_path for the training (or test) data samples. The format of data_path should look like:

<img_name1> <label_idx1>
<img_name2> <label_idx2>
<img_name3> <label_idx3>
...

where <label_idx> denotes the line index of the label (starting from 0).
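
For concreteness, the pairing of images with labels can be sketched in a few lines of Python (a hypothetical helper; the function name and example paths are illustrative, not part of the repo's scripts):

def load_pairs(data_path, label_path):
    # One label per line; <label_idx> is a 0-based index into this list.
    with open(label_path) as f:
        labels = [line.rstrip("\n") for line in f]
    pairs = []
    with open(data_path) as f:
        for line in f:
            img_name, label_idx = line.split()
            pairs.append((img_name, labels[int(label_idx)]))
    return pairs

# e.g., load_pairs("data/sample/train.lst", "data/sample/formulas.lst")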

Quick Start (Math-to-LaTeX Toy Example)

To get started, we provide a toy Math-to-LaTeX example. We have a larger dataset, im2latex-100k-dataset, of the same format but with many more samples; a processed version can be found at https://im2markup.yuntiandeng.com/data/ (if you use this version, you can skip the preprocessing commands below).

Preprocess

The images in the dataset contain a LaTeX formula rendered on a full page. To accelerate training, we need to preprocess the images.

python scripts/preprocessing/preprocess_images.py --input-dir data/sample/images --output-dir data/sample/images_processed

The above command will crop the formula area, and group images of similar sizes to facilitate batching.
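
The cropping step can be pictured with Pillow and numpy (an illustrative sketch of the idea only, not the project's preprocess_images.py; the file names are placeholders):

from PIL import Image
import numpy as np

# Crop to the bounding box of non-white pixels (assumes at least one such pixel).
img = np.asarray(Image.open("page.png").convert("L"))
rows = np.where((img < 255).any(axis=1))[0]
cols = np.where((img < 255).any(axis=0))[0]
cropped = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
Image.fromarray(cropped).save("cropped.png")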

Next, the LaTeX formulas need to be tokenized or normalized.

python scripts/preprocessing/preprocess_formulas.py --mode normalize --input-file data/sample/formulas.lst --output-file data/sample/formulas.norm.lst

The above command will normalize the formulas. Note that this command will produce some error messages since some formulas cannot be parsed by the KaTeX parser.

Then we need to prepare the train, validation, and test files. We exclude large images from the training and validation sets, and we also ignore formulas with too many tokens or with grammar errors.

python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/train.lst --output-path data/sample/train_filter.lst 
python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/validate.lst --output-path data/sample/validate_filter.lst 
python scripts/preprocessing/preprocess_filter.py --no-filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/test.lst --output-path data/sample/test_filter.lst 

Finally, we generate the vocabulary from the training set. All tokens occurring no more than once are excluded from the vocabulary.

python scripts/preprocessing/generate_latex_vocab.py --data-path data/sample/train_filter.lst --label-path data/sample/formulas.norm.lst --output-file data/sample/latex_vocab.txt
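
The thresholding rule amounts to the following (a simplified stand-in for generate_latex_vocab.py, using the toy example's paths):

from collections import Counter

with open("data/sample/formulas.norm.lst") as f:
    formulas = f.read().splitlines()

# Count whitespace-separated tokens over the training formulas only.
counts = Counter()
with open("data/sample/train_filter.lst") as f:
    for line in f:
        img_name, label_idx = line.split()
        counts.update(formulas[int(label_idx)].split())

# Keep tokens seen at least twice, mirroring the threshold above.
vocab = sorted(tok for tok, c in counts.items() if c > 1)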

Train

For a complete set of parameters, run

th src/train.lua -h

The most important parameters for training are data_base_dir, which specifies where the images live; data_path, the training file; label_path, the LaTeX formulas; val_data_path, the validation file; and vocab_file, the vocabulary file with one token per line.

th src/train.lua -phase train -gpu_id 1 \
-model_dir model \
-input_feed -prealloc \
-data_base_dir data/sample/images_processed/ \
-data_path data/sample/train_filter.lst \
-val_data_path data/sample/validate_filter.lst \
-label_path data/sample/formulas.norm.lst \
-vocab_file data/sample/latex_vocab.txt \
-max_num_tokens 150 -max_image_width 500 -max_image_height 160 \
-batch_size 20 -beam_size 1

In the default setting, the log is written to log.txt. The log file records the training and validation perplexities. model_dir specifies where the models should be saved. The default parameters are optimized for the full dataset. In order to overfit on this toy example, use the flags -learning_rate 0.05, -lr_decay 1.0, and -num_epochs 30; after 30 epochs, the training perplexity can reach around 1.1, while the validation perplexity only reaches around 17.
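
Perplexity here is the standard per-token measure, the exponentiated average negative log-likelihood of the reference tokens:

\mathrm{ppl} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid y_{<i}, x) \right)

where x is the input image and y_1, ..., y_N are the markup tokens.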

Test

After training, we can load a model and use it to decode the test set. We provide a model trained on the im2latex-100k-dataset.

mkdir -p model/latex; wget -P model/latex/ https://im2markup.yuntiandeng.com/model/latex/final-model

Now we can load the model and test on the test set. Note that in order to output the predictions, the flag -visualize must be set.

th src/train.lua -phase test -gpu_id 1 -load_model -model_dir model/latex -visualize \
-data_base_dir data/sample/images_processed/ \
-data_path data/sample/test_filter.lst \
-label_path data/sample/formulas.norm.lst \
-output_dir results \
-max_num_tokens 500 -max_image_width 800 -max_image_height 800 \
-batch_size 5 -beam_size 5 

Note that we do not specify a vocabulary file here, since it is already included in the model. After a while, the perplexities will be logged, and the predictions file results.txt will be written to output_dir. The format of the predictions file is:

<img_name1>\t<label_gold1>\t<label_pred1>\t<score_pred1>\t<score_gold1>
<img_name2>\t<label_gold2>\t<label_pred2>\t<score_pred2>\t<score_gold2>
...

where \t denotes tab.
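
For downstream analysis, the file can be parsed in a few lines (a minimal sketch that follows the format above; the path assumes the -output_dir results setting used earlier):

# Parse results.txt into a dict keyed by image name.
results = {}
with open("results/results.txt") as f:
    for line in f:
        img_name, gold, pred, score_pred, score_gold = line.rstrip("\n").split("\t")
        results[img_name] = (gold, pred, float(score_pred), float(score_gold))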

Evaluate

Text Metrics

The test perplexity can be obtained after testing is finished. To evaluate BLEU, run the following command.

python scripts/evaluation/evaluate_bleu.py --result-path results/results.txt --data-path data/sample/test_filter.lst --label-path data/sample/formulas.norm.lst

Note that although the predictions file contains the gold labels, some images (e.g., those that are too large) are ignored during testing. To make the comparison fair, we use the test file again and treat entries that do not appear in the predictions file as blank predictions.
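
The fallback rule can be sketched as follows (illustrative only; paths match the toy example):

# Every test image gets a prediction, defaulting to blank if testing skipped it.
preds = {}
with open("results/results.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        preds[fields[0]] = fields[2]  # img_name -> predicted markup

with open("data/sample/test_filter.lst") as f:
    test_images = [line.split()[0] for line in f]

predictions = [preds.get(name, "") for name in test_images]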

We also provide a script for evaluating text edit distance (at the token level).

python scripts/evaluation/evaluate_text_edit_distance.py --result-path results/results.txt
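
Token level means the edit distance is computed over token sequences rather than characters. With the Distance package installed above, the metric is roughly (a hedged sketch, not the project's script):

import distance  # the Distance package installed earlier

gold = "d s _ { 1 1 } ^ { 2 }".split()
pred = "d s _ { 1 2 } ^ { 2 }".split()
# Levenshtein distance over token lists, normalized by reference length.
err = distance.levenshtein(gold, pred) / max(len(gold), 1)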

Image Metrics

First, we need to render both the original formulas and the predicted formulas. This may take a while. The rendering process depends on pdflatex and ImageMagick convert.

python scripts/evaluation/render_latex.py --result-path results/results.txt --data-path data/sample/test_filter.lst --label-path data/sample/formulas.lst --output-dir data/sample/images_rendered --no-replace

Afterwards, based on the rendered images, we evaluate the exact match accuracy and the image edit distance.

python scripts/evaluation/evaluate_image.py --images-dir data/sample/images_rendered/

Web page-to-HTML

Another example is to infer the HTML source from an image of a web page. We provide a simplified dataset of web pages of size 100×100 (in the provided dataset, however, they are downsampled to 64×64). Note that we can use the same model parameters as in the Math-to-LaTeX task; the only difference here is the vocabulary.

First, download the dataset.

wget -P data/ https://im2markup.yuntiandeng.com/html/data/html_64_64_100k.tgz
cd data; tar zxf html_64_64_100k.tgz; cd ..

Train

The training parameters are nearly identical to the Math-to-LaTeX task. However, some parameters, such as max_image_width, need to be set to different values for memory efficiency and convergence speed.

th src/train.lua -phase train -gpu_id 1 \
-input_feed -prealloc \
-data_base_dir data/html_64_64_100k/data/ \
-data_path data/html_64_64_100k/html_train.txt \
-vocab_file data/html_64_64_100k/html_vocab.txt \
-val_data_path data/html_64_64_100k/html_dev.txt \
-label_path data/html_64_64_100k/html_sources.txt \
-max_num_tokens 100 -max_image_width 64 -max_image_height 64 \
-batch_size 100 -beam_size 1

Test

We provide a trained model as well.

mkdir -p model/html; wget -P model/html/ https://im2markup.yuntiandeng.com/model/html/final-model

Now we are ready to test our model.

th src/train.lua -phase test -gpu_id 1 -load_model -model_dir model/html -visualize \
-data_base_dir data/html_64_64_100k/data/ \
-data_path data/html_64_64_100k/html_test.txt \
-label_path data/html_64_64_100k/html_sources.txt \
-output_dir results \
-max_num_tokens 400 -max_image_width 64 -max_image_height 64 \
-batch_size 80 -beam_size 5 

Evaluate

Text Metrics

The test perplexity can be obtained after testing is finished. To evaluate text edit distance, run the following command.

python scripts/evaluation/evaluate_text_edit_distance.py --result-path results/results.txt

Image Metrics

First, we need to render both the original HTML and the predicted HTML. This may take a while. Note that rendering is based on webkit2png.

python scripts/evaluation/render_html.py --result-path results/results.txt --output-dir data/html_64_64_100k/images_rendered --no-replace

Afterwards, based on the rendered images, we evaluate the exact match accuracy and the image edit distance.

python scripts/evaluation/evaluate_image.py --images-dir data/html_64_64_100k/images_rendered/

Acknowledgements

Our implementation is based on the HarvardNLP NMT implementation seq2seq-attn. We would like to thank Yoon Kim and Allen Schmaltz for helpful discussions during this project.
