  • Language
    Python
  • License
    GNU General Publi...
  • Created 11 months ago
  • Updated 5 months ago

Math OCR model that outputs LaTeX and markdown

Texify

Texify is an OCR model that converts images or pdfs containing math into markdown and LaTeX that can be rendered by MathJax ($$ and $ are delimiters). It can run on CPU, GPU, or MPS.

demo.mp4

Texify can work with block equations, or equations mixed with text (inline). It will convert both the equations and the text.
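Because the output mixes text with `$...$` and `$$...$$` math, it can be handy to pull the equations out afterwards. The following is a minimal sketch of one way to do that with regular expressions; `split_math` and the two patterns are illustrative names, not part of texify, and the sketch does not handle escaped `\$` characters.

```python
import re

# Block math uses $$...$$; inline math uses single $...$ delimiters.
BLOCK_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE_RE = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)", re.DOTALL)


def split_math(markdown: str) -> dict:
    """Return the block and inline equations found in a markdown string."""
    blocks = [m.strip() for m in BLOCK_RE.findall(markdown)]
    # Remove block math first so its delimiters are not mistaken for inline math.
    without_blocks = BLOCK_RE.sub(" ", markdown)
    inline = [m.strip() for m in INLINE_RE.findall(without_blocks)]
    return {"block": blocks, "inline": inline}


sample = (
    r"The potential $V_i$ satisfies "
    r"$$V_i = \sum_j Q_{ij} \sigma_j,$$ "
    r"where $\sigma_j$ is constant."
)
parts = split_math(sample)
print(parts["block"])
print(parts["inline"])
```

The block pass runs first on purpose: once the `$$...$$` spans are removed, every remaining `$` can be treated as an inline delimiter.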

The closest open source comparisons to texify are pix2tex and nougat, although they're designed for different purposes:

  • Pix2tex is designed only for block LaTeX equations, and hallucinates more on text.
  • Nougat is designed to OCR entire pages, and hallucinates more on small images only containing math.

Pix2tex is trained on im2latex, and nougat is trained on arxiv. Texify is trained on a more diverse set of web data, and works on a range of images.

See more details in the benchmarks section.

Community

Discord is where we discuss future development.

Examples

Note: I added spaces after _ symbols and removed \, because GitHub math formatting is broken.

Example 0

Detected Text The potential $V_ i$ of cell $\mathcal{C}_ i$ centred at position $\mathbf{r}_ i$ is related to the surface charge densities $\sigma_ j$ of cells $\mathcal{C}_ j$ $j\in[1,N]$ through the superposition principle as: $$V_ i = \sum_ {j=0}^{N} \frac{\sigma_ j}{4\pi\varepsilon_ 0} \int_ {\mathcal{C}_ j} \frac{1}{|\mathbf{r}_ i-\mathbf{r}'|} \mathrm{d}^2\mathbf{r}' = \sum_{j=0}^{N} Q_ {ij} \sigma_ j,$$ where the integral over the surface of cell $\mathcal{C}_ j$ only depends on $\mathcal{C}_ j$ shape and on the relative position of the target point $\mathbf{r}_ i$ with respect to $\mathcal{C}_ j$ location, as $\sigma_ j$ is assumed constant over the whole surface of cell $\mathcal{C}_ j$.

| Image | OCR Markdown |
|-------|--------------|
| 1     | 1            |
| 2     | 2            |
| 3     | 3            |

Installation

You'll need Python 3.9+ and PyTorch. If you're not using a Mac or a GPU machine, you may need to install the CPU version of torch first. See here for more details.

Install with:

`pip install texify`

Model weights will automatically download the first time you run it.

Usage

  • Inspect the settings in texify/settings.py. You can override any settings with environment variables.
  • Your torch device will be automatically detected, but you can override this. For example, TORCH_DEVICE=cuda or TORCH_DEVICE=mps.
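As a rough sketch of the environment-variable override described above, the helper below builds an environment mapping with the overrides applied. `texify_env` is a hypothetical helper, the `0.1` temperature is only an illustrative value, and it assumes each environment variable has the same name as the setting it overrides (as with `TORCH_DEVICE`).

```python
import os


def texify_env(device=None, temperature=None, base_env=None):
    """Return a copy of the environment with texify setting overrides applied."""
    env = dict(os.environ if base_env is None else base_env)
    if device is not None:
        env["TORCH_DEVICE"] = device  # e.g. "cuda" or "mps"
    if temperature is not None:
        # Environment variables are always strings.
        env["TEMPERATURE"] = str(temperature)
    return env


env = texify_env(device="mps", temperature=0.1, base_env={"PATH": "/usr/bin"})
print(env)

# To launch the CLI with these overrides (requires texify to be installed):
#   import subprocess
#   subprocess.run(["texify", "/path/to/image.png"], env=texify_env(device="cuda"))
```

Passing the overrides to a subprocess keeps them scoped to that one run instead of changing the environment of the calling program.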

Usage tips

  • Don't make your boxes too small or too large. See the examples and the video above for good crops.
  • Texify is sensitive to how you draw the box around the text you want to OCR. If you get bad results, try selecting a slightly different box, or splitting the box into 2+. You can also try changing the TEMPERATURE setting.
  • Sometimes, KaTeX won't be able to render an equation (red error), but it will still be valid LaTeX. You can copy the LaTeX and render it elsewhere.
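If you are cropping programmatically rather than in the app, the "split the box into 2+" tip can be sketched as below. `split_box` is a hypothetical helper that cuts a `(left, top, right, bottom)` box into equal strips; in practice you would pick the cut so that it falls between equations rather than at a fixed interval.

```python
def split_box(box, parts=2, axis="rows"):
    """Split a (left, top, right, bottom) box into equal strips.

    axis="rows" stacks the strips vertically (useful for several equations
    on top of each other); axis="columns" places them side by side.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    left, top, right, bottom = box
    if axis == "rows":
        height = bottom - top
        edges = [top + round(i * height / parts) for i in range(parts + 1)]
        return [(left, edges[i], right, edges[i + 1]) for i in range(parts)]
    if axis == "columns":
        width = right - left
        edges = [left + round(i * width / parts) for i in range(parts + 1)]
        return [(edges[i], top, edges[i + 1], bottom) for i in range(parts)]
    raise ValueError("axis must be 'rows' or 'columns'")


boxes = split_box((10, 20, 410, 220), parts=2)
print(boxes)
```

Each resulting tuple is in the box format that Pillow's `Image.crop` accepts, so `img.crop(boxes[0])` would give the first strip.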

App for interactive conversion

I've included a streamlit app that lets you interactively select and convert equations from images or PDF files. Run it with:

texify_gui

The app will allow you to select the specific equations you want to convert on each page, then render the results with KaTeX and enable easy copying.

Convert images

You can OCR a single image or a folder of images with:

texify /path/to/folder_or_file --max 8 --json_path results.json
  • --max is how many images in the folder to convert at most. Omit this to convert all images in the folder.
  • --json_path is an optional path to a json file where the results will be saved. If you omit this, the results will be saved to data/results.json.
  • --katex_compatible will make the output more compatible with KaTeX.
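If you call the CLI from a script, the flags above can be assembled as in this minimal sketch. `build_texify_command` is a hypothetical helper, and it assumes `--katex_compatible` is a plain on/off flag that takes no value.

```python
import shlex


def build_texify_command(path, max_images=None, json_path=None, katex_compatible=False):
    """Build the argument list for the texify CLI from the documented flags."""
    cmd = ["texify", str(path)]
    if max_images is not None:
        cmd += ["--max", str(max_images)]
    if json_path is not None:
        cmd += ["--json_path", str(json_path)]
    if katex_compatible:
        cmd.append("--katex_compatible")
    return cmd


cmd = build_texify_command("/path/to/folder_or_file", max_images=8, json_path="results.json")
print(shlex.join(cmd))

# To actually run it (requires texify to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems when a path contains spaces.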

Import and run

You can import texify and run it in python code:

from texify.inference import batch_inference
from texify.model.model import load_model
from texify.model.processor import load_processor
from PIL import Image

model = load_model()
processor = load_processor()
img = Image.open("test.png") # Your image name here
results = batch_inference([img], model, processor)

See texify/output.py:replace_katex_invalid if you want to make the output more compatible with KaTeX.
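To process a folder of images from Python, the same calls can be wrapped in a small batching loop. This is a sketch only: `chunked` and `ocr_folder` are hypothetical helpers, the batch size of 4 is arbitrary, and it assumes `batch_inference` returns one result per input image in the same order.

```python
from pathlib import Path


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def ocr_folder(folder, batch_size=4, pattern="*.png"):
    """Run texify over every matching image in `folder`, a batch at a time."""
    # Imported here so that chunked() can be used without texify installed.
    from PIL import Image
    from texify.inference import batch_inference
    from texify.model.model import load_model
    from texify.model.processor import load_processor

    model = load_model()
    processor = load_processor()
    results = {}
    paths = sorted(Path(folder).glob(pattern))
    for batch in chunked(paths, batch_size):
        images = [Image.open(p) for p in batch]
        outputs = batch_inference(images, model, processor)
        # Assumption: outputs line up one-to-one with the input images.
        for path, text in zip(batch, outputs):
            results[path.name] = text
    return results


print(chunked(list(range(5)), 2))
```

Loading the model and processor once, outside the loop, matters because both are expensive to create.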

Manual install

If you want to develop texify, you can install it manually:

  • git clone https://github.com/VikParuchuri/texify.git
  • cd texify
  • poetry install # Installs main and dev dependencies

Limitations

OCR is complicated, and texify is not perfect. Here are some known limitations:

  • The OCR is dependent on how you crop the image. If you get bad results, try a different selection/crop. Or try changing the TEMPERATURE setting.
  • Texify will OCR equations and surrounding text, but is not good for general purpose OCR. Think sections of a page instead of a whole page.
  • Texify was mostly trained with 96 DPI images, and only at a max 420x420 resolution. Very wide or very tall images may not work well.
  • It works best with English, although it should support other languages with similar character sets.
  • The output format will be markdown with embedded LaTeX for equations (close to Github flavored markdown). It will not be pure LaTeX.
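To get a feel for the 420x420 limit mentioned above, this sketch computes the size an image would have if it were scaled down to fit that bound while keeping its aspect ratio. `fit_within` is a hypothetical helper used only for reasoning about crop sizes; texify does its own image preprocessing, and whether pre-scaling helps is not something the notes above claim.

```python
def fit_within(width, height, max_side=420):
    """Return (width, height) scaled down to fit in a max_side x max_side box.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(840, 420))
print(fit_within(300, 200))
print(fit_within(1000, 50))
```

The last call shows why very wide crops are a problem: a 1000x50 strip ends up only 21 pixels tall, which leaves little detail for subscripts and superscripts.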

Benchmarks

Benchmarking OCR quality is hard - you ideally need a parallel corpus that models haven't been trained on. I sampled from arxiv and im2latex to create the benchmark set.

Benchmark results

Each model is trained on one of the benchmark tasks:

  • Nougat was trained on arxiv, possibly the images in the benchmark.
  • Pix2tex was trained on im2latex.
  • Texify was trained on im2latex. It was trained on arxiv, but not the images in the benchmark.

Although this makes the benchmark results biased, it does seem like a good compromise, since nougat and pix2tex don't work as well out of domain. Note that neither pix2tex nor nougat is really designed for this task (OCR of inline equations and text), so this is not a perfect comparison.

| Model   | BLEU ⬆   | METEOR ⬆ | Edit Distance ⬇ |
|---------|----------|----------|-----------------|
| pix2tex | 0.382659 | 0.543363 | 0.352533        |
| nougat  | 0.697667 | 0.668331 | 0.288159        |
| texify  | 0.842349 | 0.885731 | 0.0651534       |
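For readers unfamiliar with the edit distance column, here is a minimal sketch of one common definition: the Levenshtein distance divided by the length of the longer string, so 0 means identical and 1 means completely different. The exact normalization the benchmark script uses is not stated here, so treat this as an illustration rather than a reproduction of those numbers.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction, reference):
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


print(normalized_edit_distance("x^2", "x^{2}"))
```

Note that `x^2` and `x^{2}` render identically but still score a nonzero distance, which is one reason string metrics only approximate OCR quality for LaTeX.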

Running your own benchmarks

You can benchmark the performance of texify on your machine.

  • Follow the manual install instructions above.
  • If you want to use pix2tex, run pip install pix2tex
  • If you want to use nougat, run pip install nougat-ocr
  • Download the benchmark data here and put it in the data folder.
  • Run benchmark.py like this:
python benchmark.py --max 100 --pix2tex --nougat --data_path data/bench_data.json --result_path data/bench_results.json

This will benchmark texify against pix2tex and nougat. It will do batch inference with texify and nougat, but not with pix2tex, since I couldn't find an option for batching.

  • --max is how many benchmark images to convert at most.
  • --data_path is the path to the benchmark data. If you omit this, it will use the default path.
  • --result_path is the path to the benchmark results. If you omit this, it will use the default path.
  • --pix2tex specifies whether to run pix2tex (Latex-OCR) or not.
  • --nougat specifies whether to run nougat or not.

Training

Texify was trained on LaTeX images and paired equations from across the web. The training data includes the im2latex dataset. Training happened on 4x A6000s for 2 days (~6 epochs).

Commercial usage

This model is trained on top of the openly licensed Donut model, and thus can be used for commercial purposes. Model weights are licensed under the CC BY-SA 4.0 license.

Thanks

This work would not have been possible without lots of amazing open source work. I particularly want to acknowledge Lukas Blecher, whose work on Nougat and pix2tex was key for this project. I learned a lot from his code, and used parts of it for texify.

  • im2latex - one of the datasets used for training
  • Donut from Naver, the base model for texify
  • Nougat - I used the tokenizer from Nougat
  • Latex-OCR - The original open source Latex OCR project
