• Stars
    star
    15,391
  • Rank 1,893 (Top 0.04 %)
  • Language
    Python
  • License
    GNU General Publi...
  • Created about 1 year ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Convert PDF to markdown quickly with high accuracy

Marker

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

  • Support for a range of PDF documents (optimized for books and scientific papers)
  • Removes headers/footers/other artifacts
  • Converts most equations to latex
  • Formats code blocks and tables
  • Support for multiple languages (although most testing is done in English). See settings.py for a language list.
  • Works on GPU, CPU, or MPS

How it works

Marker is a pipeline of deep learning models:

Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents. In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.

Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.

Examples

PDF Type Marker Nougat
Think Python Textbook View View
Think OS Textbook View View
Switch Transformers arXiv paper View View
Multi-column CNN arXiv paper View View

Performance

Benchmark overall

The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

  • Marker will convert fewer equations to latex than nougat. This is because it has to first detect equations, then convert them without hallucation.
  • Whitespace and indentations are not always respected.
  • Not all lines/spans will be joined properly.
  • Only languages similar to English (Spanish, French, German, Russian, etc) are supported. Languages with different character sets (Chinese, Japanese, Korean, etc) are not.
  • This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

Installation

This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and poetry.

First, clone the repo:

  • git clone https://github.com/VikParuchuri/marker.git
  • cd marker

Linux

  • Install system requirements
    • Optional: Install tesseract 5 by following these instructions or running scripts/install/tesseract_5_install.sh.
    • Install ghostscript > 9.55 by following these instructions or running scripts/install/ghostscript_install.sh.
    • Install other requirements with cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y
  • Set the tesseract data folder path
    • Find the tesseract data folder tessdata with find / -name tessdata. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
    • Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
  • Install python requirements
    • poetry install
    • poetry shell to activate your poetry venv
  • Update pytorch since poetry doesn't play nicely with it
    • GPU only: run pip install torch to install other torch dependencies.
    • CPU only: Uninstall torch with poetry remove torch, then follow the CPU install instructions.

Mac

  • Install system requirements from scripts/install/brew-requirements.txt
  • Set the tesseract data folder path
    • Find the tesseract data folder tessdata with brew list tesseract
    • Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it
  • Install python requirements
    • poetry install
    • poetry shell to activate your poetry venv

Usage

First, some configuration:

  • Set your torch device in the local.env file. For example, TORCH_DEVICE=cuda or TORCH_DEVICE=mps. cpu is the default.
    • If using GPU, set INFERENCE_RAM to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set INFERENCE_RAM=16.
    • Depending on your document types, marker's average memory usage per task can vary slightly. You can configure VRAM_PER_TASK to adjust this if you notice tasks failing with GPU out of memory errors.
  • Inspect the other settings in marker/settings.py. You can override any settings in the local.env file, or by setting environment variables.
    • By default, the final editor model is off. Turn it on with ENABLE_EDITOR_MODEL.
    • By default, marker will use ocrmypdf for OCR, which is slower than base tesseract, but higher quality. You can change this with the OCR_ENGINE setting.

Convert a single file

Run convert_single.py, like this:

python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10
  • --parallel_factor is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
  • --max_pages is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the DEFAULT_LANG setting is set appropriately for your document.

Convert multiple files

Run convert.py, like this:

python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
  • --workers is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK if you're using GPU.
  • --max is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
  • --metadata_file is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, DEFAULT_LANG will be used. The format is:
  • --min_length is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"},
  ...
}

Convert multiple files on multiple GPUs

Run chunk_convert.sh, like this:

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
  • METADATA_FILE is an optional path to a json file with metadata about the pdfs. See above for the format.
  • NUM_DEVICES is the number of GPUs to use. Should be 2 or greater.
  • NUM_WORKERS is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK.
  • MIN_LENGTH is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.

Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

Method Average Score Time per page Time per document
naive 0.350727 0.00152378 0.326524
marker 0.641062 0.360622 77.2762
nougat 0.629211 3.77259 808.413

Accuracy

First 3 are non-arXiv books, last 3 are arXiv papers.

Method switch_trans.pdf crowd.pdf multicolcnn.pdf thinkos.pdf thinkdsp.pdf thinkpython.pdf
naive 0.244114 0.140669 0.0868221 0.366856 0.412521 0.468281
marker 0.482091 0.466882 0.537062 0.754347 0.78825 0.779536
nougat 0.696458 0.552337 0.735099 0.655002 0.645704 0.650282

Peak GPU memory usage during the benchmark is 3.3GB for nougat, and 3.1GB for marker. Benchmarks were run on an A6000.

Throughput

Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.

Benchmark results

Running your own benchmarks

You can benchmark the performance of marker on your machine. First, download the benchmark data here and unzip.

Then run benchmark.py like this:

python benchmark.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Commercial usage

Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.

I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at [email protected].

Here are the non-commercial/restrictive dependencies:

Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

  • Nougat from Meta
  • Layoutlmv3 from Microsoft
  • DocLayNet from IBM
  • ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!

More Repositories

1

surya

OCR, layout analysis, reading order, line detection in 90+ languages
Python
9,453
star
2

apartment-finder

A Slack bot that helps you find an apartment.
Python
1,061
star
3

zero_to_gpt

Go from no deep learning knowledge to implementing GPT.
Jupyter Notebook
940
star
4

texify

Math OCR model that outputs LaTeX and markdown
Python
673
star
5

textbook_quality

Generate textbook-quality synthetic LLM pretraining data
Python
467
star
6

pdftext

Extract structured text from pdfs quickly
Python
261
star
7

libgen_to_txt

Convert all of libgen to high quality markdown
Python
235
star
8

scribe

Simple speech recognition using your microphone.
Python
123
star
9

researcher

Concise answers to search queries using Google and GPT-3. Includes citations.
Python
72
star
10

scan

Score essays automatically with an easy web interface.
Python
41
star
11

evolve-music2

Evolve music automatically with python -- rewrite of evolve-music.
Python
40
star
12

classified

Score LLM pretraining data with classifiers
Python
38
star
13

evolve-music

Superseded by github.com/vikparuchuri/evolve-music2 -- use that instead.
C
25
star
14

simpsons-scripts

Find out how much the simpsons characters like each other with text and audio analysis.
Python
24
star
15

movide

The student-centric learning platform.
Python
18
star
16

snapcheck

Find out if your info was leaked.
Python
15
star
17

political-positions

Analyze politics.
Python
14
star
18

vikparuchuri.com

Code for vikparuchuri.com -- personal blog.
Ruby
13
star
19

boston-python-ml

Text scoring/classification presentation
JavaScript
9
star
20

percept

A modular machine learning framework that is easy to test and deploy.
Python
9
star
21

wp-deployment

Deploy wordpress with multisite to ec2 with ansible.
Python
7
star
22

spotify-export

Export albums from Spotify into Google Play Music.
Python
7
star
23

pdf_to_md

Python
6
star
24

algorithms

Pure python implementations of various algorithms, including a matrix class.
Python
6
star
25

triton_tutorial

Tutorials for Triton, a language for writing gpu kernels
Jupyter Notebook
5
star
26

vikparuchuri-affirm

CSS
5
star
27

ds-webinar

How to learn data science webinar presentation
CSS
5
star
28

nyt-articles

Get articles from new york times API.
Python
5
star
29

ml-math

Svelte
3
star
30

TulaLensSurvey

Android app that makes it easy to survey people.
Java
3
star
31

medicare-analysis

Analyze medicare data from the recent release.
CSS
3
star
32

sports-stats

Try to rethink sports statistics.
Python
3
star
33

bostonpython2015

Presentation for boston python 2015
CSS
2
star
34

dscontent-starter

2
star
35

Presentations

JavaScript
1
star
36

vik-blog

HTML
1
star
37

tulalens-survey-web

Web component of android survey app.
Ruby
1
star
38

nextml-talk

CSS
1
star
39

vj-wedding2

A site I made for a wedding.
JavaScript
1
star
40

matter

Chrome extension that highlights important passages.
JavaScript
1
star
41

vj-wedding

Placeholder site for a wedding (with countdown)
JavaScript
1
star
42

affirm-themes

Themes for affirm.io.
CSS
1
star
43

openphi

1
star