• Stars
    star
    140
  • Rank 261,473 (Top 6 %)
  • Language
    C++
  • License
    GNU General Publi...
  • Created almost 8 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.

pdftojson

using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.

Compile

./configure
make

On MacOS, you might need to specify libpng and libfreetype locations, e.g.

./configure --with-libpng-library=/usr/local/Cellar/libpng/1.6.16/lib/  --with-libpng-includes=/usr/local/Cellar/libpng/1.6.16/include/ --with-freetype2-library=/usr/local/lib/ --with-freetype2-includes=/usr/local/include/freetype2/

You will find pdftojson inside the directory xpdf/pdftojson

Usage

pdftojson <input.pdf> <output.json>

File format

The JSON produced looks like: [ { "pages":14, "number":1, "width":612, "height":792, "text":[ [115,162,41,14,0,"What "], ... ] }, { "pages":14, "number":2, "width":612, "height":792, "text":[ [115,162,41,14,0,"Here "], ... ] }, ... ];

For each page, the text array contains: [top,left,width,height,0,text]