Turn your documents into data!
Français | Portuguese | Spanish | 中文
Parsr is a minimal-footprint toolchain for cleaning, parsing and extracting data from documents (image, pdf, docx, eml). It generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.
It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analysis automation to archival and many others.
Currently, Parsr can perform: document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, table of contents, page numbers, headers/footers, links, and others. Check out all the features.
Table of Contents
Getting Started
Installation
-- The advanced installation guide is available here --
The quickest way to install and run the Parsr API is through the docker image:
docker pull axarev/parsr
If you also wish to install the GUI for sending documents and visualising results:
docker pull axarev/parsr-ui-localhost
Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.
Usage
-- The advanced usage guide is available here --
To run the API, issue:
docker run -p 3001:3001 axarev/parsr
which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
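Before sending documents, it can help to confirm that something is actually listening on the published port. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical and not part of Parsr or its Python client. It only checks that a TCP connection succeeds, so it does not prove that the listener is Parsr itself.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not part of Parsr.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # 3001 is the host port published by the `docker run` command above.
    if is_port_open("localhost", 3001):
        print("A service is listening on localhost:3001")
    else:
        print("Nothing is listening on localhost:3001 yet")
```

The same check could be pointed at any other port published with `-p`, for example one chosen to avoid a conflict on the host.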
To install the Python client for the Parsr API, issue:
pip install parsr-client
To try a sample Jupyter Notebook that uses the Python client, head over to the Jupyter demo.
To use the GUI tool (the API needs to already be running), issue:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
Then, access it through http://localhost:8080.
Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.
API-based usage and command-line usage are documented in the advanced usage guide.
Documentation
All documentation files can be found here.
Contribute
Please refer to the contribution guidelines.
Third Party Licenses
Parsr depends on the following third-party libraries, listed with their licenses:
- QPDF: Apache http://qpdf.sourceforge.net
- ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
- Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
- PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
- Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
- Camelot: MIT https://github.com/camelot-dev/camelot
- MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
- Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc
License
Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).