  • Stars: 2,230
  • Rank 20,504 (Top 0.5 %)
  • Language: Python
  • License: MIT License
  • Created about 6 years ago
  • Updated about 3 years ago

Repository Details

Deep neural network to extract intelligent information from invoice documents.

TL;DR

  • An easy to use UI to view PDF/JPG/PNG invoices and extract information.
  • Train custom models using the Trainer UI on your own dataset.
  • Add or remove invoice fields as per your convenience.
  • Save the extracted information into your system with the click of a button.

We appreciate your star, it helps!

The InvoiceNet logo was designed by Sidhant Tibrewal. Check out his work for some more beautiful designs.


DISCLAIMER:

Pre-trained models for some general invoice fields are not available right now but will soon be provided. The training GUI and data preparation scripts have been made available.

Invoice documents contain sensitive information, which has made collecting a sizable dataset difficult. This in turn makes it hard for developers like us to train large-scale generalised models and make them available to the community.

If you have a dataset of invoice documents that you are comfortable sharing with us, please reach out ([email protected]). We have the tools to create the first publicly-available large-scale invoice dataset along with a software platform for structured information extraction.


Installation

Ubuntu 18.04 and 20.04

To install InvoiceNet on Ubuntu, run the following commands:

git clone https://github.com/naiveHobo/InvoiceNet.git
cd InvoiceNet/

# Run installation script
./install.sh

The install.sh script will install all the dependencies, create a virtual environment, and install InvoiceNet in the virtual environment.

To use InvoiceNet, you need to source the virtual environment in which the package was installed.

# Source virtual environment
source env/bin/activate

Windows 10

The recommended way is to install InvoiceNet along with its dependencies in an Anaconda environment:

git clone https://github.com/naiveHobo/InvoiceNet.git
cd InvoiceNet/

# Create conda environment and activate
conda create --name invoicenet python=3.7
conda activate invoicenet

# Install InvoiceNet
pip install .

# Install poppler
conda install -c conda-forge poppler

Some dependencies also need to be installed separately on Windows 10 before running InvoiceNet:

Data Preparation

The training data must be arranged in a single directory. The invoice documents are expected to be PDF files, and each invoice is expected to have a corresponding JSON label file with the same name. Your training data should be in the following format:

train_data/
    invoice1.pdf
    invoice1.json
    nike-invoice.pdf
    nike-invoice.json
    12345.pdf
    12345.json
    ...

The JSON labels should have the following format:

{
 "vendor_name":"Nike",
 "invoice_date":"12-01-2017",
 "invoice_number":"R0007546449",
 "total_amount":"137.51",
 ... other fields
}

To begin the data preparation process, click on the "Prepare Data" button in the GUI or follow the instructions below if you're using the CLI.
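
Before preparing the data, it can help to confirm that every PDF has a label file and that each label parses as a JSON object. The following is a minimal sketch of such a check; the function name check_training_dir is hypothetical and not part of InvoiceNet, and the check assumes lowercase .pdf and .json extensions.

```python
import json
from pathlib import Path


def check_training_dir(data_dir):
    """Return (pdfs_without_label, labels_without_pdf, unreadable_labels).

    Hypothetical helper, not part of InvoiceNet. Files are matched by
    name without extension, e.g. invoice1.pdf <-> invoice1.json.
    """
    root = Path(data_dir)
    pdf_stems = {p.stem for p in root.glob("*.pdf")}
    json_stems = {p.stem for p in root.glob("*.json")}

    unreadable = []
    for stem in sorted(pdf_stems & json_stems):
        label_path = root / f"{stem}.json"
        try:
            label = json.loads(label_path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            unreadable.append(stem)
            continue
        # A label is expected to be an object mapping field names to values.
        if not isinstance(label, dict):
            unreadable.append(stem)

    missing_labels = sorted(pdf_stems - json_stems)
    missing_pdfs = sorted(json_stems - pdf_stems)
    return missing_labels, missing_pdfs, unreadable


if __name__ == "__main__":
    print(check_training_dir("train_data/"))
```

Three empty lists mean the directory matches the layout described above.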

Add Your Own Fields

To add your own fields to InvoiceNet, open invoicenet/__init__.py.

There are 4 pre-defined field types:

  • FIELD_TYPES["general"] : General fields such as names, addresses, invoice numbers, etc.
  • FIELD_TYPES["optional"] : Optional fields that might not be present in all invoices.
  • FIELD_TYPES["amount"] : Fields that represent an amount.
  • FIELD_TYPES["date"] : Fields that represent a date.

Choose the appropriate field type for the field and add the line mentioned below.

# Add the following line at the end of the file

# For example, to add a field total_amount
FIELDS["total_amount"] = FIELD_TYPES["amount"]

# For example, to add a field invoice_date
FIELDS["invoice_date"] = FIELD_TYPES["date"]

# For example, to add a field tax_id (which might be optional)
FIELDS["tax_id"] = FIELD_TYPES["optional"]

# For example, to add a field vendor_name
FIELDS["vendor_name"] = FIELD_TYPES["general"]
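
The registrations above can be pictured as entries in a plain mapping from field name to field type. The sketch below uses stand-in definitions of FIELD_TYPES and FIELDS so that it runs on its own. The integer values and both helper functions are assumptions made for illustration and may differ from what invoicenet/__init__.py actually contains.

```python
# Stand-ins for the objects in invoicenet/__init__.py.
# The integer values are placeholders chosen for this sketch.
FIELD_TYPES = {"general": 0, "optional": 1, "amount": 2, "date": 3}
FIELDS = {}

# Same registrations as in the examples above.
FIELDS["total_amount"] = FIELD_TYPES["amount"]
FIELDS["invoice_date"] = FIELD_TYPES["date"]
FIELDS["tax_id"] = FIELD_TYPES["optional"]
FIELDS["vendor_name"] = FIELD_TYPES["general"]


def fields_of_type(type_name, fields=FIELDS, field_types=FIELD_TYPES):
    """List the registered field names that use the given field type."""
    wanted = field_types[type_name]
    return sorted(name for name, ftype in fields.items() if ftype == wanted)


def unregistered_keys(label, fields=FIELDS):
    """Return keys of a JSON label that have no entry in FIELDS."""
    return sorted(set(label) - set(fields))


if __name__ == "__main__":
    label = {"vendor_name": "Nike", "total_amount": "137.51", "po_number": "42"}
    print(fields_of_type("amount"))
    print(unregistered_keys(label))
```

A check like unregistered_keys can flag label keys, such as the made-up po_number here, that were never added as fields.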

Using the GUI

InvoiceNet provides a GUI to train a model on your data and to extract information from invoice documents using the trained model.

Trainer

Run the following command to run the trainer GUI:

python trainer.py

You need to prepare the data for training first. You can do so by setting the Data Folder field to the directory containing your training data and then clicking the Prepare Data button. Once the data is prepared, you can start training by clicking the Start button.

Extractor

Run the following command to run the extractor GUI:

python extractor.py

Using the CLI

Training

Prepare the data for training first by running the following command:

python prepare_data.py --data_dir train_data/

Train InvoiceNet using the following command:

python train.py --field enter-field-here --batch_size 8

# For example, for field 'total_amount'
python train.py --field total_amount --batch_size 8
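
Since train.py takes a single --field value, training several fields presumably means running it once per field. The sketch below builds those commands and runs them in sequence. It uses only the --field and --batch_size flags shown above, the helper names are hypothetical, and it assumes it is run from the directory that contains train.py.

```python
import subprocess
import sys


def build_train_commands(fields, batch_size=8):
    """Return one train.py argument list per field."""
    return [
        [sys.executable, "train.py", "--field", field, "--batch_size", str(batch_size)]
        for field in fields
    ]


def train_all(fields, batch_size=8):
    """Train the fields one after another, stopping at the first failure."""
    for command in build_train_commands(fields, batch_size):
        # check=True raises CalledProcessError if train.py exits with an error.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    train_all(["total_amount", "invoice_date"])
```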

Prediction

If you want to use a different OCR engine, change the ocr_engine in the relevant function in create_ngrams.py before running predict.py.


Single invoice

To extract a field from a single invoice file, run the following command:

python predict.py --field enter-field-here --invoice path-to-invoice-file

# For example, to extract field total_amount from an invoice file invoices/1.pdf
python predict.py --field total_amount --invoice invoices/1.pdf

Multiple invoices

To extract information using the trained InvoiceNet model, place the PDF invoice documents in a single directory in the following format:

predict_data/
    invoice1.pdf
    invoice2.pdf
    ...
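
If your invoices are spread across nested folders, one way to get the flat layout shown above is to copy them into a single directory first. This is a minimal sketch: stage_invoices is a hypothetical helper, not part of InvoiceNet, and it assumes the target directory lies outside the source tree.

```python
import shutil
from pathlib import Path


def stage_invoices(source_dir, target_dir):
    """Copy every PDF found under source_dir into a flat target_dir.

    Returns the file names created in target_dir.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() collects the full file list before any copying starts.
    for pdf in sorted(source.rglob("*.pdf")):
        destination = target / pdf.name
        counter = 1
        # Avoid overwriting when two subfolders hold files with the same name.
        while destination.exists():
            destination = target / f"{pdf.stem}_{counter}{pdf.suffix}"
            counter += 1
        shutil.copy2(pdf, destination)
        copied.append(destination.name)
    return copied


if __name__ == "__main__":
    print(stage_invoices("scanned_invoices/", "predict_data/"))
```

The scanned_invoices/ path in the usage line is only a placeholder for wherever the source files live.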

Run InvoiceNet using the following command:

python predict.py --field enter-field-here --data_dir predict_data/

# For example, for field 'total_amount'
python predict.py --field total_amount --data_dir predict_data/

Reference

This implementation is largely based on the work of R. Palm et al., which should be cited (or the preceding conference papers) if this is used in a scientific publication:

[1] Palm, Rasmus Berg, Florian Laws, and Ole Winther. "Attend, Copy, Parse End-to-end information extraction from documents." 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019.

@inproceedings{palm2019attend,
  title={Attend, Copy, Parse End-to-end information extraction from documents},
  author={Palm, Rasmus Berg and Laws, Florian and Winther, Ole},
  booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
  pages={329--336},
  year={2019},
  organization={IEEE}
}

Note

An implementation of an inferior (also slightly broken) invoice handling system based on the paper "Cloudscan - A configuration-free invoice analysis system using recurrent neural networks." is available here.

[2] Palm, Rasmus Berg, Ole Winther, and Florian Laws. "Cloudscan - A configuration-free invoice analysis system using recurrent neural networks." 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017.

@inproceedings{palm2017cloudscan,
  title={Cloudscan-a configuration-free invoice analysis system using recurrent neural networks},
  author={Palm, Rasmus Berg and Winther, Ole and Laws, Florian},
  booktitle={2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)},
  volume={1},
  pages={406--413},
  year={2017},
  organization={IEEE}
}
