Extraction of machine-readable zone information from passports, visas and id-cards via OCR

PassportEye: Python tools for image processing of identification documents

The package provides tools for recognizing machine readable zones (MRZ) in scanned identification documents. The document may be located almost anywhere on the page: the code looks for anything resembling an MRZ and parses it from there.

Recognition can be rather slow: 10 seconds or more for some documents. Its precision is not perfect, but it was decent on the test documents available to the developer: in around 80% of the cases where a page contains a clearly visible MRZ, the system recognizes it and extracts the text as well as the underlying OCR engine (Google Tesseract) allows.

The failures are most often either badly scanned documents, where the text is too blurred, or, more seriously, certain types of IDs (Romanian being one example) where the MRZ is too close to the rest of the card, a situation the current algorithm does not handle well.
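
Because the OCR output is imperfect, it helps to know that an MRZ carries check digits that let a consumer catch many misread characters. The following is a minimal sketch of the check-digit rule from ICAO Doc 9303 (weights 7, 3, 1 repeated, sum taken modulo 10). It is independent of PassportEye and is not the package's own implementation; the function names are illustrative.

```python
import string

# Character values used by the check-digit rule: digits keep their value,
# A-Z map to 10-35, and the filler character '<' counts as 0.
_VALUES = {c: i for i, c in enumerate(string.digits + string.ascii_uppercase)}
_VALUES["<"] = 0
_WEIGHTS = (7, 3, 1)


def mrz_check_digit(field: str) -> int:
    """Compute the check digit for one MRZ field (illustrative helper).

    Raises KeyError if the field contains a character that is not valid
    in an MRZ, such as a lowercase letter.
    """
    total = sum(_VALUES[ch] * _WEIGHTS[i % 3] for i, ch in enumerate(field))
    return total % 10


def field_is_consistent(field: str, check_char: str) -> bool:
    """True if the recognized check character matches the computed digit."""
    return check_char.isdecimal() and int(check_char) == mrz_check_digit(field)
```

For instance, mrz_check_digit("740812") evaluates to 2, so a date field "740812" followed by any other check character suggests that at least one character was misread.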

Installation

The simplest way to install the package is via pip:

$ pip install PassportEye

Note that PassportEye depends on numpy, scipy, matplotlib and scikit-image, among other packages. These requirements are installed automatically, but the installation may take time or occasionally fail (e.g. because a necessary library is missing). If this happens, consider installing the dependencies explicitly from binary packages, such as those provided by the OS distribution or the "wheel" packages. Another convenient option is to use a Python distribution with pre-packaged numpy/scipy/matplotlib binaries (Anaconda Python being a great choice at the moment).

In addition, you must have Tesseract OCR installed and added to the system path: the tesseract tool must be accessible at the command line. Note that in some installations (e.g. on Windows) the most recent version of Tesseract does not include its "legacy" model by default. The legacy model, however, shows slightly better performance for MRZ text detection according to our tests and is therefore used by default. If the model is not installed, download the eng.traineddata file [here](https://github.com/tesseract-ocr/tessdata) and use it to replace the (smaller) eng.traineddata file that came with your installation.
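
A quick way to confirm that the tesseract tool is reachable from Python is to look it up on the system path. This is a small sketch using only the standard library; the helper name tool_on_path is illustrative, and the check says nothing about which models are installed.

```python
import shutil


def tool_on_path(name: str = "tesseract") -> bool:
    """Return True if an executable called `name` is found on the system path."""
    return shutil.which(name) is not None


if not tool_on_path():
    print("tesseract was not found on the system path")
```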

PassportEye requires Python version 3.6 or higher.

Usage

On installation, the package installs a standalone tool mrz into your Python scripts path. Running:

$ mrz <filename>

will process the given file, extracting the MRZ information it finds and printing it out in tabular form. Running mrz --json <filename> will output the same information in JSON. Running mrz --save-roi <roi.png> will, in addition, extract the detected MRZ ("region of interest") into a separate png file for further exploration. Note that the tool provides limited support for PDF files: it attempts to extract the first DCT-encoded image from the PDF and applies the recognition to it. This seems to work fine with most scanner-produced one-page PDFs, but has not been tested extensively.
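
The --json output mentioned above lends itself to scripting. The sketch below wraps the command-line tool with the standard library, assuming that the tool prints a single JSON document on standard output and exits with a non-zero status on failure; neither assumption is stated in this document. The function name run_mrz_json and its command parameter are illustrative.

```python
import json
import subprocess


def run_mrz_json(filename, command=("mrz",)):
    """Run `mrz --json <filename>` and return the parsed output.

    `command` is the executable (plus any leading arguments) to invoke;
    it defaults to the `mrz` script installed with the package.
    """
    completed = subprocess.run(
        [*command, "--json", filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)
```

Calling run_mrz_json("scan.png") would then return the decoded JSON data for a hypothetical file scan.png.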

If your Tesseract installation has the "legacy" *.traineddata models installed (in its tessdata directory), consider running:

$ mrz --legacy <filename>

This will enable the "legacy" recognizer which, despite the name, seems to work better for MRZ recognition. If you do not know whether you have the relevant files, just try running the command above and see whether you get an error.

In order to use the recognition function in Python code, simply do:

>>> from passporteye import read_mrz
>>> mrz = read_mrz(image_file)

Here image_file can be either a path to a file on disk or a byte stream containing image data.

The returned object (unless it is None, which means no ROI was detected) contains the fields extracted from the MRZ along with some metainformation. For the description of the available fields, see the docstring for the passporteye.mrz.text.MRZ class. Note that you can convert the object to a dictionary using the to_dict() method.
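
The two points above (a None result and the to_dict() method) can be combined into a small helper. This is a sketch rather than part of the package: the name mrz_fields and the reader parameter are illustrative, and the reader argument exists only so that the function can be exercised without the OCR stack installed.

```python
def mrz_fields(image_file, reader=None):
    """Return the MRZ fields as a dictionary, or None if no ROI was detected."""
    if reader is None:
        # Imported lazily, since passporteye pulls in numpy, scipy
        # and scikit-image.
        from passporteye import read_mrz as reader
    mrz = reader(image_file)
    if mrz is None:
        return None
    return mrz.to_dict()
```

The keys of the returned dictionary are described in the docstring of passporteye.mrz.text.MRZ.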

If you want to have the ROI reported alongside the MRZ, call the read_mrz function as follows:

>>> mrz = read_mrz(image_file, save_roi=True)

The ROI can then be accessed as mrz.aux['roi']: it is a numpy ndarray representing the (grayscale) image region to which the OCR was applied.

Finally, in order to use the "legacy recognizer", pass the --oem 0 extra command line argument to Tesseract as follows:

>>> mrz = read_mrz(image_file, extra_cmdline_params='--oem 0')

For more flexibility, you may instead use an MRZPipeline object, which gives you access to all intermediate computations:

>>> from passporteye.mrz.image import MRZPipeline
>>> p = MRZPipeline(image_file, extra_cmdline_params='--oem 0')
>>> mrz = p.result

The "pipeline" object stores the intermediate computations in its data dictionary. You need to understand the underlying algorithm to make sense of them, but they can sometimes yield insightful visualizations. This code, for example, plots the binarized version of the original image, which the algorithm uses to extract ROIs, alongside the boxes corresponding to the extracted ROIs:

>>> from matplotlib.pyplot import imshow, plot
>>> imshow(p['img_binary'])
>>> for b in p['boxes']:
...     plot(b.points[:,1], b.points[:,0], c='b')
...     b.plot()

Development

If you plan to develop or debug the package, consider installing it by running:

$ pip install -e .[dev]

This will install the package in "editable" mode and add a couple of useful extras (such as pytest). You can then run the tests by typing the following at the root of the source distribution:

$ py.test

The command-line script evaluate_mrz can be used to assess the performance of the current recognition pipeline on a set of sample images: this is useful if you want to see the effects of changes to the code. Just run:

$ evaluate_mrz -j 4

(here -j 4 requests 4 cores to be used in parallel). The same script can also run the recognition pipeline on a given directory of images, sorting successes and failures; see evaluate_mrz -h for options.
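
For a quick check from your own code, the idea of sorting a directory into successes and failures can be sketched as follows. This is not how evaluate_mrz is implemented; the function name sort_by_outcome and its parameters are illustrative. A file counts as a success here whenever the reader returns something other than None, which only means that an ROI was detected, not that the MRZ was read correctly.

```python
from pathlib import Path


def sort_by_outcome(directory, reader, patterns=("*.jpg", "*.png")):
    """Split the image files in `directory` into successes and failures.

    `reader` is any callable that returns None when nothing was detected,
    for example passporteye.read_mrz.
    """
    successes, failures = [], []
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            try:
                result = reader(str(path))
            except Exception:
                # Treat a crash on one file as a failure and keep going.
                result = None
            (failures if result is None else successes).append(path.name)
    return successes, failures
```

With the package installed, sort_by_outcome("scans", read_mrz) would process a hypothetical scans directory one file at a time, without the parallelism that -j provides.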

Contributing

Feel free to contribute or report issues via GitHub: https://github.com/konstantint/PassportEye

Copyright & License

Copyright: 2016, Konstantin Tretyakov. License: MIT
