
Ochre

Ochre is a toolbox for OCR post-correction. Please note that this software is experimental and very much a work in progress!

  • Overview of OCR post-correction data sets
  • Preprocess data sets
  • Train character-based language models/LSTMs for OCR post-correction
  • Do the post-correction
  • Assess the performance of OCR post-correction
  • Analyze OCR errors

Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.

Data sets

Installation

git clone git@github.com:KBNLresearch/ochre.git
cd ochre
pip install -r requirements.txt
python setup.py develop
  • Using the CWL workflows requires (the development version of) nlppln and its requirements (see the installation guidelines).
  • To run a CWL workflow, type: cwltool path/to/workflow.cwl <inputs> (or use cwl-runner). If you run the command without inputs, the tool will tell you which inputs are required and how to specify them. For more information on running CWL workflows, have a look at the nlppln documentation; this is especially relevant for Windows users.
  • Please note that some of the CWL workflows contain absolute paths; to use them on your own machine, regenerate them using the associated Jupyter Notebooks.

Preprocessing

The software needs the data in the following formats:

  • ocr: text files containing the ocr-ed text, one file per unit (article, page, book, etc.)
  • gs: text files containing the gold standard (correct) text, one file per unit (article, page, book, etc.)
  • aligned: json files containing aligned character sequences:
{
    "ocr": ["E", "x", "a", "m", "p", "", "c"],
    "gs": ["E", "x", "a", "m", "p", "l", "e"]
}

Corresponding files in these directories should have the same name (or at least the same prefix), for example:

├── gs
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
├── ocr
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
└── aligned
    ├── 1.json
    ├── 2.json
    └── 3.json
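Once aligned files exist in this layout, they can be consumed directly. A minimal sketch (not part of ochre) that computes character accuracy from the aligned format, using the example alignment shown above:

```python
def character_accuracy(aligned):
    """Fraction of aligned positions where the OCR character matches the gold standard."""
    pairs = list(zip(aligned["ocr"], aligned["gs"]))
    return sum(1 for ocr_ch, gs_ch in pairs if ocr_ch == gs_ch) / len(pairs)

# The example alignment from above; "" marks a character missing in the OCR.
aligned = {
    "ocr": ["E", "x", "a", "m", "p", "", "c"],
    "gs":  ["E", "x", "a", "m", "p", "l", "e"],
}
accuracy = character_accuracy(aligned)  # 5 of 7 positions match
```

For a real data set, load each dict with json.load from the corresponding file in the aligned directory.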

To create data in these formats, CWL workflows are available. First, run a preprocess workflow to create the gs and ocr directories containing the expected files. Next, run an align workflow to create the aligned directory.

To create the alignments, run one of:

  • align-dir-pack.cwl to align all files in the gs and ocr directories
  • align-test-files-pack.cwl to align the test files in a data division

These workflows can be run stand-alone; the associated notebook is align-workflow.ipynb.

Training networks for OCR post-correction

First, you need to divide the data into a train, validation and test set:

python -m ochre.create_data_division /path/to/aligned

The result of this command is a json file containing lists of file names, for example:

{
    "train": ["1.json", "2.json", "3.json", "4.json", "5.json", ...],
    "test": ["6.json", ...],
    "val": ["7.json", ...]
}
  • Script: lstm_synched.py

OCR post-correction

If you trained a model, you can use it to correct OCR text using the lstm_synced_correct_ocr command:

python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

or

cwltool /path/to/ochre/cwl/lstm_synced_correct_ocr.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --txt /path/to/ocr/text/file

The command creates a text file containing the corrected text.

To generate corrected text for the test files of a dataset, do:

cwltool /path/to/ochre/cwl/post_correct_test_files.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --datadivision /path/to/data/division --in_dir /path/to/directory/with/ocr/text/files

To run it for a directory of text files, use:

cwltool /path/to/ochre/cwl/post_correct_dir.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --in_dir /path/to/directory/with/ocr/text/files

(These CWL workflows can be run stand-alone; the associated notebook is post_correction_workflows.ipynb.)

  • Explain merging of predictions
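How predictions are merged is still to be documented. For character-level models applied to overlapping sliding windows, one common strategy is a per-position majority vote over all windows covering that position; a hypothetical sketch, not ochre's actual merging code:

```python
from collections import Counter

def merge_window_predictions(windows, step):
    """Merge predicted strings from overlapping sliding windows by majority vote.

    windows: list of predicted strings; window i is assumed to start at
    text position i * step. Hypothetical illustration only.
    """
    votes = {}
    for i, win in enumerate(windows):
        for j, ch in enumerate(win):
            votes.setdefault(i * step + j, Counter())[ch] += 1
    # For each position, keep the character predicted most often.
    return "".join(votes[p].most_common(1)[0][0] for p in sorted(votes))
```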

Performance

To calculate performance of the OCR (post-correction), the external tool ocrevalUAtion is used. More information about this tool can be found on the website and wiki.

Two workflows are available for calculating performance. The first calculates performance for all files in a directory. To use it type:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-wf-pack.cwl#main --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

The second calculates performance for all files in the test set:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-test-files-wf-pack.cwl --datadivision /path/to/datadivision.json --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

Both of these workflows are stand-alone (packed). The corresponding Jupyter notebook is ocr-evaluation-workflow.ipynb.

To use the ocrevalUAtion tool in your workflows, you have to add it to the WorkflowGenerator's steps library:

wf.load(step_file='https://raw.githubusercontent.com/nlppln/ocrevaluation-docker/master/ocrevaluation.cwl')
  • TODO: explain how to calculate performance with ignore case (or use lowercase-directory.cwl)
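ocrevalUAtion reports metrics such as character error rate (CER). As an illustration only (not the ocrevalUAtion implementation), CER can be computed as the Levenshtein distance between gold standard and OCR text, normalized by the gold standard length:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = cur
    return prev[-1]

def cer(gold, ocr):
    """Character error rate: edit distance normalized by gold standard length."""
    return levenshtein(gold, ocr) / len(gold)
```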

OCR error analysis

Different types of OCR errors exist, e.g., structural vs. random mistakes. OCR post-correction methods may be suitable for fixing different types of errors. Therefore, it is useful to gain insight into what types of OCR errors occur. We chose to approach this problem on the word level. In order to be able to compare OCR errors on the word level, words in the OCR text and gold standard text need to be mapped. CWL workflows are available to do this. To create word mappings for the test files of a dataset, use:

cwltool  /path/to/ochre/cwl/word-mapping-test-files.cwl --data_div /path/to/datadivision --gs_dir /path/to/directory/containing/the/gold/standard/texts --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

To create word mappings for two directories of files, do:

cwltool  /path/to/ochre/cwl/word-mapping-wf.cwl --gs_dir /path/to/directory/containing/the/gold/standard/texts/ --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

(These workflows can be regenerated using the notebook word-mapping-workflow.ipynb.)

The result is a csv file containing mapped words. The first column contains a word id, the second the gold standard word, and the third the OCR word:

,gs,ocr
0,Hello,Hcllo
1,World,World
2,!,.

This csv file can be used to analyze the errors. See notebooks/categorize errors based on word mappings.ipynb for an example.
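As a small illustration (assuming only the csv layout shown above), the fraction of incorrectly recognized words can be computed directly from a word mapping:

```python
import csv
import io

def word_error_fraction(csv_text):
    """Fraction of mapped words where the OCR token differs from the gold standard."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sum(1 for row in rows if row["ocr"] != row["gs"]) / len(rows)

# The example word mapping from above
mapping = ",gs,ocr\n0,Hello,Hcllo\n1,World,World\n2,!,.\n"
fraction = word_error_fraction(mapping)  # 2 of 3 words differ
```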

We use heuristics to categorize the following types of errors (ochre/ocrerrors.py):

  • TODO: add error types

OCR quality measure

Jupyter notebook

  • better (more balanced) training data is needed.

Generating training data

  • Scramble gold standard text
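One way to scramble gold standard text is to inject character confusions that mimic common OCR mistakes. A hypothetical sketch (the confusion table and rate are assumptions, not derived from ochre):

```python
import random

# Hypothetical single-character confusions resembling common OCR mistakes
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "i": "1", "s": "5"}

def scramble(text, rate=0.1, seed=0):
    """Replace characters with OCR-like confusions at the given rate (deterministic per seed)."""
    rng = random.Random(seed)
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )

noisy = scramble("gold standard text", rate=0.3, seed=42)
```

Pairing the scrambled output with the original text yields synthetic ocr/gs file pairs that can go through the same alignment workflows as real data.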

Ideas

  • Visualization of probabilities for each character (do the ocr mistakes have lower probability?) (probability=color)

License

Copyright (c) 2017-2018, Koninklijke Bibliotheek, Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
