  • Language
    JavaScript
  • License
    MIT License
  • Created over 8 years ago
  • Updated 9 months ago


Repository Details

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)

ocr-fileformat


Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)


Installation

Docker

You can run the command line scripts and the web interface in a Docker container; the only requirement is a working Docker installation.

To start the web interface on http://localhost:8080:

docker run --rm -it -p 8080:8080 ubma/ocr-fileformat

To run the command line scripts, mount the directory containing your input files into the container's /data directory:

docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform alto2.0 hocr somefile.alto
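If you call the containerized scripts often, a small shell function can save typing. The following is only a sketch: `ocr_docker` is a hypothetical helper name, not something shipped with ocr-fileformat, and it simply reuses the `docker run` options shown above.

```shell
# Hypothetical wrapper: run any ocr-fileformat script inside the container,
# with the current directory mounted at /data as in the command above.
ocr_docker() {
    docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat "$@"
}
```

With this in place, `ocr_docker ocr-transform alto2.0 hocr somefile.alto` expands to the same command as the example above.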

System-wide

To install system-wide to /usr/local:

sudo make install

To install without sudo to your home directory:

make install PREFIX=$HOME/.local

If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"
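Startup files are often sourced more than once, which can add the same directory to PATH repeatedly. A minimal sketch of an idempotent variant follows; `path_prepend` is a hypothetical helper name chosen for illustration.

```shell
# Add a directory to the front of PATH only if it is not already listed.
# path_prepend is a hypothetical helper, not part of ocr-fileformat.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;      # PATH entries are separated by colons
    esac
}

path_prepend "$HOME/.local/bin"
export PATH
```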

The web application has a PHP backend. You can deploy it on any PHP-capable server by copying the web folder somewhere below the document root of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:

sudo -u www-data cp -r web /var/www/html/ocr-fileformat

In this example the GUI would be available under http://localhost/ocr-fileformat/.

Usage

The project offers two functions, transformation and validation, which can be accessed via command line scripts (CLI), a web interface (GUI), or from your own tools (API).

CLI

  • ocr-transform: Transformation of OCR output between OCR formats
  • ocr-validate: Validation of OCR output against OCR format schemas

GUI

The web interface is for testing validation and transformations. You can upload a file or select an input file by URL.

API

Transformation

Transformation CLI

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

For example, you can transform an ALTO XML file to an hOCR file with:

ocr-transform alto hocr sample.xml sample.hocr

Or convert from ALTO XML (version 2.1) to hOCR with:

ocr-transform alto2.1 hocr sample.alto sample.hocr
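To convert many files at once, the same call can be wrapped in a loop. This is a sketch under assumptions: `convert_dir` is a hypothetical helper name, the inputs are assumed to end in `.alto`, and `ocr-transform` is assumed to return a non-zero exit status when a conversion fails.

```shell
# Convert every *.alto file in a directory to hOCR, writing <name>.hocr
# next to each input. convert_dir is a hypothetical helper name.
convert_dir() {
    dir=$1
    for infile in "$dir"/*.alto; do
        [ -e "$infile" ] || continue           # the glob matched nothing
        outfile="${infile%.alto}.hocr"          # swap the file extension
        ocr-transform alto2.1 hocr "$infile" "$outfile" ||
            echo "failed: $infile" >&2          # assumes non-zero exit on error
    done
}
```

Call it as `convert_dir path/to/scans`; files that fail are reported on standard error and the loop continues.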

You can also pass arguments directly to the Saxon CLI by passing them after a double dash (--). For example, to set the foo parameter to bar:

ocr-transform alto hocr sample.xml sample.hocr -- foo=bar

Try ocr-transform -h to get an overview:

Usage:
ocr-transform [OPTIONS] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
ocr-transform [OPTIONS] <input-fmt> <output-fmt> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help               Show this help, and exit
ocr-transform [OPTIONS] -v|--version            Show version, and exit
ocr-transform [OPTIONS] -L|--list               List available from/to, and exit

    Options:
        --debug   -d     Increase debug level by 1, can be repeated

    Transformations:
        abbyy hocr
        abbyy page
        alto2.0 alto3.0
        alto2.0 alto3.1
        alto2.0 hocr
        alto2.1 alto3.0
        alto2.1 alto3.1
        alto2.1 hocr
        alto4.2 alto2.1
        alto page
        alto text
        gcv alto
        gcv hocr
        gcv page
        hocr alto2.0
        hocr alto2.1
        hocr page
        hocr text
        page alto
        page hocr
        page page2019
        page text
        tei hocr
        textract page
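A script can check whether a given pair is available before attempting a conversion. The sketch below assumes that `ocr-transform -L` prints one "from to" pair per line (possibly indented), as in the listing above; the exact output format is an assumption, and `can_transform` is a hypothetical helper name.

```shell
# Succeed if the pair "<from> <to>" is among the listed transformations.
# Assumption: `ocr-transform -L` prints one "from to" pair per line.
can_transform() {
    ocr-transform -L | awk -v from="$1" -v to="$2" '
        $1 == from && $2 == to { found = 1 }   # awk ignores leading blanks
        END { exit !found }                    # exit 0 only when a match was seen
    '
}
```

For example, `can_transform page alto && ocr-transform page alto in.xml out.xml` only runs the conversion when the pair is listed.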

Transformation GUI

Select the Transform menu option. Choose a URL, an input and an output format. Click Transform.

Transformation API

The stylesheets are installed in $PREFIX/share/ocr-fileformat/xslt and can be used directly in your scripts and software. You will need to use an XSLT 2.0 capable stylesheet transformer.
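As a rough illustration of direct use, the following sketch applies one stylesheet with the Saxon command line. Everything specific here is an assumption: `SAXON_JAR` must point to a Saxon jar you have available, `apply_stylesheet` is a hypothetical helper name, and the stylesheet file name has to be taken from the contents of the xslt directory, which are not listed in this document.

```shell
# Directory where the stylesheets are installed; PREFIX defaults to
# /usr/local, matching the system-wide installation.
XSLT_DIR="${PREFIX:-/usr/local}/share/ocr-fileformat/xslt"

# Apply one installed stylesheet to an input file with Saxon.
# Assumptions: SAXON_JAR is set by the caller; the stylesheet name is
# whatever file exists in $XSLT_DIR (check with: ls "$XSLT_DIR").
apply_stylesheet() {
    stylesheet=$1 input=$2 output=$3
    java -jar "$SAXON_JAR" -s:"$input" -xsl:"$XSLT_DIR/$stylesheet" -o:"$output"
}
```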

Supported Transformations

From ╲ To             hOCR   ALTO   PAGEXML
hOCR                  -      ✓      ✓
ALTO                  ✓      ✓      ✓
PAGEXML               ✓      ✓      ✓
FineReader            ✓      -      ✓
Google Cloud Vision   ✓      ✓      ✓
Amazon AWS Textract   -      -      ✓
TEI                   ✓      -      -

Validation

Usage:
ocr-validate [OPTIONS]   []
ocr-validate [OPTIONS] -h|--help       Show this help, and exit
ocr-validate [OPTIONS] -v|--version    Show version, and exit
ocr-validate [OPTIONS] -L|--list       List available schemas, and exit

    Options:
        --debug   -d     Increase debug level by 1, can be repeated

    Schemas:
        hocr
        alto-1-0 alto-1-1 alto-1-2 alto-1-3 alto-1-4 alto-2-0 alto-2-1 alto-2-2-draft alto-3-0 alto-3-1 alto-3-2-draft alto-4-0 alto-4-1 alto-4-2 alto-4-3
        abbyy-6-schema-v1 abbyy-8-schema-v2 abbyy-9-schema-v1 abbyy-10-schema-v1
        page-2009-03-16 page-2010-01-12 page-2010-03-19 page-2013-07-15 page-2016-07-15 page-2017-07-15 page-2018-07-15 page-2019-07-15

Validation CLI

For example, to validate an XML file against the ALTO 3.1 schema:

ocr-validate alto-3-1 myFile.alto
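Several files can be checked against one schema in a loop. This sketch assumes that `ocr-validate` returns a non-zero exit status for an invalid file, which is not stated above; `validate_all` is a hypothetical helper name.

```shell
# Validate each file against the same schema and summarize the results.
# Assumption: ocr-validate exits with a non-zero status for invalid input.
validate_all() {
    schema=$1
    shift                                   # remaining arguments are files
    failed=0
    for f in "$@"; do
        if ocr-validate "$schema" "$f" >/dev/null; then
            echo "ok: $f"
        else
            echo "invalid: $f"
            failed=1
        fi
    done
    return "$failed"                        # non-zero if any file failed
}
```

A call such as `validate_all alto-3-1 *.alto` prints one line per file and returns a non-zero status if any file was rejected.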

Validation GUI

Select the Validate menu option. Choose a URL and a schema. Click Validate.

Validation API

The XSD files are installed under $PREFIX/share/ocr-fileformat/xsd.
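One way to use an installed XSD outside of ocr-validate is xmllint from libxml2. This is a sketch, not the project's own method: `validate_with_xsd` is a hypothetical helper name, and the XSD path must be supplied by the caller because the file names in the xsd directory are not listed here.

```shell
# Validate an XML file against a single XSD using xmllint (libxml2).
# Assumption: the caller passes the path of an XSD file found under
# $PREFIX/share/ocr-fileformat/xsd.
validate_with_xsd() {
    xsd=$1 xml=$2
    # --noout suppresses the echoed document; only the verdict is printed
    xmllint --noout --schema "$xsd" "$xml"
}
```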

Supported Validation Formats

             hOCR   ALTO   PAGEXML   FineReader   Google Cloud Vision   Amazon AWS Textract
Validation   ✓      ✓      ✓         ✓            -                     -

License

This is free software. You may use it under the terms of the MIT License.

During the installation process several projects are included (in ./vendor). These projects have different licenses:
