  • Language
    Python
  • License
    Apache License 2.0
s3-ocr

Tools for running OCR against files stored in S3

Background on this project: s3-ocr: Extract text from PDF files stored in an S3 bucket

Installation

Install this tool using pip:

pip install s3-ocr

Demo

You can see the results of running this tool against three PDFs from the Internet Archive (one, two, three) in this example table hosted using Datasette.

Starting OCR against PDFs in a bucket

The start command takes a list of keys and submits them to Textract for OCR processing.

You need to have AWS configured using environment variables, a credentials file in your home directory, or a JSON or INI file generated using s3-credentials.

You can start the process running like this:

s3-ocr start name-of-your-bucket my-pdf-file.pdf

The paths you specify should be paths within the bucket. If you stored your PDF files in folders inside the bucket, the command should look like this:

s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf

OCR can take some time. The results of the OCR will be stored in the textract-output/ folder in your bucket.

To process every file in the bucket with a .pdf extension use --all:

s3-ocr start name-of-bucket --all

To process every file with a .pdf extension within a specific folder, use --prefix:

s3-ocr start name-of-bucket --prefix path/to/folder
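The selection performed by --all and --prefix can be pictured as a filter over the keys in the bucket. The following is a minimal sketch, not the tool's own code: select_pdf_keys is a hypothetical helper, and the case-insensitive extension match is an assumption.

```python
def select_pdf_keys(keys, prefix=None):
    """Return the keys that look like PDFs, optionally limited to a prefix."""
    selected = []
    for key in keys:
        # S3 prefixes are plain string prefixes, not directory names
        if prefix is not None and not key.startswith(prefix):
            continue
        # Assumption: the .pdf extension is matched case-insensitively
        if key.lower().endswith(".pdf"):
            selected.append(key)
    return selected


bucket_keys = [
    "one.pdf",
    "path/to/folder/two.pdf",
    "path/to/folder/notes.txt",
    "path/to/other/three.PDF",
]
print(select_pdf_keys(bucket_keys))
print(select_pdf_keys(bucket_keys, prefix="path/to/folder"))
```

Because the prefix is compared as a string, a prefix without a trailing slash would also match a sibling such as path/to/folder2/.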

s3-ocr start --help

Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]...

  Start OCR tasks for PDF files in an S3 bucket

      s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf

  To process every file with a .pdf extension:

      s3-ocr start name-of-bucket --all

  To process every .pdf in the PUBLIC/ folder:

      s3-ocr start name-of-bucket --prefix PUBLIC/

Options:
  --all                 Process all PDF files in the bucket
  --prefix TEXT         Process all PDF files within this prefix
  --dry-run             Show what this would do, but don't actually do it
  --no-retry            Don't retry failed requests
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

Checking status

The s3-ocr status <bucket-name> command shows a rough indication of progress through the tasks:

% s3-ocr status sfms-history
153 complete out of 532 jobs

It compares the jobs that have been submitted, based on .s3-ocr.json files, to the jobs that have their results written to the textract-output/ folder.
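The comparison described above can be sketched in a few lines. This is an illustration only: count_progress is a hypothetical helper, and it assumes that results are stored as numbered objects under textract-output/<job_id>/, which is not spelled out in this document.

```python
def count_progress(job_ids_by_key, output_keys, output_prefix="textract-output/"):
    """Return (complete, total) for the submitted jobs.

    job_ids_by_key maps each .s3-ocr.json key to the job_id it records.
    output_keys is a listing of keys in the bucket.
    """
    finished = set()
    for key in output_keys:
        if not key.startswith(output_prefix):
            continue
        parts = key[len(output_prefix):].split("/")
        # Assumption: result pages are named <job_id>/1, <job_id>/2, ...
        if len(parts) == 2 and parts[1].isdigit():
            finished.add(parts[0])
    submitted = set(job_ids_by_key.values())
    return len(submitted & finished), len(submitted)


job_ids_by_key = {
    "a.pdf.s3-ocr.json": "job1",
    "b.pdf.s3-ocr.json": "job2",
    "c.pdf.s3-ocr.json": "job3",
}
output_keys = [
    "a.pdf",
    "textract-output/job1/1",
    "textract-output/job1/2",
    "textract-output/job2/.s3_access_check",
    "textract-output/job3/1",
]
complete, total = count_progress(job_ids_by_key, output_keys)
print(f"{complete} complete out of {total} jobs")
```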

s3-ocr status --help

Usage: s3-ocr status [OPTIONS] BUCKET

  Show status of OCR jobs for a bucket

Options:
  --access-key ...

Inspecting a job

The s3-ocr inspect-job <job_id> command can be used to check the status of a specific job ID:

% s3-ocr inspect-job b267282745685226339b7e0d4366c4ff6887b7e293ed4b304dc8bb8b991c7864
{
  "DocumentMetadata": {
    "Pages": 583
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
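If you capture that JSON in a script, the fields shown above can be read with the standard json module. A small sketch follows: summarize_job is a hypothetical helper, and treating IN_PROGRESS as the only unfinished status is an assumption based on the Textract job statuses.

```python
import json


def summarize_job(raw_json):
    """Return (status, pages, finished) from inspect-job style JSON output."""
    job = json.loads(raw_json)
    status = job.get("JobStatus")
    pages = job.get("DocumentMetadata", {}).get("Pages")
    # Assumption: any status other than IN_PROGRESS means the job has ended
    finished = status is not None and status != "IN_PROGRESS"
    return status, pages, finished


sample_output = """
{
  "DocumentMetadata": {
    "Pages": 583
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
"""
status, pages, finished = summarize_job(sample_output)
print(status, pages, finished)
```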

s3-ocr inspect-job --help

Usage: s3-ocr inspect-job [OPTIONS] JOB_ID

  Show the current status of an OCR job

      s3-ocr inspect-job <job_id>

Options:
  --access-key ...

Fetching the results

Once an OCR job has completed you can download the resulting JSON using the fetch command:

s3-ocr fetch name-of-bucket path/to/file.pdf

This will save files in the current directory with names like this:

  • 4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json
  • 4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json

The number of files will vary depending on the length of the document.

If you don't want separate files you can combine them together using the -c/--combine option:

s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json
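Conceptually, combining amounts to concatenating the "Blocks" lists from each numbered part into one document. The sketch below illustrates that idea with a hypothetical combine_parts helper; it is not taken from the tool's source and ignores any other top-level keys the part files may contain.

```python
import json


def combine_parts(part_documents):
    """Merge several result documents into a single {"Blocks": [...]} dict."""
    combined = {"Blocks": []}
    for document in part_documents:
        # Keep the blocks in the order the parts were supplied
        combined["Blocks"].extend(document.get("Blocks", []))
    return combined


# Two tiny stand-ins for the -1.json and -2.json files
part_1 = json.loads('{"Blocks": [{"BlockType": "PAGE", "Page": 1}]}')
part_2 = json.loads('{"Blocks": [{"BlockType": "LINE", "Page": 1, "Text": "Barry"}]}')

combined = combine_parts([part_1, part_2])
print(json.dumps(combined, indent=2))
```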

The output.json file will then contain data that looks something like this:

{
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {...},
      "Page": 1,
      ...
    },
    {
      "BlockType": "LINE",
      "Page": 1,
      ...
      "Text": "Barry"
    },
    ...
  ]
}

s3-ocr fetch --help

Usage: s3-ocr fetch [OPTIONS] BUCKET KEY

  Fetch the OCR results for a specified file

      s3-ocr fetch name-of-bucket path/to/key.pdf

  This will save files in the current directory called things like

      a806e67e504fc15f...48314e-1.json
      a806e67e504fc15f...48314e-2.json

  To combine these together into a single JSON file with a specified name, use:

      s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json

  Use "--output -" to print the combined JSON to standard output instead.

Options:
  -c, --combine FILENAME  Write combined JSON to file
  --access-key ...

Fetching just the text of a page

If you don't want to deal with the JSON directly, you can use the text command to retrieve just the text extracted from a PDF:

s3-ocr text name-of-bucket path/to/file.pdf

This will output plain text to standard output.

To save that to a file, use this:

s3-ocr text name-of-bucket path/to/file.pdf > text.txt

Pages will be separated by three newlines. To separate them with a ---- horizontal divider instead, add --divider:

s3-ocr text name-of-bucket path/to/file.pdf --divider
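To see how page text relates to the JSON results, here is a minimal sketch that groups LINE blocks by page and joins the pages. blocks_to_text is a hypothetical helper, and the exact whitespace around the ---- divider is an assumption.

```python
def blocks_to_text(blocks, divider=False):
    """Join the text of LINE blocks, grouped by page number."""
    pages = {}
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages.setdefault(block["Page"], []).append(block["Text"])
    # Three newlines between pages by default, or a ---- line with divider=True
    separator = "\n----\n" if divider else "\n\n\n"
    return separator.join("\n".join(pages[number]) for number in sorted(pages))


blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Barry"},
    {"BlockType": "LINE", "Page": 1, "Text": "Smith"},
    {"BlockType": "PAGE", "Page": 2},
    {"BlockType": "LINE", "Page": 2, "Text": "Page two"},
]
print(blocks_to_text(blocks))
print(blocks_to_text(blocks, divider=True))
```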

s3-ocr text --help

Usage: s3-ocr text [OPTIONS] BUCKET KEY

  Retrieve the text from an OCRd PDF file

      s3-ocr text name-of-bucket path/to/key.pdf

Options:
  --divider             Add ---- between pages
  --access-key ...

Avoiding processing duplicates

If you move files around within your S3 bucket, s3-ocr can lose track of which files have already been processed. This can lead to additional Textract charges should you run s3-ocr start against those moved files.

The s3-ocr dedupe command addresses this by scanning your bucket for files that have a new name but have previously been processed. It does this by looking at the ETag for each file, which S3 typically derives from the MD5 hash of the file contents.

The command will write out new .s3-ocr.json files for each detected duplicate. This will prevent those duplicates from being run through OCR a second time should you run s3-ocr start.

s3-ocr dedupe name-of-bucket

Add --dry-run for a preview of the changes that will be made to your bucket.
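The ETag matching behind dedupe can be sketched as a lookup from ETag to an existing job. This is an illustration under assumptions: find_duplicates and its input shapes are hypothetical, and the tool's actual logic may differ.

```python
def find_duplicates(objects, tracking):
    """Return the tracking files to write for unprocessed duplicates.

    objects maps each PDF key to its current ETag.
    tracking maps each already-processed PDF key to {"job_id": ..., "etag": ...}.
    """
    job_by_etag = {}
    for record in tracking.values():
        job_by_etag.setdefault(record["etag"], record["job_id"])

    new_files = {}
    for key, etag in objects.items():
        if key in tracking:
            continue  # already has a tracking file
        if etag in job_by_etag:
            # Same contents as a processed file, so reuse its job_id
            new_files[key + ".s3-ocr.json"] = {
                "job_id": job_by_etag[etag],
                "etag": etag,
            }
    return new_files


tracking = {"old/report.pdf": {"job_id": "job-1", "etag": '"abc"'}}
objects = {
    "old/report.pdf": '"abc"',
    "new/report-renamed.pdf": '"abc"',
    "new/other.pdf": '"def"',
}
duplicates = find_duplicates(objects, tracking)
print(duplicates)
```

A dry run corresponds to printing this result instead of writing the files to the bucket.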

s3-ocr dedupe --help

Usage: s3-ocr dedupe [OPTIONS] BUCKET

  Scan every file in the bucket checking for duplicates - files that have not
  yet been OCRd but that have the same contents (based on ETag) as a file that
  HAS been OCRd.

      s3-ocr dedupe name-of-bucket

Options:
  --dry-run             Show output without writing anything to S3
  --access-key ...

Changes made to your bucket

To keep track of which files have been submitted for processing, s3-ocr will create a JSON file for every file that it adds to the OCR queue.

This file will be called:

path-to-file/name-of-file.pdf.s3-ocr.json

Each of these JSON files contains data that looks like this:

{
  "job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
  "etag": "\"b0c77472e15500347ebf46032a454e8e\""
}

The recorded job_id can be used later to associate the file with the results of the OCR task in textract-output/.

The etag is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.

This design for the tool, with the .s3-ocr.json files tracking jobs that have been submitted, means that it is safe to run s3-ocr start against the same bucket multiple times without the risk of starting duplicate OCR jobs.
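The bookkeeping described above can be sketched as follows. The helper names are hypothetical and the logic is a simplified assumption about how the tracking files might be consulted, not the tool's implementation.

```python
TRACKING_SUFFIX = ".s3-ocr.json"


def tracking_key(pdf_key):
    """Key of the tracking file that sits next to a PDF."""
    return pdf_key + TRACKING_SUFFIX


def keys_needing_ocr(all_keys):
    """PDF keys with no tracking file, i.e. not yet submitted."""
    existing = set(all_keys)
    return [
        key
        for key in all_keys
        if key.endswith(".pdf") and tracking_key(key) not in existing
    ]


def has_changed(record, current_etag):
    """True if the object's ETag differs from the one recorded at submission."""
    return record["etag"] != current_etag


all_keys = ["a.pdf", "a.pdf.s3-ocr.json", "docs/b.pdf", "docs/readme.txt"]
print(keys_needing_ocr(all_keys))

record = {"job_id": "hypothetical-job-id", "etag": '"b0c77472e15500347ebf46032a454e8e"'}
print(has_changed(record, '"b0c77472e15500347ebf46032a454e8e"'))
```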

Creating a SQLite index of your OCR results

The s3-ocr index <bucket> <database_file> command creates a SQLite database containing the results of the OCR, and configures SQLite full-text search against the text:

% s3-ocr index sfms-history index.db
Fetching job details  [####################################]  100%
Populating pages table  [####################----------------]   55%  00:03:18

The schema of the resulting database looks like this (excluding the FTS tables):

CREATE TABLE [pages] (
   [path] TEXT,
   [page] INTEGER,
   [folder] TEXT,
   [text] TEXT,
   PRIMARY KEY ([path], [page])
);
CREATE TABLE [ocr_jobs] (
   [key] TEXT PRIMARY KEY,
   [job_id] TEXT,
   [etag] TEXT,
   [s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
   [job_id] TEXT PRIMARY KEY
);

The database is designed to be used with Datasette.
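The database can also be queried from Python with the standard sqlite3 module. The sketch below recreates the pages table in memory with made-up rows so that it runs on its own; it uses a plain LIKE query rather than the FTS tables, whose names are not shown in the schema above. The sample values, including what folder holds, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE [pages] (
       [path] TEXT,
       [page] INTEGER,
       [folder] TEXT,
       [text] TEXT,
       PRIMARY KEY ([path], [page])
    );
    """
)

# Hypothetical rows standing in for real OCR output
rows = [
    ("path/to/one.pdf", 1, "path/to", "Barry was here"),
    ("path/to/one.pdf", 2, "path/to", "Nothing to see"),
    ("path/to/two.pdf", 1, "path/to", "Hello Barry"),
]
conn.executemany(
    "INSERT INTO pages (path, page, folder, text) VALUES (?, ?, ?, ?)", rows
)


def search_pages(conn, term):
    """Return (path, page) pairs whose text contains the term."""
    return conn.execute(
        "SELECT path, page FROM pages WHERE text LIKE ? ORDER BY path, page",
        (f"%{term}%",),
    ).fetchall()


matches = search_pages(conn, "Barry")
print(matches)
```

Against a real index you would connect to the database file, for example index.db, instead of :memory:.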

s3-ocr index --help

Usage: s3-ocr index [OPTIONS] BUCKET DATABASE

  Create a SQLite database with OCR results for files in a bucket

Options:
  --access-key ...

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd s3-ocr
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

To regenerate the README file with the latest --help:

cog -r README.md
