• Stars: 205
• Rank: 191,264 (Top 4 %)
• Language: Jupyter Notebook
• Created: over 3 years ago
• Updated: 6 months ago

Repository Details

Introduction to OCR with Python

by Dr. W.J.B. Mattingly

Introduction

Optical Character Recognition, or OCR, is a common task in many domains. The earliest OCR systems were designed to serve the visually impaired. Modern applications, however, extend to a far wider population. The goal of OCR is to take an input image and output raw text while maintaining the structure of the text in the image. In other words, its end goal is to preserve the line breaks, paragraph segmentation, and other structural features of the text on the page.
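
To make the idea of preserving structure concrete, here is a minimal sketch of how recognized words could be reassembled into lines once an OCR engine has reported a bounding box for each word. It is an illustration only, not the method taught in this course: the `WordBox` structure, the `boxes_to_text` helper, the `line_tolerance` parameter, and the sample coordinates are all hypothetical, and real pages (multiple columns, skewed scans) would need more care.

```python
from typing import NamedTuple


class WordBox(NamedTuple):
    """A recognized word and its bounding box (hypothetical structure)."""
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def boxes_to_text(boxes: list[WordBox], line_tolerance: float = 0.5) -> str:
    """Rebuild line-structured text from unordered word boxes.

    A word joins the current line when its vertical center lies within
    line_tolerance * (height of the line's first word) of that word's center.
    """
    lines: list[list[WordBox]] = []
    # Visit words from top to bottom so lines come out in page order.
    for box in sorted(boxes, key=lambda b: b.y + b.h / 2):
        center = box.y + box.h / 2
        if lines:
            ref = lines[-1][0]  # first word placed on the current line
            ref_center = ref.y + ref.h / 2
            if abs(center - ref_center) <= line_tolerance * ref.h:
                lines[-1].append(box)
                continue
        lines.append([box])
    # Within each line, read left to right.
    return "\n".join(
        " ".join(b.text for b in sorted(line, key=lambda b: b.x))
        for line in lines
    )


# Made-up sample data: four words arriving in arbitrary order.
words = [
    WordBox("world", 70, 11, 50, 20),
    WordBox("Second", 10, 50, 60, 20),
    WordBox("Hello", 10, 10, 50, 20),
    WordBox("line", 80, 49, 40, 20),
]
print(boxes_to_text(words))
```

Grouping by vertical center rather than by exact top coordinate tolerates the small baseline jitter that is common in scanned pages, which is why the sample words with slightly different `y` values still land on the same line.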

This course is designed to teach you how to automate OCR in Python for optimized results. It is meant to be used alongside the YouTube series OCR in Python Tutorials.

Organization of Textbook

Lesson  Name
01.01   Introduction to OCR
01.02   Introduction to the Libraries
01.03   How to Install Libraries
02.01   The Basics of Pillow
02.02   The Basics of OpenCV
02.03   The Basics of Tesseract
03.01   Passing Pillow Images to OpenCV
03.02   The Basics of OpenCV
03.03   Manipulating the Image
04.01   Bounding Boxes
04.02   Extracting Bounding Boxes
04.03   Organizing Bounding Boxes
05.01   Parameters of Tesseract
05.02   Cleaning the Output of Tesseract
06.01   Workflow for Standard OCR of Text
06.02   Workflow for Ignoring Footnotes
06.03   Workflow for Tables
07.01   Tesseract with non-English
07.02   Tesseract with Early Modern Scripts
07.03   Tesseract with non-Latin Scripts
