  • Stars: 196
  • Rank: 197,400 (Top 4%)
  • Language: Jupyter Notebook
  • Created: over 3 years ago
  • Updated: 3 months ago

Repository Details

Introduction to OCR with Python

by Dr. W.J.B. Mattingly

Introduction

Optical Character Recognition, or OCR, is a common task in many domains. The earliest OCR systems were designed to serve the visually impaired. Its modern applications, however, extend to a far wider population. The goal of OCR is to take an input image and output raw text while maintaining the structure of the text in the image. In other words, its end goal is to preserve the line breaks, paragraph segmentation, and other structural features of the text on the page.

This course is designed to teach you how to automate OCR in Python for optimized results. It is meant to function alongside the YouTube series OCR in Python Tutorials.
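
To make the goal described in the introduction more concrete, here is a minimal sketch of what "maintaining the structure of the text" can mean in practice. It is not taken from the course notebooks. It assumes an OCR engine has already returned individual words with pixel coordinates; the `WordBox` type, the `words_to_lines` function, the `line_tolerance` parameter, and the sample data are all hypothetical names and values chosen for illustration. The sketch groups words into lines by their vertical position and then orders each line left to right, so the line breaks of the page survive in the output text.

```python
from typing import NamedTuple


class WordBox(NamedTuple):
    """A recognized word with its bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def words_to_lines(words, line_tolerance=10):
    """Group word boxes into text lines and return the reconstructed text.

    A word joins the current line when its top edge is within
    `line_tolerance` pixels of the top edge of that line's first word.
    """
    lines = []
    # Sort top-to-bottom first so that lines are discovered in reading order.
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        if lines and abs(word.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left-to-right before joining them.
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    )


# Made-up boxes, deliberately out of order, standing in for OCR output.
sample = [
    WordBox("world", 120, 12, 60, 20),
    WordBox("Second", 10, 48, 70, 20),
    WordBox("Hello", 10, 10, 55, 20),
    WordBox("line", 90, 50, 40, 20),
]

print(words_to_lines(sample))
```

A fixed pixel tolerance is the simplest possible choice and only suits clean, unskewed scans with a single column; skewed pages, multiple columns, or mixed font sizes would likely need a more careful grouping strategy.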

Organization of Textbook

Lesson  Name
01.01   Introduction to OCR
01.02   Introduction to the Libraries
01.03   How to Install Libraries
02.01   The Basics of Pillow
02.02   The Basics of OpenCV
02.03   The Basics of Tesseract
03.01   Passing Pillow Images to OpenCV
03.02   The Basics of OpenCV
03.03   Manipulating the Image
04.01   Bounding Boxes
04.02   Extracting Bounding Boxes
04.03   Organizing Bounding Boxes
05.01   Parameters of Tesseract
05.02   Cleaning the Output of Tesseract
06.01   Workflow for Standard OCR of Text
06.02   Workflow for Ignoring Footnotes
06.03   Workflow for Tables
07.01   Tesseract with non-English
07.02   Tesseract with Early Modern Scripts
07.03   Tesseract with non-Latin Scripts
