wjbmattingly/ocr_python_textbook

Stars
205
Rank 191,264 (Top 4 %)
Language
Jupyter Notebook
Created over 3 years ago
Updated 6 months ago

Introduction to OCR with Python

by Dr. W.J.B. Mattingly

Introduction

Optical Character Recognition, or OCR, is a common task in many domains. The earliest OCR systems were designed to serve the visually impaired. Modern applications, however, reach a far wider population. The goal of OCR is to take an input image and output raw text while maintaining the structure of the text in the image. In other words, its end goal is to preserve the line breaks, paragraph segmentation, and other structural features of the text on the page.

This course is designed to teach you how to automate OCR in Python for optimized results. It is meant to be used alongside the YouTube series "OCR in Python Tutorials".
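To make the goal of preserving line breaks concrete, here is a minimal sketch of how recognized words might be put back into reading order. It is not code from this textbook: the `WordBox` class, the `boxes_to_text` function, and the `line_tolerance` parameter are hypothetical names chosen for illustration, and the sample coordinates are made up. It uses only the standard library and assumes an OCR engine has already returned each word with a pixel bounding box.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A recognized word and its bounding box (hypothetical structure)."""
    text: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def boxes_to_text(boxes, line_tolerance=0.5):
    """Rebuild line-structured text from unordered word boxes.

    A word joins the current line when its vertical center lies within
    line_tolerance * (height of that line's first word) of the line's
    reference center; otherwise it starts a new line.
    """
    lines = []  # each entry: [reference_center, reference_height, words]
    for box in sorted(boxes, key=lambda b: b.y + b.height / 2):
        center = box.y + box.height / 2
        if lines and abs(center - lines[-1][0]) <= line_tolerance * lines[-1][1]:
            lines[-1][2].append(box)
        else:
            lines.append([center, box.height, [box]])
    # Within each line, order words left to right.
    return "\n".join(
        " ".join(word.text for word in sorted(words, key=lambda b: b.x))
        for _, _, words in lines
    )


# Made-up sample: four words given out of reading order.
sample = [
    WordBox("world", 80, 12, 60, 20),
    WordBox("Hello", 10, 10, 60, 20),
    WordBox("line", 90, 51, 40, 20),
    WordBox("Second", 10, 50, 70, 20),
]
print(boxes_to_text(sample))
# Hello world
# Second line
```

Grouping by vertical center rather than by exact top edge tolerates the small baseline jitter that is common in scanned pages. Multi-column layouts, skewed scans, and paragraph detection would need more than this sketch provides.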

Organization of Textbook

Lesson	Name
01.01	Introduction to OCR
01.02	Introduction to the Libraries
01.03	How to Install Libraries
02.01	The Basics of Pillow
02.02	The Basics of OpenCV
02.03	The Basics of Tesseract
03.01	Passing Pillow Images to OpenCV
03.02	The Basics of OpenCV
03.03	Manipulating the Image
04.01	Bounding Boxes
04.02	Extracting Bounding Boxes
04.03	Organizing Bounding Boxes
05.01	Parameters of Tesseract
05.02	Cleaning the Output of Tesseract
06.01	Workflow for Standard OCR of Text
06.02	Workflow for Ignoring Footnotes
06.03	Workflow for Tables
07.01	Tesseract with non-English
07.02	Tesseract with Early Modern Scripts
07.03	Tesseract with non-Latin Scripts
