  • Stars: 391
  • Rank: 109,356 (Top 3 %)
  • Language: Python
  • Created: over 8 years ago
  • Updated: over 1 year ago

Repository Details

📋 Python wrapper to grab text from images and save it as text files using the Tesseract engine

Image2Text

Image2Text is a Python wrapper that grabs text from images and saves it as text files using Google's Tesseract engine. Tesseract is an optical character recognition (OCR) engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available.
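
As a rough illustration of the wrapping idea described above, the sketch below collects image files and calls the `tesseract` command-line program on each one. It is not the code from this repository: the function names (`collect_images`, `build_command`, `image_to_text`) and the list of image extensions are assumptions made for this example. It relies only on Tesseract's `tesseract <image> <outputbase>` invocation, which writes `<outputbase>.txt`.

```python
import subprocess
from pathlib import Path

# Extensions treated as images; this set is an assumption for illustration.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def collect_images(input_path):
    """Return image files for a single file path or a directory path."""
    path = Path(input_path)
    if path.is_file():
        return [path] if path.suffix.lower() in IMAGE_EXTENSIONS else []
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return []


def build_command(image_path, output_dir):
    """Build the command line: tesseract <image> <output base>."""
    output_base = Path(output_dir) / Path(image_path).stem
    # Tesseract appends ".txt" to the output base on its own.
    return ["tesseract", str(image_path), str(output_base)]


def image_to_text(image_path, output_dir):
    """Run Tesseract on one image and return the path of the text file."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_command(image_path, output_dir), check=True)
    return Path(output_dir) / (Path(image_path).stem + ".txt")
```

Calling `image_to_text(p, "output/")` for each `p` in `collect_images("sample/")` would then mirror the directory mode of the tool, provided Tesseract is installed and on PATH.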

Usage

python main.py -i <input_path> -o <output_path>

usage: main.py [-h] -i INPUT [-o OUTPUT] [-d]

required arguments:
  -i INPUT, --input INPUT       Single image file path or images directory path

optional arguments:
  -o OUTPUT, --output OUTPUT    (Optional) Output directory for converted text
  -d, --debug                   Enable verbose DEBUG logging

python main.py -i sample/

or

python main.py -i sample/ -o output/
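
The help text above could be produced by an `argparse` parser along the following lines. This is a sketch reconstructed from the usage output only, not the repository's actual `main.py`; the `build_parser` name is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser matching: main.py [-h] -i INPUT [-o OUTPUT] [-d]."""
    parser = argparse.ArgumentParser(prog="main.py")

    # A named group so that -i is listed under "required arguments".
    required = parser.add_argument_group("required arguments")
    required.add_argument(
        "-i", "--input", required=True,
        help="Single image file path or images directory path",
    )

    parser.add_argument(
        "-o", "--output",
        help="(Optional) Output directory for converted text",
    )
    parser.add_argument(
        "-d", "--debug", action="store_true",
        help="Enable verbose DEBUG logging",
    )
    return parser
```

With this parser, `build_parser().parse_args(["-i", "sample/"])` leaves `output` as `None`, so the caller would have to pick a default output directory when `-o` is omitted.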

Running Tests

python -m unittest
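
Run without arguments, `python -m unittest` performs test discovery and picks up files named `test*.py` in the current directory. The sketch below shows the general shape of a test case that discovery would find. The `output_name` helper is hypothetical and defined inline so the example is self-contained; it is not taken from this repository's test suite.

```python
import unittest
from pathlib import Path


def output_name(image_path):
    """Hypothetical helper: map an image path to its text file name."""
    return Path(image_path).stem + ".txt"


class OutputNameTest(unittest.TestCase):
    def test_extension_is_replaced(self):
        self.assertEqual(output_name("sample/page.png"), "page.txt")

    def test_only_last_suffix_is_dropped(self):
        self.assertEqual(output_name("scan.v2.jpg"), "scan.v2.txt")
```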

Tesseract Installation

Linux

[sudo] apt-get install tesseract-ocr

Windows

  1. Install tesseract-ocr from UB Mannheim here: https://github.com/UB-Mannheim/tesseract/wiki
  2. Add the installed Tesseract-OCR directory path to the PATH system variable

Mac

brew install tesseract
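
Whichever platform is used, the installation can be checked by confirming that the `tesseract` executable is reachable through PATH. The following is a minimal sketch using only the Python standard library; the function names are assumptions for this example.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `tesseract --version`, or None."""
    path = find_tesseract(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version may go to stdout or stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else None
```

If `find_tesseract()` returns `None` on Windows after installing, the likely cause is that the Tesseract-OCR directory has not yet been added to PATH, or that the terminal was opened before PATH was changed.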

Sample Results

Sample Image

(Wikipedia page for Google | Lang : Simple English)

Text output

A man signing in at Google’s main afice, Googleplex.

Google Inc. is an American multinational corporation
that is best known for running one of the largest search
engines on the World Wide Web (WWW). Every day,
200 million (200,000,000) people use it. Google’s main
office (“Googleplex”) is in Mountain View, California,
USA.

With Google Search, people can also search for pictures,
Usenet newsgroups, news, and things to buy online. By
June 2004, Google had 4.28 billion web pages on its
database, 880 million (880,000,000) pictures and 845
million (845,000,000) Usenet messages — six billion
things.

“To google,” as an action word (verb) means “to search
for something on Google”. Because Google is so popular
(more than half of people on the web use it) it has been
used to mean “to search the web”. Google dislikes this
use since the name of the company is a trademark.

As a public company, Google Inc. trades on the
NASDAQ under the tickers GOOG and GOOGL.

In August 2015, Google announced it was being restruc-
tured under a new holding company called Alphabet Inc.

1 History

Google was started in early 1996 by Larry Page and
Sergey Brin, two students at Stanford University, USA.
It used to be called Backrub. Later, they made it into a
company, Google Inc., on September 7, 1998 at a friend’s
garage in Menlo Park, California. In February 1999, the
company moved to 165 University Ave., Palo Alto, Cal-
ifornia. Later that year, it moved to another place, now

called the “Googleplex”.

In September 2001, Google’s rating system (“PageR-
ank”, for saying which information is more helpful) got a
US. Patent. The patent was to Stanford University, with
Lawrence (Larry) Page as the inventor (the person who
first had the idea).

Google makes an important, though shrinking, percent-
age of its money through its friends like America Online
and InterActiveCorp. It has a special group known as the
Partner Solutions Organization (PSO) which helps make
contracts, helps making accounts better, and gives engi-
neering help.

2 How Google makes money

Google makes money by advertising. People or compa-
nies who want people to buy their product, service, or
ideas give Google money, and Google shows an adver-
tisement to people Google thinks will click on the adver-
tisement. Google only gets money when people click on
the link, so it tries to know as much about people as pos-
sible to only show the advertisement to the “right people”.
It does this with Google Analytics, which sends data back
to Google whenever someone visits a web site. From this
and other data, Google makes a profile about the person,
which it then uses to figure out which advertisements to
show.

3 The name “Google”

The name “Google” is a misspelling of the word
g00g01.[7][8] Milton Sirotta, nephew of US. mathemati-
cian Edward Kasner, made this word in 1938, for the
number 1 followed by one hundred zeroes ( 10100 ). It
is said that the word “googol” was chosen as a name for
this number because it sounded like baby talk. Google
uses this word because the company wants to make lots
of stuff on the Web easy to find and use. Andy Bechtol-
sheim first thought of the name.

The name for Google’s main office, the “Googleplex,” is a
play on a different, even bigger number, the "googolpleX",
which is 1 followed by one googol of zeroes.


Stargazers over time

