  • Language
    Shell
  • License
    MIT License
  • Created about 5 years ago
  • Updated over 1 year ago

Repository Details

`pdf2searchablepdf input.pdf` = voila! "input_searchable.pdf" is created & now has searchable text!

PDF2SearchablePDF

Status

It works! See Changelog below.

Table of Contents

  1. Description:
    1. Operating Systems:
    2. Usage:
    3. Image size notes:
    4. Compress your post-processed PDF:
  2. Quick Start:
    1. Install:
    2. Uninstall:
    3. Use:
  3. Dependencies:
    1. You Must Install these:
    2. It also relies on these, but they come pre-installed on Ubuntu:
  4. PDF2SearchablePDF Installation:
  5. Sample run and output:
  6. Changelog
    1. [v0.7.0] - 2022-11-04
    2. [v0.6.0] - 2022-10-12
    3. [v0.5.0] - 2021-03-02
    4. [v0.4.0] - 2020-03-14
    5. [v0.3.0] - 2019-12-29
    6. [v0.2.0] - 2019-12-29
    7. [v0.1.0] - 2019-11-10
  7. Alternative Software:
  8. KEYWORDS

Description:

tesseract can do OCR (Optical Character Recognition) on image files, but unfortunately NOT on PDF files as inputs. That makes converting a PDF to a searchable PDF a pain, so this program scripts the process using existing tools in order to make it stupid-simple for ANYONE to use!

Operating Systems:

Linux (tested and works), Mac (untested, but should work), and Windows (untested, but I think it would work):

  • Developed and tested primarily in Linux Ubuntu 16.04, 18.04, and 20.04.
  • For Windows, I think you can get it to run inside the Windows Subsystem for Linux (WSL), Cygwin, or in the terminal provided with Git for Windows (usually my preference when using Windows).
    • Here are some options to install the tesseract OCR (Optical Character Recognition) program, which pdf2searchablepdf requires, in Windows. See also the official tesseract documentation on this here and here.
      1. [Probably the easiest/best]: get Windows tesseract .exe binary files directly from UB-Mannheim here: https://github.com/UB-Mannheim/tesseract/wiki.
      2. Cygwin packages for tesseract: https://cygwin.com/cgi-bin2/package-grep.cgi?grep=tesseract&arch=x86_64.
    • Once you get tesseract installed from the .exe file provided by UB-Mannheim above, for instance, I'm pretty sure you can then just install Git for Windows, open the bash terminal it provides, and run pdf2searchablepdf from there. This assumes that the .exe installer places tesseract in the Windows PATH so that you can call tesseract from the command-line/MS-DOS-prompt in Windows.
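
Whichever terminal you use, a quick way to confirm that tesseract really is reachable on the PATH is sketched below. The helper name `require_cmd` is made up for this example and is not part of pdf2searchablepdf.

```shell
# Return 0 if the named command can be found on PATH; otherwise print a hint and return 1.
# NOTE: require_cmd is a hypothetical helper, not something pdf2searchablepdf provides.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "Error: '$1' not found on PATH." >&2
    return 1
}

# Example: check the two external tools before running pdf2searchablepdf
# require_cmd tesseract && require_cmd pdftoppm && echo "Dependencies look OK"
```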

Usage:

See help menu for full details & more examples:

pdf2searchablepdf -h

From the help menu:

pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0

Purpose: convert "input.pdf" to a searchable PDF named "input_searchable.pdf" by using tesseract to perform OCR (Optical Character Recognition) on the PDF.

Usage:

pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
        If the 1st positional argument (after options) is to an input pdf, then convert
        input.pdf to input_searchable.pdf using language "lang" for OCR. Otherwise, if the 1st
        argument is a path to a directory containing a bunch of images, convert the whole
        directory of images into a single PDF, using language "lang" for OCR!
pdf2searchablepdf
        print help menu, then exit

Options:

[-h|-?|--help]
        print help menu, then exit
[-v|--version]
        print author & version, then exit
[-d|--debug]
        Turn debug prints on while running the script
[-upw <password>]
        Specify the user password to open and read the PDF file. This option is passed directly
        through to the 'pdftoppm' cmd used internally to convert the PDF to images for OCR.
[--run_tests]
        Run unit tests for this program.

Examples:

pdf2searchablepdf mypdf.pdf deu
        Convert mypdf.pdf to a searchable PDF, using German text OCR, or
pdf2searchablepdf mypdf.pdf
        Convert mypdf.pdf to a searchable PDF, using English text OCR (the default).
pdf2searchablepdf mypdf.pdf --debug
        Same as above, except also print out the debug prints.
pdf2searchablepdf dir_of_imgs
        Convert all images in this directory, "dir_of_imgs", to a single, searchable PDF.
pdf2searchablepdf .
        Convert all images in the present directory, indicated by '.', to a single, searchable
        PDF.
pdf2searchablepdf -upw 1234 mypdf.pdf
        Convert mypdf.pdf to a searchable PDF, using English text OCR, while using the user
        password "1234" to open up and read the PDF.
pdf2searchablepdf mypdf.pdf -upw 1234
        Same as above.

Option Details:

[lang]
    The optional [lang] argument allows you to perform OCR in your language of choice. This
    parameter will be passed on to tesseract. You must use ISO 639-2 3-letter language codes.
    Ex: "deu" for German, "dan" for Danish, "eng" for English, etc. See the "LANGUAGES"
    section of the tesseract man pages ('man tesseract') for a complete list. If the [lang]
    parameter is not given, English will be used by default. If you don't have a desired
    language installed, it may be obtained from one of the following 3 repos (see tesseract man
    pages for details):
      - https://github.com/tesseract-ocr/tessdata_fast
      - https://github.com/tesseract-ocr/tessdata_best
      - https://github.com/tesseract-ocr/tessdata
    To install a new language, simply download the respective "*.traineddata" file from one of
    the 3 repos above and copy it to your tesseract installation's "tessdata" directory.
    See "Post-Install Instructions" here:
    https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md#post-install-instructions
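
Before passing a [lang] code, it can help to check whether tesseract already has that language. A minimal sketch, assuming the output of `tesseract --list-langs` lists one language code per line; `has_lang` is a hypothetical helper name, not part of this project:

```shell
# Return 0 if language code $1 appears as a whole line in the text read from stdin.
# NOTE: has_lang is a hypothetical helper; it only filters text, it does not call tesseract.
has_lang() {
    grep -qx -- "$1"
}

# Example (needs tesseract on PATH; 2>&1 is there in case the list is printed to stderr):
# tesseract --list-langs 2>&1 | has_lang deu || echo "deu.traineddata does not seem to be installed"
```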

Image size notes:

Note that when converting an entire directory of images, if the images are large (ex: jpeg images at 3MB each) when you start, your searchable PDF at the end will be very large too! As a rough estimate, sum the sizes of all the images to know how big the final PDF file will be! To reduce its size, one quick-and-easy way is to compress the jpeg images using jpegoptim before you call pdf2searchablepdf. Read more about jpegoptim here: https://www.tecmint.com/optimize-and-compress-jpeg-or-png-batch-images-linux-commandline/.

Here's an example demonstrating how to install jpegoptim, use it to compress an entire directory of jpeg images, then call pdf2searchablepdf to turn the whole directory of images into one single, searchable pdf. Assume your directory containing all the jpeg images is called "dir_of_imgs". Be sure no spaces exist anywhere in its path!

Install jpegoptim:

sudo apt update
sudo apt install jpegoptim

Compress all the images, then convert all of them to a single, searchable PDF:

jpegoptim --size=500k dir_of_imgs/*.jpg # Compress the whole dir of images! NB: `--size=300k` is 
                                        # about the smallest I'd go.
pdf2searchablepdf dir_of_imgs           # Now make 1 searchable pdf out of all of them!

For my particular case, with 7 jpeg images originally in the 2.5 to 3 MB size range, the end result without jpegoptim was a 20 MB PDF, which is too large to email! Calling jpegoptim --size=500k first, as shown above, shrank each image to approx. 500 kB, which meant the final PDF was about 3.5 MB instead of 20 MB! Big improvement! Now I can email the file, and the images still look pretty good!
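
The arithmetic behind those numbers: 7 images x 2.5 to 3 MB each is roughly 17.5 to 21 MB (hence the ~20 MB PDF), and 7 x 0.5 MB is 3.5 MB. To get the same estimate for your own directory before converting it, something like this sketch should work. `total_jpg_bytes` is a hypothetical helper name, and it only counts files ending in `.jpg`:

```shell
# Print the combined size, in bytes, of all *.jpg files directly inside directory $1.
# NOTE: total_jpg_bytes is a hypothetical helper, not part of pdf2searchablepdf.
total_jpg_bytes() {
    # $(( ... )) strips the padding that some wc implementations add around the number.
    echo $(( $(cat "$1"/*.jpg 2>/dev/null | wc -c) ))
}

# Example: rough size estimate in MB for the final searchable PDF
# echo "approx. $(( $(total_jpg_bytes dir_of_imgs) / 1000000 )) MB"
```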

Compress your post-processed PDF:

Update: as of v0.6.0, just use the -c or --compress option. Ex:

pdf2searchablepdf -c input.pdf
# or (same thing)
pdf2searchablepdf --compress input.pdf

See also my answer on AskUbuntu.com: "How can I reduce the file size of a scanned PDF file?"
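
If you want to compress a PDF by hand instead, Ghostscript (installed in the Quick Start below) can rewrite it with downsampled images. This is only a sketch of one common approach and is NOT necessarily what the -c option does internally; `compress_pdf` is a hypothetical function name, and the "ebook" default is an assumption you should tune to taste.

```shell
# Write a compressed copy of PDF $1 to $2 using Ghostscript's pdfwrite device.
# $3 is an optional -dPDFSETTINGS preset: screen, ebook (default here), printer, or prepress.
# NOTE: compress_pdf is a hypothetical helper, not the implementation behind -c/--compress.
compress_pdf() {
    if [ ! -f "$1" ]; then
        echo "Error: input file '$1' not found." >&2
        return 1
    fi
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS="/${3:-ebook}" \
       -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$2" "$1"
}

# Example:
# compress_pdf input_searchable.pdf input_searchable_small.pdf screen
```

Check the result by eye afterwards, since the smaller presets trade image quality for file size.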

Quick Start:

See here: https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881

Tested on Ubuntu 18.04 and 20.04.

Install:

  1. Install dependencies

    sudo apt update
    sudo apt install tesseract-ocr ghostscript
  2. Download the repository, and run the install script:

    git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
    cd PDF2SearchablePDF
    ./install.sh
  3. Log out and log back in if that script just created your ~/bin dir (if ls ~/bin shows only the one pdf2searchablepdf symlink in that dir, and nothing else, then this is likely the case).

    Note about what this does: this step simply causes the ~/.profile file in Ubuntu to add ~/bin to your executable PATH, so long as the ~/bin dir exists. If you need to manually add the ~/bin dir to your PATH (because you're using a different Linux distribution, for instance, which does not use the ~/.profile file like this) you can run this command to add it to your path just in the terminal you have open:

    PATH="$HOME/bin:$PATH"

    OR, you can add this to the bottom of your ~/.bashrc file, then close and re-open your terminal. Note: this is copied from Ubuntu's default ~/.profile file:

    # set PATH so it includes user's private bin if it exists
    if [ -d "$HOME/bin" ] ; then
        PATH="$HOME/bin:$PATH"
    fi
  4. (Optional, but recommended) run tests.

    ./run_tests.sh
    # Then, manually visually scan the output messages and inspect the 
    # output searchable PDF files to ensure everything looks like it worked 
    # correctly.
  5. Lastly, ensure you do NOT delete the PDF2SearchablePDF repository you downloaded. The install script did not copy the executable out of it; it created a symlink in ~/bin which points to the executable inside the repository.
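
For step 3 above, here is a small sketch for checking whether ~/bin is already in your PATH before you bother logging out and back in. `path_has` is a hypothetical helper name made up for this example.

```shell
# Return 0 if directory $2 is one of the colon-separated entries in PATH string $1.
# NOTE: path_has is a hypothetical helper, not part of install.sh.
path_has() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example: add ~/bin for the current terminal only if it is missing
# path_has "$PATH" "$HOME/bin" || PATH="$HOME/bin:$PATH"
```

Wrapping both strings in colons means only whole PATH entries match, so `/home/u/binx` is not mistaken for `/home/u/bin`.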

Uninstall:

Uninstallation is simple, if desired. You just need to run the commands below to delete a few things, or delete those things manually using your favorite file manager, such as nemo (see my detailed installation instructions for nemo in Ubuntu here).

# 1. delete the symlink in ~/bin
rm ~/bin/pdf2searchablepdf

# 2. (Optional) delete the entire PDF2SearchablePDF repository directory, and all contents
# in it. WARNING! CHOOSING THE WRONG PATH HERE WILL erase everything in the folder you specify,
# so BE VERY CAUTIOUS!
rm -rf path/to/PDF2SearchablePDF

# 3. (Optional) remove dependencies
sudo apt remove tesseract-ocr

Use:

pdf2searchablepdf mypdf.pdf

You'll now have a pdf called mypdf_searchable.pdf, which contains searchable text! See pdf2searchablepdf -h for many more usage examples.

Done. The wrapper has no Python dependencies, as it is currently written entirely in bash.
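
If you call pdf2searchablepdf from your own scripts, you may want to predict the output file name. A minimal sketch of the naming rule described above (input.pdf becomes input_searchable.pdf next to the input); `searchable_name` is a hypothetical helper, and it assumes a lowercase `.pdf` extension:

```shell
# Print the expected output name for input PDF $1: "<name>_searchable.pdf" beside the input.
# NOTE: searchable_name is a hypothetical helper; it assumes the input ends in ".pdf".
searchable_name() {
    printf '%s_searchable.pdf\n' "${1%.pdf}"
}

# Example:
# pdf2searchablepdf mypdf.pdf && ls -l "$(searchable_name mypdf.pdf)"
```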

Dependencies:

This has been tested on Ubuntu 18.04 and 20.04. It requires the following programs:

You Must Install these:

sudo apt update 
sudo apt install tesseract-ocr ghostscript

See: https://github.com/tesseract-ocr/tesseract/wiki

It also relies on these, but they come pre-installed on Ubuntu:

  1. pdftoppm

PDF2SearchablePDF Installation:

Simply run the "install.sh" script to create a symbolic link to pdf2searchablepdf in your ~/bin directory:

./install.sh

In short, just follow the "Install" instructions above under the "Quick Start" section.
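
Since the install is just a symlink, moving or deleting the repository later leaves a dangling link behind. A quick sketch for checking that; `symlink_ok` is a hypothetical helper name:

```shell
# Return 0 if $1 is a symlink whose target still exists; return 1 otherwise.
# NOTE: symlink_ok is a hypothetical helper, not part of install.sh.
symlink_ok() {
    [ -L "$1" ] && [ -e "$1" ]
}

# Example:
# symlink_ok ~/bin/pdf2searchablepdf || echo "Symlink is missing or dangling; re-run ./install.sh"
```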

Sample run and output:

$ pdf2searchablepdf ./test_pdfs/test1.pdf 
pdf2searchablepdf version 0.1.0
=================================================================================
Converting input PDF (./test_pdfs/test1.pdf) into a searchable PDF
=================================================================================
Creating temporary working directory: "pdf2searchablepdf_temp_20191111-001431.509915114"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
  working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
  operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "./test_pdfs/test1_searchable.pdf".
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 0 : pdf2searchablepdf_temp_20191111-001431.509915114/pg-1.tif
Page 1 : pdf2searchablepdf_temp_20191111-001431.509915114/pg-2.tif
Page 2 : pdf2searchablepdf_temp_20191111-001431.509915114/pg-3.tif
Done! Searchable PDF generated at "./test_pdfs/test1_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20191111-001431.509915114".
Done!

Total script run-time: 136 sec
END OF pdf2searchablepdf.
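
The steps in that output (PDF to TIF images with pdftoppm, then tesseract over all the images) can be roughly sketched by hand as follows. This is a simplified illustration, not the actual script: `ocr_pdf` is a hypothetical function name, the 300 DPI resolution is an assumption, and the real script's flags and temporary directory naming may differ.

```shell
# Sketch: render each page of PDF $1 to a TIF, then OCR all pages into one searchable PDF.
# $2 is an optional tesseract language code (default: eng).
# NOTE: ocr_pdf is a hypothetical illustration, not the code inside pdf2searchablepdf.
ocr_pdf() {
    if [ ! -f "$1" ]; then
        echo "Error: input file '$1' not found." >&2
        return 1
    fi
    ocr_tmp=$(mktemp -d) || return 1

    # 1. PDF -> one TIF per page (pg-1.tif, pg-2.tif, ...); 300 DPI is an assumed setting.
    # 2. Write the page image paths to a list file, one per line, in glob (page) order.
    # 3. Give tesseract the list file; the trailing "pdf" config writes "<outputbase>.pdf".
    pdftoppm -tiff -r 300 "$1" "$ocr_tmp/pg" &&
        printf '%s\n' "$ocr_tmp"/pg-*.tif > "$ocr_tmp/list.txt" &&
        tesseract "$ocr_tmp/list.txt" "${1%.pdf}_searchable" -l "${2:-eng}" pdf
    ocr_status=$?

    rm -rf "$ocr_tmp"
    return "$ocr_status"
}

# Example:
# ocr_pdf ./test_pdfs/test1.pdf deu
```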

Changelog

INITIAL DEVELOPMENT PHASE:

  • Use version numbers 0.MINOR.PATCH for the initial development phase; ex: 0.1.0, 0.2.0, etc.
  • Increment just the MINOR version number for each new 0.y.z development phase enhancement, until the project is mature enough that you choose to move to a 1.0.0 release
  • You may increment the PATCH number for bug fixes to your development code, or just increment the MINOR version number if there are also enhancements

MORE MATURE PHASE:

  • As the project matures, release a 1.0.0 version
  • Once you release a 1.0.0 version, do the following (copied from semver.org):
    • Given a version number MAJOR.MINOR.PATCH, increment the:
      1. MAJOR version when you make incompatible API changes,
      2. MINOR version when you add functionality in a backwards compatible manner, and
      3. PATCH version when you make backwards compatible bug fixes.
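
As a small illustration of those increment rules, here is a sketch that bumps one part of a MAJOR.MINOR.PATCH string and resets the lower parts to 0. `bump_version` is a hypothetical helper, not something this repository ships, and it assumes plain numeric parts without leading zeros.

```shell
# Print version $1 ("MAJOR.MINOR.PATCH") after bumping part $2: major, minor, or patch.
# NOTE: bump_version is a hypothetical helper written for this example only.
bump_version() {
    # Run in a subshell so the caller's IFS and positional parameters stay untouched.
    (
        part=$2
        IFS=.
        set -- $1    # split "X.Y.Z" on dots into $1=X, $2=Y, $3=Z
        case "$part" in
            major) echo "$(( $1 + 1 )).0.0" ;;
            minor) echo "$1.$(( $2 + 1 )).0" ;;
            patch) echo "$1.$2.$(( $3 + 1 ))" ;;
            *)     echo "Error: part must be major, minor, or patch." >&2; exit 1 ;;
        esac
    )
}

# Example: the step from v0.6.0 to v0.7.0 in this changelog is a MINOR bump
# bump_version 0.6.0 minor
```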

[v0.7.0] - 2022-11-04

[v0.6.0] - 2022-10-12

  • Added -c/--compress option to output compressed copies of the output PDF as well.
    • Partially fulfills this ticket: #11
    • Post-processing the PDF is a crude way to do it, but it's better than nothing. A better way to do it in the future would be to do OCR on the high-quality images and output the data to an intermediate format, then compress the images as desired and overlay the output OCR data onto the custom-compressed images. That will have to be future work.

[v0.5.0] - 2021-03-02

  • Massively improved the way argument parsing is done.
  • Added additional parsing options for debug prints and converting user-password-protected PDFs. Use the -upw <password> option to pass in a PDF's user password to be able to open and convert it. This works great on my password-protected home mortgage documents scanned and sent to me from the title company!

[v0.4.0] - 2020-03-14

  • Updated install.sh & pdf2searchablepdf.sh scripts to allow spaces in path names; fixes issue #6
  • Moved argument parsing code into parse_args() function inside pdf2searchablepdf.sh
  • Moved all main code into main() function inside pdf2searchablepdf.sh, and added time command to the call to main to time how long main takes to run

[v0.3.0] - 2019-12-29

  • Added a big new feature to allow the user to convert a whole directory containing a bunch of images into a single, searchable pdf!
  • New usage: pdf2searchablepdf <input.pdf | dir_of_imgs> [lang]
  • Also print the run duration at the end in minutes as well as seconds, instead of just seconds.

[v0.2.0] - 2019-12-29

  • Improved help menu, which is accessible via: pdf2searchablepdf -h or pdf2searchablepdf -? or pdf2searchablepdf
  • Added ability to set the OCR language; new usage: pdf2searchablepdf <input.pdf> [lang]

[v0.1.0] - 2019-11-10

  • Initial release. It works!
  • Can only convert a pdf to a searchable pdf in English, which is tesseract's default setting.
  • Usage: pdf2searchablepdf <input.pdf>

Alternative Software:

  1. https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty#4-others-utilities-tools-command-line-interfaces-cli-etc
    1. https://github.com/jbarlow83/OCRmyPDF
    2. https://github.com/LeoFCardoso/pdf2pdfocr
  • See my issue here: #5. Are these alternatives better than my project here? Do I offer something they don't? Should I continue this project or just switch to using one of the projects listed above? I need to investigate and find out more!

KEYWORDS

(to make this repo more "Googlable"):

pdf 2 searchable pdf, pdftosearchablepdf, pdf to searchable pdf, make pdf searchable, perform ocr on pdf to make it searchable, extract text from pdf, pdf to text, how to make a PDF document searchable, how to make an unsearchable PDF document searchable, how to perform OCR (Optical Character Recognition) on a PDF image, linux convert directory of images into a single pdf, linux convert images to pdf, images to pdf, images2pdf, linux convert a folder of images into a single pdf, tif to pdf, tiff to pdf, png to pdf, bmp to pdf, jpg to pdf, jpeg to pdf, folder of pictures to pdf, ocr on pictures, ocr on images, pictures ocr to searchable pdf
