PDF2SearchablePDF
Status
It works! See Changelog below.
Description:
tesseract has the ability to do OCR (Optical Character Recognition) on image files, but unfortunately NOT on PDF files as inputs. That makes converting a PDF to a searchable PDF a pain, so this program scripts the process using existing tools to make it stupid-simple for ANYONE to use!
Operating Systems:
Windows (untested, but I think it would work), Mac (untested, but should work), and Linux (tested and works):
- Developed and tested primarily in Linux Ubuntu 16.04, 18.04, and 20.04, but should run on any of the 3 operating systems I think: Windows, Mac, and Linux.
- For Windows, I think you can get it to run inside the Windows Subsystem for Linux (WSL), Cygwin, or in the terminal provided with Git for Windows (usually my preference when using Windows).
- Here are some options to install the tesseract OCR (Optical Character Recognition) program, which pdf2searchablepdf requires, in Windows. See also the official tesseract documentation on this here and here.
  - [Probably the easiest/best]: get Windows tesseract .exe binary files directly from UB-Mannheim here: https://github.com/UB-Mannheim/tesseract/wiki.
  - Cygwin packages for tesseract: https://cygwin.com/cgi-bin2/package-grep.cgi?grep=tesseract&arch=x86_64.
- Once you get tesseract installed from the .exe file provided by UB-Mannheim above, for instance, I'm pretty sure you can then just install Git for Windows, open the bash terminal it provides, and run pdf2searchablepdf from there. This assumes that the .exe installer places tesseract in the Windows PATH so that you can call tesseract from the command-line/MS-DOS-prompt in Windows.
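Whether tesseract (or any other required program) really is reachable from your terminal can be checked before running anything else. This is a minimal sketch, not part of pdf2searchablepdf itself; the function name check_dependency is hypothetical.

```shell
# Report whether a program can be found on the PATH.
# $1 = program name, e.g. "tesseract" or "pdftoppm"
check_dependency() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: NOT found on PATH" >&2
        return 1
    fi
}
```

For example, run `check_dependency tesseract && check_dependency pdftoppm` in the Git for Windows bash terminal, WSL, or any Linux or Mac terminal.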
Usage:
See help menu for full details & more examples:
pdf2searchablepdf -h
From the help menu:
pdf2searchablepdf ('pdf2searchablepdf') version 0.5.0
Purpose: convert "input.pdf" to a searchable PDF named "input_searchable.pdf" by using tesseract to perform OCR (Optical Character Recognition) on the PDF.
Usage:
    pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
        If the 1st positional argument (after options) is to an input pdf, then convert
        input.pdf to input_searchable.pdf using language "lang" for OCR. Otherwise, if the
        1st argument is a path to a directory containing a bunch of images, convert the
        whole directory of images into a single PDF, using language "lang" for OCR!
    pdf2searchablepdf
        print help menu, then exit
Options:
    [-h|-?|--help]
        print help menu, then exit
    [-v|--version]
        print author & version, then exit
    [-d|--debug]
        Turn debug prints on while running the script
    [-upw <password>]
        Specify the user password to open and read the PDF file. This option is passed
        directly through to the 'pdftoppm' cmd used internally to convert the PDF to
        images for OCR.
    [--run_tests]
        Run unit tests for this program.
Examples:
    pdf2searchablepdf mypdf.pdf deu
        Convert mypdf.pdf to a searchable PDF, using German text OCR, or
    pdf2searchablepdf mypdf.pdf
        Convert mypdf.pdf to a searchable PDF, using English text OCR (the default).
    pdf2searchablepdf mypdf.pdf --debug
        Same as above, except also print out the debug prints.
    pdf2searchablepdf dir_of_imgs
        Convert all images in this directory, "dir_of_imgs", to a single, searchable PDF.
    pdf2searchablepdf .
        Convert all images in the present directory, indicated by '.', to a single,
        searchable PDF.
    pdf2searchablepdf -upw 1234 mypdf.pdf
        Convert mypdf.pdf to a searchable PDF, using English text OCR, while using the
        user password "1234" to open up and read the PDF.
    pdf2searchablepdf mypdf.pdf -upw 1234
        Same as above.
Option Details:
    [lang]
        The optional [lang] argument allows you to perform OCR in your language of choice.
        This parameter will be passed on to tesseract. You must use ISO 639-2 3-letter
        language codes. Ex: "deu" for German, "dan" for Danish, "eng" for English, etc.
        See the "LANGUAGES" section of the tesseract man pages ('man tesseract') for a
        complete list. If the [lang] parameter is not given, English will be used by
        default. If you don't have a desired language installed, it may be obtained from
        one of the following 3 repos (see tesseract man pages for details):
        - https://github.com/tesseract-ocr/tessdata_fast
        - https://github.com/tesseract-ocr/tessdata_best
        - https://github.com/tesseract-ocr/tessdata
        To install a new language, simply download the respective "*.traineddata" file
        from one of the 3 repos above and copy it to your tesseract installation's
        "tessdata" directory. See "Post-Install Instructions" here:
        https://github.com/tesseract-ocr/tessdoc/blob/master/Compiling-%E2%80%93-GitInstallation.md#post-install-instructions
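Before passing a [lang] code, it can help to confirm that tesseract actually has that language installed. The sketch below relies on tesseract's `--list-langs` option, which prints one language code per line; the helper name has_lang is hypothetical and is not part of pdf2searchablepdf.

```shell
# Succeed if language code $1 appears as a whole line on stdin.
# Intended to be fed the output of `tesseract --list-langs`.
has_lang() {
    grep -qxF -- "$1"
}

# Usage (requires tesseract to be installed; 2>&1 is used because some
# tesseract versions print the list to stderr):
#   tesseract --list-langs 2>&1 | has_lang deu && echo "German is installed"
```

Matching whole lines (`-x`) with a fixed string (`-F`) avoids false positives such as "deu" matching a longer code that merely contains it.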
Image size notes:
Note that when converting an entire directory of images, if the images are large (ex: jpeg images at 3MB each) when you start, your searchable PDF at the end will be very large too! Simply sum the sizes of all the images to know how big the final PDF file will be! To reduce its size, one quick-and-easy way is to compress the jpeg images using jpegoptim before you call pdf2searchablepdf. Read more about jpegoptim here: https://www.tecmint.com/optimize-and-compress-jpeg-or-png-batch-images-linux-commandline/.
Here's an example demonstrating how to install jpegoptim, use it to compress an entire directory of jpeg images, then call pdf2searchablepdf to turn the whole directory of images into one single, searchable pdf. Assume your directory containing all the jpeg images is called "dir_of_imgs". Be sure no spaces exist anywhere in its path!
Install jpegoptim:
sudo apt update
sudo apt install jpegoptim
Compress all the images, then convert all of them to a single, searchable PDF:
jpegoptim --size=500k dir_of_imgs/*.jpg # Compress the whole dir of images! NB: `--size=300k` is
# about the smallest I'd go.
pdf2searchablepdf dir_of_imgs # Now make 1 searchable pdf out of all of them!
For my particular case, with 7 jpeg images originally in the 2.5 to 3MB size range, the end result without jpegoptim was a 20 MB PDF, which is too large to email! By first calling jpegoptim --size=500k as shown above, each image shrank to approx. 500kB, which meant the final PDF size was about 3.5MB instead of 20MB! Big improvement! Now I can email the file, and the images still look pretty good!
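The "sum the sizes of all the images" estimate from the start of this section can be scripted. This is only a rough sketch: the function name sum_image_bytes is hypothetical, and it only counts files ending in .jpg.

```shell
# Print the combined size, in bytes, of all .jpg files in directory $1.
# This approximates the size of the searchable PDF built from them.
sum_image_bytes() {
    cat "$1"/*.jpg | wc -c | tr -d '[:space:]'
}

# Usage:
#   echo "Expect a PDF of roughly $(sum_image_bytes dir_of_imgs) bytes"
```

Run it before and after jpegoptim to see how much the final PDF should shrink.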
Compress your post-processed PDF:
Update: as of v0.6.0, just use the -c or --compress option. Ex:
pdf2searchablepdf -c input.pdf
# or (same thing)
pdf2searchablepdf --compress input.pdf
See also my answer here: AskUbuntu.com: How can I reduce the file size of a scanned PDF file?.
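For PDFs that were produced some other way, a commonly used manual approach is to re-write the file with Ghostscript. This README does not show how the -c option works internally, so the sketch below is NOT necessarily what pdf2searchablepdf does; the function name compress_pdf and the default "ebook" preset are assumptions chosen for illustration.

```shell
# Re-write a PDF with Ghostscript, downsampling its images.
# $1 = input PDF, $2 = output PDF,
# $3 = optional Ghostscript preset: screen, ebook (default), printer, prepress
compress_pdf() {
    [ -f "$1" ] || { echo "No such file: $1" >&2; return 1; }
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
        -dPDFSETTINGS="/${3:-ebook}" \
        -dNOPAUSE -dQUIET -dBATCH \
        -sOutputFile="$2" "$1"
}

# Usage (requires ghostscript):
#   compress_pdf input_searchable.pdf input_searchable_small.pdf
```

Lower-quality presets such as "screen" give smaller files at the cost of blurrier images, so inspect the result before deleting the original.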
Quick Start:
See here: https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
Tested on Ubuntu 18.04 and 20.04.
Install:
- Install dependencies:
    sudo apt update
    sudo apt install tesseract-ocr ghostscript
- Download the repository, and run the install script:
    git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
    cd PDF2SearchablePDF
    ./install.sh
- Log out and log back in if that script just created your ~/bin dir (if ls ~/bin shows only the one pdf2searchablepdf symlink in that dir, and nothing else, then this is likely the case).
    Note about what this does: this step simply causes the ~/.profile file in Ubuntu to add ~/bin to your executable PATH, so long as the ~/bin dir exists. If you need to manually add the ~/bin dir to your PATH (because you're using a different Linux distribution, for instance, which does not use the ~/.profile file like this) you can run this command to add it to your path just in the terminal you have open:
        PATH="$HOME/bin:$PATH"
    OR, you can add this to the bottom of your ~/.bashrc file, then close and re-open your terminal. Note: this is copied from Ubuntu's default ~/.profile file:
        # set PATH so it includes user's private bin if it exists
        if [ -d "$HOME/bin" ] ; then
            PATH="$HOME/bin:$PATH"
        fi
- (Optional, but recommended) run tests.
    ./run_tests.sh
    # Then, manually visually scan the output messages and inspect the
    # output searchable PDF files to ensure everything looks like it worked
    # correctly.
- Lastly, ensure you do NOT delete the PDF2SearchablePDF repository you downloaded. The install script didn't copy the executable out of it; it created an executable symlink which points to it.
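If the pdf2searchablepdf command is not found after installing, the usual cause is that ~/bin is not on your PATH yet (see the log out/log in step above). A minimal sketch for checking that, using a hypothetical helper named path_contains:

```shell
# Succeed if directory $1 is an entry in the colon-separated list $2
# ($2 defaults to the current PATH).
path_contains() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   path_contains "$HOME/bin" || echo "~/bin is not on your PATH yet"
```

Wrapping the list in colons on both sides lets the pattern match whole entries only, so /usr/bin does not falsely match /usr/bin2.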
Uninstall:
Uninstallation is simple, if desired. You just need to run the commands below to delete a few things, or delete those things manually using your favorite file manager, such as nemo (see my detailed installation instructions for nemo in Ubuntu here).
# 1. delete the symlink in ~/bin
rm ~/bin/pdf2searchablepdf
# 2. (Optional) delete the entire PDF2SearchablePDF repository directory, and all contents
# in it. WARNING! CHOOSING THE WRONG PATH HERE WILL erase everything in the folder you specify,
# so BE VERY CAUTIOUS!
rm -rf path/to/PDF2SearchablePDF
# 3. (Optional) remove dependencies
sudo apt remove tesseract-ocr
Use:
pdf2searchablepdf mypdf.pdf
You'll now have a pdf called mypdf_searchable.pdf, which contains searchable text! See pdf2searchablepdf -h for many more usage examples.
Done. The wrapper has no python dependencies, as it's currently written entirely in bash.
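The output naming rule (input.pdf becomes input_searchable.pdf, in the same directory as the input) can be expressed with plain shell parameter expansion. This is an illustrative sketch only; the function name searchable_name is hypothetical and the real script may build the name differently.

```shell
# Print the searchable-PDF name for input path $1:
# strip a trailing ".pdf", then append "_searchable.pdf".
searchable_name() {
    printf '%s\n' "${1%.pdf}_searchable.pdf"
}

# Usage:
#   searchable_name mypdf.pdf
```

This is handy in your own scripts when you need to know where the output will land, for example to open or email it automatically.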
Dependencies:
This has been tested on Ubuntu 18.04 and 20.04. It requires the following programs:
You Must Install these:
sudo apt update
sudo apt install tesseract-ocr
See: https://github.com/tesseract-ocr/tesseract/wiki
It also relies on these, but they come pre-installed on Ubuntu:
pdftoppm
PDF2SearchablePDF Installation:
Simply run the "install.sh" script to create a symbolic link to pdf2searchablepdf in your ~/bin directory:
./install.sh
In short, just follow the "Install" instructions above under the "Quick Start" section.
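For readers curious what "create a symbolic link in ~/bin" amounts to, here is a minimal sketch. The contents of install.sh are not reproduced in this README, so treat this as an approximation; the function name install_symlink is hypothetical.

```shell
# Create (or replace) a symlink named $3 inside directory $2 that points
# at the script $1. Use an absolute path for $1 so the link keeps working
# no matter which directory you call it from.
install_symlink() {
    mkdir -p "$2" && ln -sf "$1" "$2/$3"
}

# Usage, run from inside the cloned repository:
#   install_symlink "$PWD/pdf2searchablepdf.sh" "$HOME/bin" pdf2searchablepdf
```

Because only a link is created, the command stops working if the repository it points to is moved or deleted.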
Sample run and output:
$ pdf2searchablepdf ./test_pdfs/test1.pdf
pdf2searchablepdf version 0.1.0
=================================================================================
Converting input PDF (./test_pdfs/test1.pdf) into a searchable PDF
=================================================================================
Creating temporary working directory: "pdf2searchablepdf_temp_20191111-001431.509915114"
Converting input PDF to a bunch of output TIF images inside temporary working directory.
- THIS COULD TAKE A LONG TIME (up to 45 sec or so per page)! Manually watch the temporary
working directory to see the pages created one-by-one to roughly monitor progress.
- NB: each TIF file created is ~25MB, so ensure you have enough disk space for this
operation to complete successfully.
All TIF files created.
Running tesseract OCR on all generated TIF images in the temporary working directory.
This could take some time.
Searchable PDF will be generated at "./test_pdfs/test1_searchable.pdf".
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 0 : pdf2searchablepdf_temp_20191111-001431.509915114/pg-1.tif
Page 1 : pdf2searchablepdf_temp_20191111-001431.509915114/pg-2.tif
Page 2 : pdf2searchablepdf_temp_20191111-001431.509915114/pg-3.tif
Done! Searchable PDF generated at "./test_pdfs/test1_searchable.pdf".
Removing temporary working directory at "pdf2searchablepdf_temp_20191111-001431.509915114".
Done!
Total script run-time: 136 sec
END OF pdf2searchablepdf.
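The sample output above suggests a two-stage pipeline: pdftoppm renders each page to a TIF image, then tesseract OCRs the list of images into one PDF. The following is a simplified sketch of that idea, not the script's actual code. The function name ocr_pdf, the file_list.txt name, and the 300 DPI resolution are assumptions made for illustration.

```shell
# Convert PDF $1 into "<name>_searchable.pdf" using OCR language $2
# (default: eng). Requires pdftoppm (from poppler) and tesseract.
ocr_pdf() {
    src="$1"
    ocr_lang="${2:-eng}"
    [ -f "$src" ] || { echo "No such file: $src" >&2; return 1; }

    # Timestamped working directory, similar to the one in the sample output.
    tmp="pdf2searchablepdf_temp_$(date +%Y%m%d-%H%M%S)"
    mkdir "$tmp" || return 1

    # 1. Render every page to a TIF image: $tmp/pg-1.tif, $tmp/pg-2.tif, ...
    #    (300 DPI is an assumed value.)
    pdftoppm -tiff -r 300 "$src" "$tmp/pg" || return 1

    # 2. Write the image paths, one per line, into a list file. This relies
    #    on the file names sorting in page order.
    ls "$tmp"/pg-*.tif > "$tmp/file_list.txt"

    # 3. Give tesseract the list file; the trailing "pdf" config tells it
    #    to produce a searchable PDF at "<output base>.pdf".
    tesseract "$tmp/file_list.txt" "${src%.pdf}_searchable" -l "$ocr_lang" pdf
    status=$?

    # 4. Remove the (large) temporary images.
    rm -rf "$tmp"
    return "$status"
}
```

Calling `ocr_pdf ./test_pdfs/test1.pdf` would write its result next to the input file, as the real tool does in the sample run.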
Changelog
- Newest on top
- Follows Semantic Versioning: MAJOR.MINOR.PATCH; see: https://semver.org/ for rules & FAQ.
- The 6 most common recommended types of changes are (see here: https://keepachangelog.com/en/1.0.0/): Added, Changed, Deprecated, Removed, Fixed, Security
INITIAL DEVELOPMENT PHASE:
- Use version numbers 0.MINOR.PATCH for the initial development phase; ex: 0.1.0, 0.2.0, etc.
- Increment just the MINOR version number for each new 0.y.z development phase enhancement, until the project is mature enough that you choose to move to a 1.0.0 release
- You may increment the PATCH number for bug fixes to your development code, or just increment the MINOR version number if there are also enhancements
MORE MATURE PHASE:
- As the project matures, release a 1.0.0 version
- Once you release a 1.0.0 version, do the following (copied from semver.org):
- Given a version number MAJOR.MINOR.PATCH, increment the:
- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards compatible manner, and
- PATCH version when you make backwards compatible bug fixes.
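As a worked illustration of the increment rules above, here is a small sketch that bumps one part of a MAJOR.MINOR.PATCH string and resets the lower parts to zero. The function name bump_version is hypothetical; it is not part of this project.

```shell
# Print the next version after $1 ("MAJOR.MINOR.PATCH") when bumping
# the part named by $2 ("major", "minor", or "patch").
bump_version() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}
    patch=${rest#*.}
    case "$2" in
        major) major=$((major + 1)); minor=0; patch=0 ;;
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "usage: bump_version X.Y.Z major|minor|patch" >&2; return 1 ;;
    esac
    echo "$major.$minor.$patch"
}
```

For example, `bump_version 0.6.0 minor` prints 0.7.0, matching the step between the two newest entries below.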
[v0.7.0] - 2022-11-04
- Added more error messages to help you upgrade poppler (which contains pdftoppm) if your version of it is out-of-date. See:
[v0.6.0] - 2022-10-12
- Added -c/--compress option to output compressed copies of the output PDF as well.
  - Partially fulfills this ticket: #11
- Post-processing the PDF is a crude way to do it, but it's better than nothing. A better way to do it in the future would be to do OCR on the high-quality images and output the data to an intermediate format, then compress the images as desired and overlay the output OCR data onto the custom-compressed images. That will have to be future work.
[v0.5.0] - 2021-03-02
- Massively improved the way argument parsing is done.
- Added additional parsing options for debug prints and converting user-password-protected PDFs. Use the -upw <password> option to pass in a PDF's user password to be able to open and convert it. This works great on my password-protected home mortgage documents scanned and sent to me from the title company!
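To illustrate the kind of parsing this entry describes, where options such as -upw may appear before or after the positional arguments, here is a minimal sketch. It is not the script's real parse_args() function, and the variable names (DEBUG, USER_PW, INPUT, LANG_CODE) are hypothetical.

```shell
# Parse "[options] <input> [lang]" with options allowed in any position.
# Results are left in DEBUG, USER_PW, INPUT, and LANG_CODE.
parse_args() {
    DEBUG=false; USER_PW=""; INPUT=""; LANG_CODE=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -d|--debug)
                DEBUG=true
                ;;
            -upw)
                if [ "$#" -lt 2 ]; then
                    echo "Error: -upw requires a password" >&2
                    return 1
                fi
                USER_PW="$2"
                shift    # also consume the password
                ;;
            *)
                # 1st positional = input, 2nd positional = language
                if [ -z "$INPUT" ]; then
                    INPUT="$1"
                elif [ -z "$LANG_CODE" ]; then
                    LANG_CODE="$1"
                else
                    echo "Error: unexpected argument: $1" >&2
                    return 1
                fi
                ;;
        esac
        shift
    done
    LANG_CODE="${LANG_CODE:-eng}"    # English is the default
}
```

With this approach, `parse_args -upw 1234 mypdf.pdf` and `parse_args mypdf.pdf -upw 1234` leave the same values behind.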
[v0.4.0] - 2020-03-14
- Updated install.sh & pdf2searchablepdf.sh scripts to allow spaces in path names; fixes issue #6
- Moved argument parsing code into parse_args() function inside pdf2searchablepdf.sh
- Moved all main code into main() function inside pdf2searchablepdf.sh, and added time command to the call to main to time how long main takes to run
[v0.3.0] - 2019-12-29
- Added a big new feature to allow the user to convert a whole directory containing a bunch of images into a single, searchable pdf!
- New usage:
pdf2searchablepdf <input.pdf | dir_of_imgs> [lang]
- Also added print of run duration at end in units of minutes too instead of just seconds.
[v0.2.0] - 2019-12-29
- Improved help menu, which is accessible via: pdf2searchablepdf -h or pdf2searchablepdf -? or pdf2searchablepdf
- Added ability to set the OCR language; new usage:
pdf2searchablepdf <input.pdf> [lang]
[v0.1.0] - 2019-11-10
- Initial release. It works!
- Can only convert a pdf to a searchable pdf in English, which is tesseract's default setting.
- Usage:
pdf2searchablepdf <input.pdf>
Alternative Software:
- See my issue here: #5. Are these alternatives better than my project here? Do I offer something they don't? Should I continue this project or just switch to using one of the projects listed above? I need to investigate and find out more!
KEYWORDS
(to make this repo more "Googlable"):
pdf 2 searchable pdf, pdftosearchablepdf, pdf to searchable pdf, make pdf searchable, perform ocr on pdf to make it searchable, extract text from pdf, pdf to text, how to make a PDF document searchable, how to make an unsearchable PDF document searchable, how to perform OCR (Optical Character Recognition) on a PDF image, linux convert directory of images into a single pdf, linux convert images to pdf, images to pdf, images2pdf, linux convert a folder of images into a single pdf, tif to pdf, tiff to pdf, png to pdf, bmp to pdf, jpg to pdf, jpeg to pdf, folder of pictures to pdf, ocr on pictures, ocr on images, pictures ocr to searchable pdf