PDF to TXT (with OCR)
Given one or more PDFs that may include text-as-image content, use OCR (Optical Character Recognition) to convert the content to TXT files (in UTF-8 encoding).
Rationale
A survey of existing PDF-to-TXT solutions found no extant solutions that meet all of the following criteria:
- is an offline tool (to keep human-subject information secure)
- provides conversion from PDF to TXT (most existing OCR integrations assume an image as input)
- supports batch processing of multiple files
Assumptions
- This is (currently) a command-line tool, written in Python. Basic familiarity with executing commands in a terminal, as well as directory structure, is assumed.
- It is assumed that you have Python version 3.x installed, as well as pip.
- This script relies on Tesseract, an industry-standard open-source OCR engine whose development was long sponsored by Google. Because Tesseract is written in C++, it must be installed separately before Python can use it (instructions below). Similarly, a PDF-to-image library, Poppler, will need to be installed on Windows and Mac systems.
Setup
Windows
- Make a new folder on your Desktop called ocr (e.g., C:\Users\mark\Desktop\ocr).
- Download and install the Tesseract 4 OCR library from Tesseract at UB Mannheim.
- The installer should indicate the directory in which Tesseract-OCR was installed. Most likely, this will be either C:\Program Files (x86)\Tesseract-OCR or C:\Program Files\Tesseract-OCR. Move this folder into your equivalent of C:\Users\mark\Desktop\ocr, so that it is now located at Desktop\ocr\Tesseract-OCR.
- Download poppler for Windows.
- You may need to install 7Zip to unzip the executable, as well.
- Place the unzipped files in Desktop\ocr\poppler-0.68.0_x86.
- From your start menu, navigate to Control Panel > System and Security > System > Advanced System Settings.
- Then click Environment Variables.
- In the System Variables window, highlight Path, and click Edit.
- Click New to add an additional path.
- Paste the full path to the location of Tesseract (e.g., C:\Users\mark\Desktop\ocr\Tesseract-OCR) and press OK.
- Again, click New to add an additional path.
- Paste your equivalent of C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin and press OK.
- Press OK on any remaining control panel windows.
- Download OCR2Text to Desktop\ocr.
- Unzip the project.
- Open a cmd.exe terminal, and navigate to the folder via the command line (e.g., cd Desktop\ocr\ocr2text-master).
- Run pip install --user --requirement requirements.txt
- Optionally, you can check that you set up the PATH variable correctly in the Environment Variables steps above by typing echo %PATH%. The output must include your equivalent of C:\Users\mark\Desktop\ocr\Tesseract-OCR and C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin for the script to work.
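The same PATH check can be scripted. The following is a minimal sketch, not part of OCR2Text, that uses Python's standard shutil.which to look up the executables. The helper name find_missing_tools is hypothetical, and it assumes that tesseract and pdftoppm are the two executables the script depends on.

```python
import shutil

# Executables assumed to be required: Tesseract's CLI and Poppler's pdftoppm.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Because shutil.which honors the platform's executable extensions, the same names should also work on macOS and Linux.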
macOS
- Make a new folder on your Desktop called ocr (e.g., /Users/mark/Desktop/ocr).
- Install Tesseract-OCR using either MacPorts (sudo port install tesseract) or Homebrew (brew install tesseract).
- Install poppler for Mac.
- Download this GitHub project to /Users/mark/Desktop/ocr.
- Unzip the project.
- Open a terminal and navigate to the folder via the command line (e.g., cd /Users/mark/Desktop/ocr/ocr2text).
- Run pip install --user --requirement requirements.txt
Linux
- Install Tesseract: sudo apt-get install tesseract-ocr
- Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.
- Download this GitHub project.
- Unzip the project.
- Open a terminal and navigate to the folder.
- Run pip install --user --requirement requirements.txt
Usage
If you have successfully completed the setup steps and are using Python version 3, usage should now be a breeze:
On the command line, navigate to the directory where you downloaded the script and run:
python ocr2text.py
You will see the following:
********************************
*** PDF to TXT file, via OCR ***
********************************
Indicate file or folder of source PDF(s) []:
(Press [Enter] for current working directory)
Enter the full path to the file or directory to convert.
Destination folder for TXT []:
(Press [Enter] for current working directory)
Enter the full path to the directory where the result file(s) should be written.
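To make the two prompts concrete, here is a minimal sketch of how a source answer could be turned into a list of PDFs to process. It is an illustration under stated assumptions, not the script's actual code: the function name collect_pdfs is hypothetical, an empty answer is treated as the current working directory (as the prompt indicates), and folders are assumed to be scanned without recursion.

```python
from pathlib import Path


def collect_pdfs(source=""):
    """Return the PDF paths named by `source`, which may be a file or a folder.

    An empty string (the user pressed Enter) means the current working directory.
    """
    path = Path(source) if source else Path.cwd()
    if path.is_file():
        # A single file: accept it only if it has a .pdf extension.
        return [path] if path.suffix.lower() == ".pdf" else []
    if path.is_dir():
        # A folder: pick up every PDF directly inside it (no recursion).
        return sorted(p for p in path.iterdir()
                      if p.is_file() and p.suffix.lower() == ".pdf")
    raise FileNotFoundError(f"No such file or folder: {path}")
```

Handling both cases in one function is what allows the same prompt to serve single-file and batch conversions.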
The script will now convert the PDF via OCR into a plaintext file.
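A conversion of this kind is commonly done in two stages: render each PDF page to an image with Poppler, then run Tesseract on each image. The sketch below assumes the third-party pdf2image and pytesseract packages as Python wrappers for those two tools. Whether OCR2Text makes exactly these calls is an assumption, and the names pdf_to_text and convert_file are hypothetical.

```python
from pathlib import Path


def pdf_to_text(pdf_path, render=None, ocr=None):
    """OCR every page of a PDF and return the combined text.

    `render` turns a PDF path into a list of page images and `ocr` turns one
    image into a string. Both default to pdf2image / pytesseract (assumed to
    be installed); they are parameters so other back ends can be swapped in.
    """
    if render is None:
        from pdf2image import convert_from_path  # needs Poppler on PATH
        render = convert_from_path
    if ocr is None:
        from pytesseract import image_to_string  # needs Tesseract on PATH
        ocr = image_to_string
    pages = render(str(pdf_path))
    return "\n".join(ocr(page) for page in pages)


def convert_file(pdf_path, dest_dir, **kwargs):
    """Write the OCR text into `dest_dir` as <pdf name>.txt, encoded as UTF-8."""
    out_path = Path(dest_dir) / (Path(pdf_path).stem + ".txt")
    out_path.write_text(pdf_to_text(pdf_path, **kwargs), encoding="utf-8")
    return out_path
```

Calling convert_file("image.pdf", ".") would, under these assumptions, write image.txt to the current directory.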
Testing the installation
For testing purposes, a test_files directory is included. You can press [Enter] for the source and destination directories and verify that the image.pdf file is converted. The output will also be located in the test_files directory:
Converted C:\Users\mark\ocr2text\image.pdf
Percent: [##########] 100%
1 file converted
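For larger batches, the result can also be checked programmatically. This is a minimal sketch, assuming each TXT file is named after its source PDF (e.g., image.pdf becomes image.txt), which the output above does not confirm; the function name missing_outputs is hypothetical.

```python
from pathlib import Path


def missing_outputs(pdf_dir, txt_dir):
    """List PDFs in `pdf_dir` that have no non-empty .txt file in `txt_dir`."""
    missing = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumed naming scheme: same file name with a .txt extension.
        txt = Path(txt_dir) / (pdf.stem + ".txt")
        if not txt.is_file() or txt.stat().st_size == 0:
            missing.append(pdf.name)
    return missing
```

An empty list from missing_outputs("test_files", "test_files") would indicate that every PDF in the folder has a non-empty TXT counterpart.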