  • Stars: 1,619
  • Rank: 28,903 (Top 0.6 %)
  • Language: Python
  • License: MIT License
  • Created over 7 years ago
  • Updated 5 months ago

Repository Details

A Python module that wraps the pdftoppm utility to convert a PDF to a PIL Image object

pdf2image

A Python (3.7+) module that wraps pdftoppm and pdftocairo to convert a PDF to a PIL Image object

How to install

pip install pdf2image

Windows

Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path=r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
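If you would rather not touch PATH, the poppler_path route can be sketched roughly as follows. The helper name convert_with_poppler_dir is hypothetical (it is not part of pdf2image); it only checks that the folder exists before handing it to convert_from_path.

```python
from pathlib import Path


def convert_with_poppler_dir(pdf_path, poppler_bin_dir):
    """Convert a PDF using poppler binaries from an explicit bin/ folder.

    Hypothetical helper for illustration; not part of pdf2image.
    """
    bin_dir = Path(poppler_bin_dir)
    if not bin_dir.is_dir():
        # Fail early with a clear message instead of a "poppler not found" error later.
        raise NotADirectoryError(f"poppler bin folder not found: {bin_dir}")

    # Imported here so the folder check above works even before pdf2image is installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, poppler_path=str(bin_dir))
```

A call would then look like convert_with_poppler_dir("example.pdf", r"C:\path\to\poppler-xx\bin"), with the path replaced by wherever you unpacked poppler.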

Mac

Mac users will have to install poppler.

Installing using Brew:

brew install poppler

Linux

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.

Platform-independent (Using conda)

  1. Install poppler: conda install -c conda-forge poppler
  2. Install pdf2image: pip install pdf2image
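Whichever platform you are on, a quick way to confirm that the poppler tools are reachable is to look them up on PATH. This is a minimal sketch using only the standard library; the function name missing_poppler_tools and the POPPLER_TOOLS tuple are assumptions made for this example.

```python
import shutil

# Command-line tools from poppler that this README mentions.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All poppler tools found.")
```

If a tool is reported missing on Windows, either fix PATH or use the poppler_path argument instead.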

How does it work?

from pdf2image import convert_from_path, convert_from_bytes

from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

Then simply do:

images = convert_from_path('/home/belval/example.pdf')

OR

with open('/home/belval/example.pdf', 'rb') as f:
    images = convert_from_bytes(f.read())

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.
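Since each item is a PIL Image, writing the pages to disk is a matter of calling save on each one. The sketch below assumes a hypothetical helper named save_pages and a page_001.png naming scheme; neither comes from pdf2image.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page", fmt="PNG"):
    """Save each page image as <stem>_<NNN>.<ext> and return the written paths.

    Hypothetical helper: `images` is the list returned by convert_from_path
    or convert_from_bytes.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for number, image in enumerate(images, start=1):
        target = out / f"{stem}_{number:03d}.{fmt.lower()}"
        image.save(target, fmt)  # PIL's Image.save(fp, format)
        paths.append(target)
    return paths
```

For example, save_pages(images, "out") would write out/page_001.png, out/page_002.png and so on.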

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
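As a rough illustration of how these parameters combine, the sketch below converts only a page range and handles the exceptions imported earlier. The helper names page_range_kwargs and convert_page_range are hypothetical, and the error handling is just one possible policy, not what pdf2image prescribes.

```python
def page_range_kwargs(first_page, last_page, dpi=200, fmt="jpeg"):
    """Build keyword arguments for a 1-based, inclusive page range."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return {"dpi": dpi, "first_page": first_page, "last_page": last_page, "fmt": fmt}


def convert_page_range(pdf_path, first_page, last_page):
    """Convert pages first_page..last_page, returning [] if the PDF is unreadable."""
    # Imported lazily so page_range_kwargs can be used without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path, **page_range_kwargs(first_page, last_page))
    except PDFInfoNotInstalledError as exc:
        # Poppler is missing or not on PATH; see the install section.
        raise RuntimeError("poppler does not seem to be installed") from exc
    except (PDFPageCountError, PDFSyntaxError) as exc:
        print(f"Could not read {pdf_path}: {exc}")
        return []
```

Calling convert_page_range('/home/belval/example.pdf', 2, 5) would then return at most four images.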

What's new?

  • Allow users to hide attributes when using pdftoppm with hide_attributes (Thank you @StaticRocket)
  • Fix console opening on Windows (Thank you @OhMyAgnes!)
  • Add timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
  • Add use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
  • Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
  • jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
  • Add pdfinfo_from_path and pdfinfo_from_bytes, which expose the output of the pdfinfo CLI
  • paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
  • size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
    • size=400 will fit the image to a 400x400 box, preserving aspect ratio
    • size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
    • size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
  • grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
  • single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
  • Allow the user to specify poppler's installation path with poppler_path
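To make the three forms of size concrete, here is a small pure-Python illustration of the dimensions described above. It is not pdf2image's code: the function name expected_size is made up, the rounding is an assumption (poppler may round differently by a pixel), and the (None, height) case is assumed by symmetry.

```python
def expected_size(original, size):
    """Estimate the output (width, height) for a page of `original` pixels.

    Illustration only; rounding is an assumption.
    """
    width, height = original
    if size is None:
        return original

    if isinstance(size, int):
        # Fit inside a size x size box, preserving aspect ratio.
        scale = size / max(width, height)
        return (round(width * scale), round(height * scale))

    target_w, target_h = size
    if target_w is not None and target_h is not None:
        # Both given: exact size, aspect ratio not preserved.
        return (target_w, target_h)
    if target_w is not None:
        # Width given: height follows the aspect ratio.
        return (target_w, round(height * target_w / width))
    if target_h is not None:
        # Height given (assumed symmetric case): width follows the aspect ratio.
        return (round(width * target_h / height), target_h)
    return original


page = (1700, 2200)  # e.g. a US Letter page rendered at 200 dpi
print(expected_size(page, 400))          # (309, 400)
print(expected_size(page, (400, None)))  # (400, 518)
print(expected_size(page, (500, 500)))   # (500, 500)
```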

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
  • Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
  • If I/O is your bottleneck, using the JPEG format can lead to significant gains.
  • The PNG format is pretty slow because of the compression.
  • If you want to know the best settings (most settings will be fine anyway), you can clone the project and run python tests.py to get timings.
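If you only want a quick comparison on your own documents, something like the following sketch may be enough. The helpers time_conversion and compare_settings and the three candidate configurations are assumptions chosen to mirror the tips above; results will vary with your disk and PDF.

```python
import time


def time_conversion(converter, pdf_path, **kwargs):
    """Run converter(pdf_path, **kwargs) and return (page_count, seconds)."""
    start = time.perf_counter()
    pages = converter(pdf_path, **kwargs)
    return len(pages), time.perf_counter() - start


def compare_settings(pdf_path, output_folder):
    """Print timings for a few configurations suggested by the tips above."""
    from pdf2image import convert_from_path  # imported lazily

    candidates = {
        "in memory, ppm": {},
        "output folder, jpeg, 4 threads": {
            "output_folder": output_folder,
            "fmt": "jpeg",
            "thread_count": 4,
        },
        "output folder, png": {"output_folder": output_folder, "fmt": "png"},
    }
    for label, kwargs in candidates.items():
        pages, seconds = time_conversion(convert_from_path, pdf_path, **kwargs)
        print(f"{label}: {pages} pages in {seconds:.2f}s")
```

Pass a tempfile.TemporaryDirectory() path as output_folder so the generated files are cleaned up afterwards.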

Limitations / known issues

  • A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder).
  • Sometimes fails to read PDFs signed using DocuSign. See: Solution for DocuSign issue.
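One way to work around the memory limitation is to convert a big PDF a few pages at a time and keep only file paths. The sketch below combines pdfinfo_from_path, first_page/last_page, output_folder and paths_only; the helper names page_chunks and convert_in_chunks and the chunk size of 10 are assumptions, and it relies on the pdfinfo output containing a "Pages" entry.

```python
def page_chunks(total_pages, chunk_size):
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, output_folder, chunk_size=10):
    """Convert a large PDF chunk by chunk, returning the image paths."""
    from pdf2image import convert_from_path, pdfinfo_from_path  # imported lazily

    # Assumes the pdfinfo output exposes the page count under "Pages".
    total_pages = int(pdfinfo_from_path(pdf_path)["Pages"])

    paths = []
    for first, last in page_chunks(total_pages, chunk_size):
        paths.extend(
            convert_from_path(
                pdf_path,
                first_page=first,
                last_page=last,
                output_folder=output_folder,
                paths_only=True,  # return file paths instead of Image objects
                fmt="jpeg",
            )
        )
    return paths
```

For a 25-page document and the default chunk size, page_chunks yields (1, 10), (11, 20) and (21, 25).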

More Repositories

  1. TextRecognitionDataGenerator (Python, 3,268 stars): A synthetic data generator for text recognition
  2. CRNN (Python, 298 stars): A TensorFlow implementation of https://github.com/bgshih/crnn
  3. pdf2image-as-a-service (Shell, 59 stars): Deploying a basic application on GCP, AWS and Azure
  4. ML-IDS (Python, 36 stars): An IDS implementation using machine learning
  5. NRTR (Python, 30 stars): A TensorFlow implementation of NRTR, a No-Recurrence Seq2Seq Model for Scene Text Recognition
  6. ki4a (Java, 28 stars): SSH tunneling app with DNS forwarding. Based on https://github.com/staf621/ki4a
  7. NaiveCNN (Python, 20 stars): A naive (very simple!) implementation of a convolutional neural network
  8. opencv-mser (Python, 14 stars): A working example of OpenCV 3 MSER detector
  9. disklist (Python, 14 stars): A python list implementation that uses the disk to handle very large collections
  10. MobileNetV3 (Python, 13 stars): A tensorflow implementation of the paper "Searching for MobileNetV3" with the R-ASPP segmentation head
  11. BitcoinRNN (Python, 11 stars): A Recurrent Neural Network using Tensorflow to predict Bitcoin price
  12. AlphaMissenseCheck (Python, 9 stars): See how pathogenic your mutations are according to AlphaMissense based on your 23andme raw data
  13. raytracing (Cuda, 5 stars): Using CUDA to implement "Raytracing in one weekend" by Peter Shirley
  14. seal-rs (Rust, 4 stars): Experiments on using Microsoft SEAL library in Rust
  15. air-quality-station (Python, 4 stars): Combining the SNS011 sensor with an OrangePI to display PM2.5 and PM10 air quality measurements
  16. wikipedia2text (Python, 3 stars): A tool to convert a Wikipedia dump file into plain text
  17. hdbscan (Go, 3 stars): A go implementation of HDBSCAN
  18. dotfiles (Shell, 2 stars): Collection of dotfiles for vim, vscode, git, etc...
  19. ebird (Python, 2 stars): Detecting bird presence from satellite images
  20. TextRecognitionDataGeneratorDocs (JavaScript, 2 stars): Documentation for the TextRecognitionDataGenerator tool
  21. Scanner3D (Python, 1 star): Using learned and non-learned algorithms to reconstruct 3D objects with the SR300 camera
  22. CubePlanet (C++, 1 star): Minecraft clone in C++
  23. SentimentRNN (Python, 1 star): A recurrent network that uses word embeddings to do sentiment analysis in both French and English
  24. go-home (Go, 1 star): Pun intended
  25. reddit-json-dump-parser (Python, 1 star): A parser for the reddit data dump
  26. WebcamEyeTracking (Python, 1 star): Track your eye movements with your webcam
  27. go-link-shortener (Go, 1 star): A basic link shortening service written in go