• Stars
    star
    860
  • Rank 53,022 (Top 2 %)
  • Language
    Python
  • License
    MIT License
  • Created over 7 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Simple PDF text extraction

pdftotext

PyPI Tests Downloads

Simple PDF text extraction

import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))

OS Dependencies

These instructions assume you're using Python 3 on a recent OS. Package names may differ for Python 2 or for an older OS.

Debian, Ubuntu, and friends

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel

macOS

brew install pkg-config poppler python

Windows

Currently tested only when using conda:

  • Install the Microsoft Visual C++ Build Tools
  • Install poppler through conda:
    conda install -c conda-forge poppler
    

Install

pip install pdftotext