
pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are currently supported. I welcome submission of PDF files that exhibit unsupported aspects of the spec to assist with improving our support.

This is primarily a low-level library that should be used as the foundation for higher-level functionality; it's not going to render a PDF for you. There are a few exceptions that support very common use cases, such as extracting text from a page.

Installation

The recommended installation method is via RubyGems.

gem install pdf-reader

Usage

Begin by creating a PDF::Reader instance that points to a PDF file. Document-level information (metadata, page count, bookmarks, etc.) is available via this object.

require 'pdf-reader'

reader = PDF::Reader.new("somefile.pdf")

puts reader.pdf_version
puts reader.info
puts reader.metadata
puts reader.page_count
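A minimal sketch of gathering those document-level values into one hash. The helper name `summarize` and the `FakeReader` stand-in are hypothetical, and the sketch assumes `info` returns a hash keyed by symbols such as `:Title` (or nil when there is nothing to report). It accepts any object that responds to the methods shown above, so it runs without a PDF on disk.

```ruby
# Hypothetical helper: collect document-level details from a reader-like
# object (anything responding to pdf_version, page_count and info).
def summarize(reader)
  info = reader.info || {}   # assumption: info may be nil for some files
  {
    version: reader.pdf_version,
    pages:   reader.page_count,
    title:   info[:Title],   # assumption: keys are symbols such as :Title
    author:  info[:Author]
  }
end

# Stand-in object so the sketch runs without the gem or a real file.
FakeReader = Struct.new(:pdf_version, :page_count, :info)
fake = FakeReader.new(1.4, 3, { Title: "Example", Author: "Someone" })
p summarize(fake)
```

With the gem installed, passing `PDF::Reader.new("somefile.pdf")` in place of the stand-in should work the same way.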

PDF::Reader.new accepts an IO stream or a filename. Here's an example with an IO stream:

require 'open-uri'

# Use URI.open: on Ruby 3.0 and later, Kernel#open no longer accepts URLs
io     = URI.open('http://example.com/somefile.pdf')
reader = PDF::Reader.new(io)
puts reader.info

If you open a PDF with File.open or IO.open, I strongly recommend using "rb" mode to ensure the file isn't mangled by Ruby being 'helpful'. This is particularly important on Windows and MRI >= 1.9.2.

File.open("somefile.pdf", "rb") do |io|
  reader = PDF::Reader.new(io)
  puts reader.info
end
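To see what "rb" buys you, here is a small standard-library sketch that involves no PDF::Reader calls. It writes a few PDF-like bytes to a temporary file and reads them back in binary mode. The helper name `read_binary` and the sample bytes are made up for illustration.

```ruby
require 'tempfile'

# Read a file as raw bytes. "rb" disables newline translation and tags
# the result with Ruby's binary encoding instead of a text encoding.
def read_binary(path)
  File.open(path, "rb") { |io| io.read }
end

# Bytes resembling the start of a PDF: CRLF pairs plus a comment made of
# high-bit bytes that are not valid UTF-8.
SAMPLE_BYTES = "%PDF-1.7\r\n%\xE2\xE3\xCF\xD3\r\n".b

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write(SAMPLE_BYTES)
  tmp.flush

  data = read_binary(tmp.path)
  puts data.encoding          # the binary encoding
  puts data == SAMPLE_BYTES   # true when the bytes round-trip unchanged
end
```

In text mode on Windows the CRLF pairs would be rewritten on the way in, which is exactly the kind of 'help' a PDF parser doesn't want.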

PDF is a page-based file format, so most visible information is available via page-based iteration:

reader = PDF::Reader.new("somefile.pdf")

reader.pages.each do |page|
  puts page.fonts
  puts page.text
  puts page.raw_content
end
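Page iteration composes nicely with ordinary Enumerable methods. The sketch below finds the pages whose text mentions a term. The helper `pages_mentioning` and the `FakePage`/`FakeDoc` stand-ins are hypothetical; the helper only relies on `pages` and `text` as used above, so it can be tried without a real file.

```ruby
# Hypothetical helper: return the 1-based numbers of pages whose extracted
# text contains the given term. Works with any object exposing `pages`,
# where each page responds to `text`.
def pages_mentioning(reader, term)
  reader.pages.each_with_index.filter_map do |page, index|
    index + 1 if page.text.include?(term)
  end
end

# Stand-ins so the sketch runs without the gem or a PDF on disk.
FakePage = Struct.new(:text)
FakeDoc  = Struct.new(:pages)

doc = FakeDoc.new([
  FakePage.new("Invoice 42"),
  FakePage.new("Terms and conditions"),
  FakePage.new("Invoice 43")
])
p pages_mentioning(doc, "Invoice")
```

Note that `filter_map` needs Ruby 2.7 or newer; on older versions `map` followed by `compact` does the same job.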

If you need to access the full program for rendering a page, use the walk() method of PDF::Reader::Page.

class RedGreenBlue
  def set_rgb_color_for_nonstroking(r, g, b)
    puts "R: #{r}, G: #{g}, B: #{b}"
  end
end

reader   = PDF::Reader.new("somefile.pdf")
page     = reader.page(1)
receiver = RedGreenBlue.new
page.walk(receiver)
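A receiver can also keep what it sees instead of printing it. The sketch below is self-contained and does not call PDF::Reader: the `dispatch` method is a hypothetical stand-in for the walker, written on the assumption that a callback is only invoked when the receiver defines it. `ColorCollector` is likewise an illustrative name.

```ruby
# Receiver that remembers every fill and stroke color it is told about.
class ColorCollector
  attr_reader :colors

  def initialize
    @colors = []
  end

  def set_rgb_color_for_nonstroking(r, g, b)
    @colors << [:fill, r, g, b]
  end

  def set_rgb_color_for_stroking(r, g, b)
    @colors << [:stroke, r, g, b]
  end
end

# Hypothetical stand-in for what a walker does with each operator:
# invoke the callback only when the receiver defines it.
def dispatch(receiver, callback, *args)
  receiver.public_send(callback, *args) if receiver.respond_to?(callback)
end

collector = ColorCollector.new
dispatch(collector, :set_rgb_color_for_nonstroking, 1.0, 0.0, 0.0)
dispatch(collector, :show_text, "ignored")   # not defined, so skipped
dispatch(collector, :set_rgb_color_for_stroking, 0.0, 0.0, 1.0)
p collector.colors
```

With a real file, the equivalent would be `reader.page(1).walk(collector)` followed by inspecting `collector.colors`.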

For low level access to the objects in a PDF file, use the ObjectHash class like so:

reader  = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
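Dumping every object gets noisy quickly, so a summary is often more useful. This sketch tallies dictionary objects by their `:Type` entry. It assumes iterating the object hash yields reference/object pairs and that PDF dictionaries arrive as Ruby hashes with symbol keys; the helper `tally_types` is hypothetical, and a plain Hash stands in for `reader.objects` so the code runs on its own.

```ruby
# Hypothetical helper: count dictionary objects by their :Type entry.
# Accepts anything whose `each` yields (reference, object) pairs.
def tally_types(objects)
  counts = Hash.new(0)
  objects.each do |_ref, obj|
    # Skip anything that is not a dictionary carrying a :Type entry.
    next unless obj.is_a?(Hash) && obj.key?(:Type)
    counts[obj[:Type]] += 1
  end
  counts
end

# A plain Hash stands in for reader.objects in this sketch.
sample = {
  1 => { Type: :Catalog, Pages: 2 },
  2 => { Type: :Pages, Count: 1 },
  3 => { Type: :Page },
  4 => "a string object",
  5 => [0, 0, 612, 792]
}
p tally_types(sample)
```

Objects that are not plain hashes (streams, for instance) are skipped by this helper, so treat the counts as a rough overview rather than a full inventory.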

Text Encoding

Regardless of the internal encoding used in the PDF, all text will be converted to UTF-8 before it is passed back from PDF::Reader.

Strings that contain binary data (like font blobs) will be marked as such.
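If you need to tell the two kinds of string apart, checking the encoding is one way to do it. This sketch assumes "marked as such" means the string is tagged with Ruby's binary (ASCII-8BIT) encoding; the helper `describe_string` is hypothetical and uses only core String methods.

```ruby
# Hypothetical helper: classify a string by how it is tagged.
def describe_string(str)
  if str.encoding == Encoding::BINARY
    :binary
  elsif str.encoding == Encoding::UTF_8 && str.valid_encoding?
    :utf8_text
  else
    :other
  end
end

p describe_string("caf\u00e9".encode("UTF-8"))   # text tagged as UTF-8
p describe_string("\x00\x01\xFF".b)              # raw bytes tagged as binary
```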

Former API

Version 1.0.0 of PDF::Reader introduced a new page-based API that provides efficient and easy access to any page.

The pre-1.0 API was deprecated during the 1.x release series, and has been removed from 2.0.0.

Exceptions

There are two key exceptions that you will need to watch out for when processing a PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the file should be valid, or that a corrupt file didn't raise an exception, please forward a copy of the file to the maintainers (preferably via the Google group) and we will attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently support. Again, we welcome submissions of PDF files that exhibit these features to help us with future code improvements.

MalformedPDFError has some subclasses if you want to detect finer grained issues. If you don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.

Any other exceptions should be considered bugs in PDF::Reader (please report them!).

PDF Integrity

Windows developers may run into problems when running specs due to MalformedPDFErrors. This is usually because CRLF characters are automatically added to some of the PDFs in the spec folder when you check out a branch from Git.

To remove any invalid CRLF characters added while checking out a branch from Git, run:

rake fix_integrity

Maintainers

Licensing

This library is distributed under the terms of the MIT License. See the included file for more detail.

Mailing List

Any questions or feedback should be sent to the PDF::Reader Google group. It's better that any answers be available for others instead of hiding in someone's inbox.

http://groups.google.com/group/pdf-reader

Examples

The easiest way to explain how this works in practice is to show some examples. Check out the examples/ directory for a few files.

Alternate Decoder

For PDF files containing Ascii85 streams, the ascii85_native gem can be used for increased performance. If the ascii85_native gem is detected, pdf-reader will automatically use the gem.

First, run gem install ascii85_native and then require the gem alongside pdf-reader:

require "pdf-reader"
require "ascii85_native"

Another way of enabling native Ascii85 decoding is to place gem 'ascii85_native' in your project's Gemfile.

Known Limitations

Occasionally some text cannot be extracted properly due to the way it has been stored, or the use of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate an unrecognisable character.
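If you want to flag pages where this happened, counting the placeholder is a cheap check. The sketch below assumes the box is U+25AF (WHITE VERTICAL RECTANGLE), so please confirm that against the version of the library you are using. The constant and helper names are hypothetical.

```ruby
# Assumption: the placeholder box is U+25AF (WHITE VERTICAL RECTANGLE).
# Confirm this against the pdf-reader version in use.
PLACEHOLDER = "\u25AF"

# Hypothetical helper: count placeholder characters in extracted text.
def unrecognised_count(text)
  text.count(PLACEHOLDER)
end

sample = "Total: 12\u25AF50 \u25AF"
puts unrecognised_count(sample)
```

A non-zero count for `page.text` is a hint that the page deserves a manual look.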
