A general purpose PDF text-layer redaction tool for Python 2/3.

pdf-redactor

A general-purpose PDF text-layer redaction tool, in pure Python, by Joshua Tauberer and Antoine McGrath.

pdf-redactor uses pdfrw under the hood to parse and write out the PDF.


This Python module is a general tool to help you automatically redact text from PDFs. The tool operates on:

  • the text layer of the document's pages (content stream text)
  • plain text annotations
  • link target URLs
  • the Document Information Dictionary, a.k.a. the PDF metadata like Title and Author
  • embedded XMP metadata, if present

Graphical elements, images, and other embedded resources are not touched.

You can:

  • Use regular expressions to perform text substitution on the text layer (e.g. replace social security numbers with "XXX-XX-XXXX").
  • Rewrite, remove, or add new metadata fields on a field-by-field basis (e.g. wipe out all metadata except for certain fields).
  • Rewrite, remove, or add XML metadata using functions that operate on the parsed XMP DOM (e.g. wipe out XMP metadata).
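The field-by-field metadata rewriting described above can be modelled in a few lines of plain Python. This is a stand-alone sketch, not the library's code: the function name filter_metadata, the "DEFAULT" catch-all key, and the convention that returning None removes a field are all assumptions made for illustration.

```python
# Illustrative model of field-by-field metadata filtering.
# Assumptions of this sketch (not confirmed pdf-redactor API):
#   - filters maps a field name to a list of functions applied in order
#   - a "DEFAULT" entry applies to fields without their own entry
#   - a filter returning None removes the field

def filter_metadata(metadata, filters):
    """Return a new dict with each field passed through its filter chain."""
    result = {}
    for field, value in metadata.items():
        chain = filters.get(field, filters.get("DEFAULT", []))
        for func in chain:
            value = func(value)
            if value is None:
                break  # field removed, skip the remaining filters
        if value is not None:
            result[field] = value
    return result


original = {"Title": "Quarterly report", "Author": "Jane Doe", "Producer": "Writer"}
filters = {
    # Keep the title, but rewrite part of it.
    "Title": [lambda value: value.replace("Quarterly", "Redacted")],
    # Wipe out every field that has no filter of its own.
    "DEFAULT": [lambda value: None],
}
redacted = filter_metadata(original, filters)
print(redacted)
```

Expressing the policy as "rewrite these fields, drop everything else" means a metadata field you did not anticipate is removed rather than passed through.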

How to use pdf-redactor

Get this module and then install its dependencies with:

pip3 install -r requirements.txt

pdf_redactor.py processes a PDF given on standard input and writes a new, redacted PDF to standard output:

python3 pdf_redactor.py < document.pdf > document-redacted.pdf

However, the command-line version of the tool does not yet actually change anything in the PDF. To redact anything, use the pdf_redactor module as a library and pass in text filtering functions written in Python. The example.py script shows how to redact Social Security Numbers:

python3 example.py < tests/test-ssns.pdf > document-redacted.pdf
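To give a sense of what a text filtering function looks like, here is a minimal stand-alone sketch that pairs a regular expression with a replacement function and applies it to an ordinary string. It does not import pdf_redactor. The names SSN_PATTERN, content_filters, and apply_filters, and the simplified pattern, are assumptions for illustration; see example.py for the form the library actually expects.

```python
import re

# A deliberately simple SSN pattern: three digits, two digits, four digits,
# separated by hyphens and not embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

# Each filter pairs a compiled pattern with a function that receives the
# match object and returns the replacement text (an assumed shape).
content_filters = [
    (SSN_PATTERN, lambda match: "XXX-XX-XXXX"),
]


def apply_filters(text, filters):
    """Apply every (pattern, replacement function) pair to the text in order."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text


sample = "Applicant SSN: 123-45-6789, phone 202-555-0100."
print(apply_filters(sample, content_filters))
```

The lookbehind and lookahead keep the pattern from matching inside longer digit strings, and the phone number is left alone because its middle group has three digits rather than two.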

Limitations

Not all content may be redacted

The PDF format is an incredibly complex data standard that has hundreds, if not thousands, of exotic capabilities used rarely or in specialized circumstances. This tool scans and can redact text from a document's text layer, metadata, and the other components listed above, but there are many other components of PDF documents that it does not look at, such as:

  • embedded files, multimedia, and scripts
  • rich text annotations
  • forms
  • internal object names
  • digital signatures

PDF documents have so many exotic capabilities that this list is only a very partial one. Writing a redaction tool that scanned every place content can be hidden inside a PDF would take a lot more effort, so it is your responsibility to ensure that the PDFs you use this tool on only use the capabilities of the PDF format that this tool knows how to redact.

Character replacement

One of the PDF format's strengths is that it embeds font information so that documents can be displayed even if the fonts used to create the PDF aren't available when the PDF is viewed. Most PDFs are optimized to only embed the font information for characters that are actually used in the document. So if a document doesn't contain a particular letter or symbol, information for rendering the letter or symbol is not stored in the PDF.

This has an unfortunate consequence for redaction in the text layer. Since redaction in the text layer works by performing simple text substitution in the text stream, you may create replacement text that contains characters that were not previously in the PDF. Those characters simply won't show up when the PDF is viewed because the PDF didn't contain any information about how to display them.

To get around this problem, pdf_redactor checks your replacement text for new characters and replaces them with characters from the content_replacement_glyphs list (defaulting to ?, #, *, and a space) if any of those characters are present in the font information already stored in the PDF. If at least one of those characters is present, your replacement text will at least show up as something rather than disappear. If none of them are present, the new characters will still not be displayed.
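The fallback idea can be sketched in plain Python as follows. This is not the library's implementation: the function name substitute_missing_glyphs, the use of a simple set of characters to stand in for the font's embedded glyphs, and the choice to take the first fallback glyph that is available are all assumptions of this example.

```python
DEFAULT_REPLACEMENT_GLYPHS = ["?", "#", "*", " "]


def substitute_missing_glyphs(replacement, available,
                              fallbacks=DEFAULT_REPLACEMENT_GLYPHS):
    """Swap characters the font cannot render for a fallback glyph.

    `available` stands in for the set of characters the embedded font
    can display. If no fallback glyph is available either, the text is
    returned unchanged (and those characters would not be displayed).
    """
    # Assumption: use the first fallback glyph that the font contains.
    fallback = next((glyph for glyph in fallbacks if glyph in available), None)
    if fallback is None:
        return replacement
    return "".join(ch if ch in available else fallback for ch in replacement)


# The font only has digits, a hyphen, a space, and "#": no letter "X".
font_chars = set("0123456789- #")
print(substitute_missing_glyphs("XXX-XX-XXXX", font_chars))
```

In this example the letter X is missing from the font, "?" is missing too, and "#" is the first fallback that is present, so every X becomes "#".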

Content stream compression

Because pdfrw doesn't support all content stream compression methods, you should use a tool like qpdf to decompress the PDF prior to using this tool, and then to re-compress and web-optimize (linearize) the PDF after. The full command would be something like:

qpdf --stream-data=uncompress document.pdf - \
 | python3 pdf_redactor.py > /tmp/temp.pdf \
 && qpdf --linearize /tmp/temp.pdf document-redacted.pdf

(qpdf's first argument can't be standard input, unfortunately, so the redacted output has to be written to a temporary file instead of being piped straight into the second qpdf call.)
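If you would rather drive the same three steps from Python, a sketch along the following lines may help. It only uses the qpdf options shown in the command above. The function names build_commands and run_pipeline, the redactor script path, and the error handling are assumptions of this example, and nothing runs until run_pipeline is called.

```python
import subprocess


def build_commands(source, temp, dest, redactor="pdf_redactor.py"):
    """Return the three argument lists: decompress, redact, linearize."""
    decompress = ["qpdf", "--stream-data=uncompress", source, "-"]
    redact = ["python3", redactor]
    linearize = ["qpdf", "--linearize", temp, dest]
    return decompress, redact, linearize


def run_pipeline(source, temp, dest):
    """Decompress with qpdf, redact into a temp file, then linearize."""
    decompress, redact, linearize = build_commands(source, temp, dest)
    with open(temp, "wb") as temp_file:
        # qpdf writes the decompressed PDF to its stdout ("-") ...
        qpdf = subprocess.Popen(decompress, stdout=subprocess.PIPE)
        # ... which becomes the redactor's stdin; its output goes to temp.
        subprocess.run(redact, stdin=qpdf.stdout, stdout=temp_file, check=True)
        qpdf.stdout.close()
        if qpdf.wait() != 0:
            # Stricter than the shell version: any nonzero qpdf status aborts.
            raise RuntimeError("qpdf did not exit cleanly while decompressing")
    # The second qpdf call reads the temp file, since it cannot read stdin.
    subprocess.run(linearize, check=True)


print(build_commands("document.pdf", "/tmp/temp.pdf", "document-redacted.pdf"))
```

In practice you would replace the default redactor script with your own script that sets up filters, in the way example.py does.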

Exotic fonts

This tool has a limited understanding of glyph-to-Unicode codepoint mappings. Some unusual fonts may not be processed correctly, in which case text layer redaction regular expressions may not match or substitution text may not render correctly.

Testing that it worked

If you're redacting metadata, you should check the output using pdfinfo from the poppler-utils package:

# check that the metadata is fully redacted
pdfinfo -meta document-redacted.pdf
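Reading the pdfinfo output by eye is easy to get wrong on long metadata, so you may want to search it for the strings you meant to remove. The sketch below is one way to do that. The function names find_leaks and metadata_text and the sample patterns are assumptions for illustration, and metadata_text requires pdfinfo to be installed.

```python
import re
import subprocess


def find_leaks(text, patterns):
    """Return every substring of text that matches one of the patterns."""
    leaks = []
    for pattern in patterns:
        leaks.extend(match.group(0) for match in re.finditer(pattern, text))
    return leaks


def metadata_text(pdf_path):
    """Return the output of `pdfinfo -meta` for the given PDF."""
    result = subprocess.run(
        ["pdfinfo", "-meta", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Stand-in for real pdfinfo output, so the example runs without a PDF.
sample_output = "Title: Redacted report\nAuthor: Jane Doe\n"
print(find_leaks(sample_output, [r"Jane Doe", r"\d{3}-\d{2}-\d{4}"]))
```

An empty list means none of the patterns were found in that output. It does not show that the rest of the PDF is clean.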

Developing/testing the library

Tests require some additional packages:

pip install -r requirements-dev.txt
python tests/run_tests.py

The file tests/test-ssns.pdf was generated by converting the file tests/test-ssns.odft to PDF in LibreOffice, with the Archive PDF/A-1a option turned on so that it generates XMP metadata and the Export comments option turned on so that the comment is exported.
