• This repository has been archived on 15/Jun/2023
  • Stars: 1,002
  • Rank 45,804 (Top 1.0 %)
  • Language: Python
  • License: Apache License 2.0
  • Created about 9 years ago
  • Updated over 1 year ago

Repository Details

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.

PDFx

Introduction

Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

Features

  • Extract references and metadata from a given PDF
  • Detect PDF, URL, arXiv and DOI references
  • Fast, parallel download of all referenced PDFs
  • Find broken hyperlinks (using the -c flag)
  • Output as text or JSON (using the -j flag)
  • Extract the PDF text (using the --text flag)
  • Use as command-line tool or Python package
  • Compatible with Python 2 and 3
  • Works with local and online PDFs

Getting Started

Install PDFx with pip or easy_install, and run it:

$ pip install -U pdfx
# or, with the legacy easy_install:
$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit
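
When pdfx is driven from a script, the flags in the help output above can be assembled into an argument list. The following is a minimal sketch: `build_pdfx_command` and its parameter names are hypothetical helpers for illustration and are not part of pdfx; only the flags themselves come from the help text.

```python
import shlex


def build_pdfx_command(pdf, json_output=False, check_links=False,
                       verbose=False, text_only=False,
                       download_dir=None, output_file=None):
    """Return an argv list for the pdfx CLI (hypothetical helper, not part of pdfx)."""
    cmd = ["pdfx", pdf]
    if json_output:
        cmd.append("-j")            # output infos as JSON
    if check_links:
        cmd.append("-c")            # check for broken links
    if verbose:
        cmd.append("-v")            # print all references, not only PDFs
    if text_only:
        cmd.append("-t")            # only extract text
    if download_dir is not None:
        cmd += ["-d", download_dir]  # download referenced PDFs into this directory
    if output_file is not None:
        cmd += ["-o", output_file]   # write output to a file instead of the console
    return cmd


cmd = build_pdfx_command(
    "https://weakdh.org/imperfect-forward-secrecy.pdf",
    json_output=True,
    output_file="refs.json",
)
print(shlex.join(cmd))
# pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -j -o refs.json
```

With pdfx installed, the list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with unusual file names.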

Examples

Let's take a look at this paper: https://weakdh.org/imperfect-forward-secrecy.pdf

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

References: 36
- URL: 18
- PDF: 18

PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf

You can use the -v flag to output all references instead of just the PDFs.
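
The CreationDate and ModDate values in the output above use the PDF date format (D:YYYYMMDDHHmmSS followed by a UTC offset). As a rough sketch, such a string can be converted into a Python datetime as shown below. `parse_pdf_date` is a hypothetical helper, not a pdfx function, and it only handles the fully specified form seen in the example output, not the shortened variants the PDF format also allows.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS, optionally followed by Z or an offset such as -04'00'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Parse a fully specified PDF date string (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % (value,))
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    zulu, sign, off_hours, off_minutes = groups[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(-offset if sign == "-" else offset)
    else:
        tz = None  # no offset given: return a naive datetime
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20150821110623-04'00'").isoformat())
# 2015-08-21T11:06:23-04:00
```

The result matches the xap CreateDate value shown in the same output.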

Download all referenced PDFs with -d (for download-pdfs) to the specified directory (e.g. to /tmp/):

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
...

To extract text, you can use the -t flag:

# Extract text to console
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t

# Extract text to file
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt

To check for broken links use the -c flag:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c

[Example (with video) of checking for broken links](https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/).
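
Conceptually, a broken-link check requests every referenced URL and reports the ones that fail. The sketch below illustrates that idea with the standard library and a thread pool. It is not pdfx's implementation; `http_status` and `find_broken_links` are hypothetical names, and treating any status of 400 or above as broken is an assumption of this example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code          # server answered with an error status
    except (URLError, OSError, ValueError):
        return None              # DNS failure, refused connection, bad URL, ...


def find_broken_links(urls, get_status=http_status, max_workers=8):
    """Check urls in parallel; return (url, status) pairs considered broken."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(get_status, urls))
    return [(url, status) for url, status in zip(urls, statuses)
            if status is None or status >= 400]
```

Calling `find_broken_links(["http://cr.yp.to/factorization/smoothparts-20040510.pdf"])` would issue a real request. Some servers reject HEAD requests, so a more careful version might fall back to GET before declaring a link broken.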

Usage as Python library

>>> import pdfx
>>> pdf = pdfx.PDFx("filename-or-url.pdf")
>>> metadata = pdf.get_metadata()
>>> references_list = pdf.get_references()
>>> references_dict = pdf.get_references_as_dict()
>>> pdf.download_pdfs("target-directory")
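
The dictionary form of the references lends itself to simple post-processing. The sketch below assumes that `get_references_as_dict()` returns a mapping from reference type (such as "pdf" or "url") to a list of reference strings; that shape is an assumption of this example, and `summarize_references` is a hypothetical helper, not part of pdfx. The sample data is illustrative only.

```python
def summarize_references(references_dict):
    """Build a per-type count summary (hypothetical helper, not part of pdfx)."""
    counts = {ref_type: len(refs) for ref_type, refs in references_dict.items()}
    lines = ["References: %d" % sum(counts.values())]
    for ref_type in sorted(counts):
        lines.append("- %s: %d" % (ref_type.upper(), counts[ref_type]))
    return "\n".join(lines)


# Illustrative sample data in the assumed shape. With pdfx installed this would be:
#   references_dict = pdfx.PDFx("filename-or-url.pdf").get_references_as_dict()
sample = {
    "pdf": [
        "http://www.spiegel.de/media/media-35533.pdf",
        "http://cr.yp.to/factorization/smoothparts-20040510.pdf",
    ],
    "url": ["https://example.com/placeholder"],
}
print(summarize_references(sample))
# References: 3
# - PDF: 2
# - URL: 1
```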

Dev & Contributing

# Setup venv
python3 -m venv venv
. venv/bin/activate

# Install PDFx and dev deps
pip install -e .
pip install -r requirements_dev.txt

# Run tests and checks
make test
make lint
make check

# Format the code (with black)
make format

Releasing

  • Update version number in setup.py and pdfx/__init__.py
  • Create a git tag starting with v (e.g. git tag v1.5.9)
  • Push the tag to GitHub: git push --tags

GitHub Actions then publishes the release to PyPI.
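
Because the version number lives in more than one place, a quick consistency check before pushing the tag can catch mismatches. The following sketch assumes the version is stored as a `__version__ = "..."` assignment in pdfx/__init__.py, which the text above does not confirm. `extract_version` and `tag_matches_version` are hypothetical helpers.

```python
import re

# Matches a line such as: __version__ = "1.5.9"
_VERSION_RE = re.compile(
    r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE
)


def extract_version(source_text):
    """Return the version string from Python source text (assumed layout)."""
    match = _VERSION_RE.search(source_text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches_version(tag, version):
    """A release tag is expected to be the version prefixed with 'v'."""
    return tag == "v" + version


init_source = '__version__ = "1.5.9"\n'  # stand-in for the file contents
print(tag_matches_version("v1.5.9", extract_version(init_source)))
# True
```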

Various

Feedback, ideas and pull requests are welcome!

Improvement Ideas

Possible:

  • Add a timeout (see #43)
  • Links that span two lines are cut off (see #40)
  • Include the check-links results in the output (see #39)
