• This repository has been archived on 30/Nov/2019
  • Language: Ruby
  • License: MIT License

Repository Details

MOVED TO https://gitlab.com/crossref/pdfextract

pdf-extract

A tool and library that can extract various areas of text from a PDF, especially a scholarly article PDF. It performs structural analysis to determine column bounds, headers, footers, sections, titles and so on. It can analyse and categorise sections into reference and non-reference sections and can split reference sections into individual references.

The latest version is 0.1.1. Earlier versions are far less reliable.

pdf-extract requires Ruby 1.9.1 or above.

Quick start

Install the latest version with:

$ gem install pdf-extract

Quick examples

Extract references from a PDF:

$ pdf-extract extract --references myfile.pdf

Extract references and a title from a PDF:

$ pdf-extract extract --references --titles myfile.pdf

Mark the locations of headers, footers and columns in a new PDF:

$ pdf-extract mark --columns --headers --footers myfile.pdf

Extract regions of text from a PDF, preserving line information (offsets from region origin):

$ pdf-extract extract --regions myfile.pdf

Extract regions of text from a PDF without line information (prettier and easier to read):

$ pdf-extract extract --regions --no-lines myfile.pdf

Resolve references to DOIs and output related metadata as BibTeX:

$ pdf-extract extract-bib --resolved_references myfile.pdf
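The quick examples above each handle one file. As a rough sketch, the reference extraction command can be wrapped in a loop to process a directory of PDFs. The function name `extract_all_references`, the `PDF_EXTRACT` override variable and the `.references.out` file suffix are all hypothetical. The sketch also assumes that `extract` prints its result on standard output and exits with a non-zero status on failure, neither of which is confirmed here.

```shell
#!/bin/sh
# Sketch: run the README's reference extraction over every PDF in a directory.
# Assumptions (not confirmed by the pdf-extract docs): `extract` writes its
# result to standard output and returns a non-zero status when it fails.

extract_all_references() {
  dir=${1:-.}
  for pdf in "$dir"/*.pdf; do
    # Skip the literal pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    out="${pdf%.pdf}.references.out"   # hypothetical output file name
    # PDF_EXTRACT is a hypothetical override for the executable to run.
    if "${PDF_EXTRACT:-pdf-extract}" extract --references "$pdf" > "$out"; then
      echo "ok: $pdf -> $out"
    else
      echo "failed: $pdf" >&2
    fi
  done
}
```

After sourcing the file, `extract_all_references ./papers` would write one output file next to each PDF in `./papers`.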

Problems

pdf-extract mistakes normal text for references when attempting to extract references.

pdf-extract attempts to identify reference sections by comparing section features to an idealised model of a reference section. Sometimes this can go wrong. If pdf-extract is producing reference output that clearly includes something that is not a reference, try reducing the reference_flex slightly:

$ pdf-extract extract --references --set reference_flex:0.18 myfile.pdf

The default for reference_flex is 0.2. Make small decrements.

pdf-extract extracts no references.

As above, but try increasing reference_flex a bit at a time:

$ pdf-extract extract --references --set reference_flex:0.25 myfile.pdf

Keep trying with small increments to reference_flex. Note that a reference_flex of 1 means pdf-extract will identify all sections as reference sections.
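The two tuning procedures above (small decrements when normal text is picked up, small increments when nothing is found) could be automated with a sweep that runs the same command once per reference_flex value and keeps each result for comparison. This is only a sketch. The function names, the `PDF_EXTRACT` override variable, the default 0.02 step and the output file names are assumptions, as is the idea that `extract` prints its result on standard output.

```shell
#!/bin/sh
# Sketch: sweep reference_flex between two values and save each run's output.
# Assumptions (not from the pdf-extract docs): the step size, the file naming,
# and that `extract` prints its result on standard output.

# Print evenly spaced values from $1 to $2 (inclusive) in steps of $3.
# Works in either direction; steps are counted as integers to avoid drift.
flex_steps() {
  LC_ALL=C awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    a += 0; b += 0; s += 0
    d = (b >= a) ? 1 : -1
    n = int((b - a) * d / s + 0.5)
    for (i = 0; i <= n; i++) printf "%.2f\n", a + d * i * s
  }'
}

# Run the README's extract command once per value, one output file per run.
# PDF_EXTRACT is a hypothetical override for the executable to run.
sweep_reference_flex() {
  pdf=$1; from=$2; to=$3; step=${4:-0.02}
  for flex in $(flex_steps "$from" "$to" "$step"); do
    "${PDF_EXTRACT:-pdf-extract}" extract --references \
      --set "reference_flex:$flex" "$pdf" > "${pdf%.pdf}.flex-$flex.out"
  done
}
```

For example, `sweep_reference_flex myfile.pdf 0.20 0.10` steps downward from the default for the first problem, and `sweep_reference_flex myfile.pdf 0.20 0.40 0.05` steps upward for the second. The output files can then be compared by hand to find the value that gives the cleanest reference list.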

pdf-extract is still producing weird output after fiddling with reference_flex.

Have a look at pdf-extract's settings:

$ pdf-extract settings

This command will produce a list of settings along with descriptions of what they affect. They can be set by passing a --set key:value argument to pdf-extract.
