  • Stars: 215
  • Rank: 183,925 (Top 4 %)
  • Language: Python
  • License: MIT License
  • Created almost 13 years ago
  • Updated almost 5 years ago

Repository Details

A more complete example of programming with PDFMiner, which continues where the default documentation stops.

PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/index.html)
is a PDF parsing library written in Python by Yusuke Shinyama.

In addition to the pdf2txt.py and dumppdf.py command line tools, there
is a way of analyzing the content tree of each page programmatically.

This is a more complete example of programming with
PDFMiner, which continues where the default documentation
(http://www.unixuser.org/~euske/python/pdfminer/programming.html#layout)
stops.

This code is still a work-in-progress, with room for improvement.

Usage: import layout_scanner, then call get_toc() for the table of
contents as a list, and get_pages() for the full text as a list of
per-page strings.

Here are some examples using the Python shell:

>>> import layout_scanner
>>> toc=layout_scanner.get_toc('/path/to/your/pdf-file.pdf')
>>> len(toc)
  ... should return the number of elements in the pdf document's table
  of contents (or 0 if there is no TOC)
>>> toc[0]
  ... a tuple containing the ordinal sequence and the title string,
  for example:
(1, u'Introduction')
>>> pages=layout_scanner.get_pages('/path/to/your/pdf-file.pdf')
>>> len(pages)
  ... should return the number of pages in the pdf document
>>> pages[0]
  ... a string of all the text on the first page
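
As a small illustration of working with these return values, the sketch
below formats a TOC list of (ordinal, title) tuples as plain-text lines and
joins the per-page strings into one text. The helper names format_toc and
join_pages, and the form-feed separator, are hypothetical and not part of
layout_scanner; the sample data only stands in for real get_toc() and
get_pages() results.

```python
def format_toc(toc):
    """Render (ordinal, title) tuples as one 'N. Title' line each."""
    return ["%s. %s" % (ordinal, title) for ordinal, title in toc]


def join_pages(pages, separator="\f"):
    """Join per-page strings into one text, separated by a form feed."""
    return separator.join(page.strip() for page in pages)


# Sample data standing in for get_toc() / get_pages() results (assumption).
sample_toc = [(1, "Introduction"), (2, "Methods")]
sample_pages = ["First page text\n", "Second page text\n"]

toc_lines = format_toc(sample_toc)
full_text = join_pages(sample_pages)
print("\n".join(toc_lines))
```

With a real document, the two sample lists would be replaced by the values
returned from layout_scanner.get_toc() and layout_scanner.get_pages().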

Room for Improvement

 * Column Merging - while the fuzzy heuristic I described works well for
 the pdf files I've parsed so far, I can imagine more complex documents
 where it would break down (perhaps this is where the analysis should be
 more sophisticated, and not ignore so many types of pdfminer.layout.LT*
 objects).

 * Image Extraction - I'd like this to be at least as good as
 pdftoimages, saving every image in its default ppm or pnm format, but
 I'm not sure what I should be doing differently.

 * Title and Heading Capitalization - this seems to be an issue with
 PDFMiner itself, since I get similar results using the command line
 tools, but it is annoying to have to go back and fix all the
 mis-capitalizations manually, particularly for larger documents.

 * Title and Heading Fonts and Spacing - a related issue, though probably
 something in my own code, is that those same title and paragraph headings
 aren't distinguished from the rest of the text. In many cases, I have to
 go back and add vertical spacing and font attributes for those manually.

 * Page Number Removal - originally, I thought I could just use a regex
 for an all-numeric value on a single physical line, but each document
 does page numbering slightly differently, and it's very difficult to
 get rid of these without manually proofreading each page.

 * Footnotes - handling these where the note and the reference both appear
 on the same page is hard enough, but doing it when they span different
 (even consecutive) pages is worse.
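
The page number heuristic mentioned above (a regex for an all-numeric value
on a single physical line) can be sketched as follows. This only illustrates
that naive approach and is not code from layout_scanner; the function name
strip_numeric_lines is hypothetical. As the note says, it is unreliable: it
also drops legitimate numeric-only lines and misses styles such as "Page 18".

```python
import re

# A line consisting solely of digits, with optional surrounding whitespace.
NUMERIC_LINE = re.compile(r"^\s*\d+\s*$")


def strip_numeric_lines(page_text):
    """Drop every line of page_text that contains only digits."""
    kept = [line for line in page_text.split("\n")
            if not NUMERIC_LINE.match(line)]
    return "\n".join(kept)


# Hypothetical page text: the bare "17" is removed, "Page 18" is not.
sample = "Chapter 1\nSome body text with 42 in it.\n17\nPage 18"
cleaned = strip_numeric_lines(sample)
print(cleaned)
```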
