There are no reviews yet. Be the first to send feedback to the community and the maintainers!
PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/index.html) is a pdf parsing library written in Python by Yusuke Shinyama. In addition to the pdf2txt.py and dumppdf.py command line tools, there is a way of analyzing the content tree of each page programmatically. This is a more complete example of programming with PDFMiner, which continues where the default documentation (http://www.unixuser.org/~euske/python/pdfminer/programming.html#layout) stops. This code is still a work-in-progress, with room for improvement. Usage: import layout_scanner and call get_toc() for a list of the table of contents, and get_pages() for the full text. Here are some examples using the Python shell: >>> import layout_scanner >>> toc=layout_scanner.get_toc('/path/to/your/pdf-file.pdf') >>> len(toc) ... should return the number of elements in the pdf document's table of contents (or 0 if there is no TOC) >>> toc[0] ... a tuple containing the ordinal sequence and the title string, for example: (1, u'Introduction') >>> pages=layout_scanner.get_pages('/path/to/your/pdf-file.pdf') >>> len(pages) ... should return the number of pages in the pdf document >>> pages[0] ... a string of all the text on the first page Room for Improvement * Column Merging - while the fuzzy heuristic I described works well for the pdf files I've parsed so far, I can imagine more complex documents where it would break-down (perhaps this is where the analysis should be more sophisticated, and not ignore so many types of pdfminer.layout.LT* objects). * Image Extraction - I'd like to be able to be at least as good as pdftoimages, and save every file in ppm or pnm default format, but I'm not sure what I could be doing differently * Title and Heading Capitalization - this seems to be an issue with PDFMiner, since I get similar results in using the command line tools, but it is annoying to have to go back and fix all the mis-capitalizations manually, particularly for larger documents. * Title and Heading Fonts and Spacing - a related issue, though probably something in my own code, is that those same title and paragraph headings aren't distinguished from the rest of the text. In many cases, I have to go back and add vertical spacing and font attributes for those manually. * Page Number Removal - originally, I thought I could just use a regex for an all-numeric value on a single physical line, but each document does page numbering slightly differently, and it's very difficult to get rid of these without manually proofreading each page. * Footnotes - handling these where the note and the reference both appear on the same page is hard enough, but doing it when they span different (even consecutive) pages is worse.
simple-graph
This is a simple graph database in SQLite, inspired by "SQLite as a document database"tweet-secret
This is a text steganography application optimized for use on Twitter, written in Clojure.go-recaptcha
A package for handling reCaptcha (http://www.google.com/recaptcha) form submissions in Go (http://golang.org/).go-statemachine
An implementation of a finite state machine in Gorecipebook
This is a simple application for scraping and parsing food recipe data found on the web in hRecipe format, producing results in jsonrecipes
A collection of cooking recipes in json formatgo-api
This package provides a framework for creating HTTP servers in Go (http://golang.org/) to handle API requests capable of replying in xml, json, or any other valid content type.buckabuckaboo
An unobtrusive, cross-browser javascript plugin for tracking mouse movement on web pagesCleanScrape
A no-nonsense web scraping tool which removes the crap and preserves the content in epub and pdf formats.cmdline-news
This is a simple command-line based rss reader which is great for browsing your favorite sites unobtrusively, without having to open a browser window.simple-graph-pypi
This is the meta repository for packaging the simple-graph implementation in python for PyPI distributiongo-one-password
A password generator for website logins based on a single, private passphrase. This is a self-contained, statically compiled application which runs on the command line or as a simple gui, and does not require an internet connection.simple-graph-go
This is the Go implementation of simple-graph (https://github.com/dpapathanasiou/simple-graph)intelligent-smtp-responder
This is an intelligent email-based agent servergo-tree-notation
This is a Tree Notation library implemented in GoMyTeX
My LaTeX templates for personal correspondence and other documentszen-thought
This is a zen thought-of-the-day aphorism application based on the daily paper calendar by JoTaiga, and modeled after the old unix fortune application.nihongo-benkyou
This is Japanese language study: various notes and translations in easy to parse formatspython-recaptcha
This code handles reCaptcha form submissions in Pythonconcept-catalog
This is a proof-of-concept in defining a catalog of software concepts as described in "The Essence of Software (EOS)"ARMS
Another RESTful Mongo ServicepyBDB
pyBDB is a series of helper functions for using Berkeley DB (BDB) in python, on top of bsddb3, with support for secondary indices alongside basic key/value functionsalgorithms-unlocked-haskell
An implementation of the algorithms in "Algorithms Unlocked", in the Haskell programming language, as a learning exercise.sedgewick-algorithms-racket
Implementations of the algorithms defined the first edition of "Algorithms in C++" by Robert Sedgewick but in racketLove Open Source and this site? Check out how you can help us