  • Stars: 105
  • Rank 326,358 (Top 7 %)
  • Language: Python
  • License: MIT License
  • Created over 12 years ago
  • Updated over 2 years ago

Repository Details

Python library for extracting text from various file formats (for indexing).

A SmartFile Open Source project.

Introduction

Fulltext extracts text from various document formats. It can be used as the first stage of search indexing, document analysis, and similar tasks.

Fulltext differs from other libraries in that it tries to use file data in the form it is given. For most backends, a file-like object or path can be handled directly, removing the need to write temporary files.

Fulltext uses native Python libraries when possible and falls back to third-party Python libraries and CLI tools when necessary. The CLI tools it uses include, among others:

  • antiword - Legacy .doc (Word) format.
  • unrtf - .rtf format.
  • pdftotext (apt install poppler-utils) - .pdf format.
  • pstotext (apt install pstotext) - .ps format.
  • tesseract-ocr - image formats (OCR).
  • abiword - office documents.

Supported formats

An empty Windows column means the format is not supported on Windows.

Extension  Linux                                 Windows
bin        python stdlib                         python stdlib
bmp        tesseract CLI and pytesseract module
csv        python csv module                     python csv module
doc        antiword CLI tool
docx       docx2txt module                       docx2txt module
eml        email module                          email module
epub       ebooklib module                       ebooklib module
gif        tesseract CLI and pytesseract module
gz         python gzip module                    python gzip module
html       BeautifulSoup module                  BeautifulSoup module
hwp        pyhwp module as CLI tool
jpg        tesseract CLI and pytesseract module
json       json module                           json module
mbox       mailbox module                        mailbox module
msg        msg-extractor module
ods        lxml, zipfile modules                 lxml, zipfile modules
odt        lxml, zipfile modules                 lxml, zipfile modules
pdf        pdftotext CLI tool                    pdftotext CLI tool
png        tesseract CLI and pytesseract module
pptx       pptx module
ps         pstotext CLI tool
psv        python csv module                     python csv module
rar        rarfile module                        rarfile module
rtf        unrtf CLI tool                        unrtf CLI tool
text       python stdlib                         python stdlib
tsv        python csv module                     python csv module
xls        xlrd module                           xlrd module
xlsx       xlrd module                           xlrd module
xml        lxml module                           lxml module
zip        zipfile module                        zipfile module

Supported title formats

Besides extracting text, Fulltext can determine the title for certain file extensions:

Extension  Linux                 Windows
doc        exiftool CLI tool
docx       exiftool CLI tool     exiftool CLI tool
epub       exiftool CLI tool
html       BeautifulSoup module  BeautifulSoup module
odt        exiftool CLI tool     exiftool CLI tool
pdf        pdfinfo CLI tool
pptx       pdfinfo CLI tool
ps         exiftool CLI tool
rtf        exiftool CLI tool
xls        exiftool CLI tool     exiftool CLI tool
xlsx       exiftool CLI tool     exiftool CLI tool

Installing tools

Fulltext uses a number of pure Python libraries. It also uses command line tools such as antiword, pdftotext and unrtf. You can install the required libraries and CLI tools with your package manager.

$ sudo yum install antiword abiword unrtf poppler-utils libjpeg-dev \
tesseract-ocr pstotext

Or for debian-based systems:

$ sudo apt-get install antiword abiword unrtf poppler-utils libjpeg-dev \
pstotext
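After installing, it can be handy to confirm which tools are actually on the PATH. The following is a minimal sketch using only the standard library. The helper name missing_tools and the list of executable names are assumptions made for illustration (for example, the tesseract-ocr package is assumed to install an executable called tesseract); adjust them for your platform.

```python
import shutil

# Executable names are assumptions for illustration; packages may install
# them under different names on your platform.
DEFAULT_TOOLS = (
    "antiword",
    "unrtf",
    "pdftotext",
    "pstotext",
    "tesseract",
    "abiword",
)


def missing_tools(tools=DEFAULT_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed CLI tools were found.")
```

An empty list means every listed tool was found; any names returned point at packages that may still need installing.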

Usage

Fulltext uses a simple dictionary-style interface. A single public function, fulltext.get(), is provided. It takes an optional default parameter which, when supplied, suppresses errors and returns that default if text could not be extracted.

>>> import fulltext
>>>
>>> print(fulltext.get('does-not-exist.pdf', None))
None
>>> fulltext.get('exists.pdf', None)
'Lorem ipsum...'
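The default parameter is convenient when extracting text from many files at once, since one unreadable file need not abort the whole run. Below is a rough sketch of that pattern. The helper extract_tree is hypothetical and not part of Fulltext; it takes the extractor as a callable so the walking logic stays independent of the library.

```python
import os


def extract_tree(root, extract):
    """Walk `root` and map each file path to its extracted text.

    `extract` is any callable that takes a path and returns text, or
    None when nothing could be extracted.
    """
    texts = {}
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            text = extract(path)
            if text is None:
                # The extractor returned its default: record and move on.
                skipped.append(path)
            else:
                texts[path] = text
    return texts, skipped
```

With Fulltext installed, the extractor could be passed as lambda path: fulltext.get(path, None), so that failures come back as None instead of raising.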

You can pass a file-like object or a path to .get(). Fulltext will try to do the right thing, using memory buffers or temp files depending on the backend.

You should pass any file details you have available, such as the file name or mime type. These will help fulltext select the correct backend. If you want to specify the backend explicitly, use the backend keyword argument.

>>> with open('foo.pdf', 'rb') as f:
...     fulltext.get(f, name='foo.pdf', mime='application/pdf',
...                  backend='pdf')
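When only a file name is known (an upload, for instance), the mime hint can often be derived from it. This is a small sketch using the standard mimetypes module. The helper hints_for is hypothetical; it only builds the name and mime keyword arguments shown above and does not touch Fulltext itself.

```python
import mimetypes
import os


def hints_for(filename):
    """Build `name` and `mime` keyword arguments from a file name."""
    hints = {"name": os.path.basename(filename)}
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is not None:
        # Only include a mime type when the standard library can guess one.
        hints["mime"] = mime
    return hints
```

The result could then be unpacked into the call, for example fulltext.get(f, **hints_for('foo.pdf')).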

Some backends accept additional parameters. You can pass these using the kwargs keyword argument.

>>> fulltext.get('foo.pdf', kwargs={'option': 'value'})

You can also get the title for certain file formats:

>>> fulltext.get_with_title('foo.pdf')
('file content', 'file title')

You can specify the encoding to use (the default is sys.getfilesystemencoding() with the strict error handler):

>>> fulltext.get('foo.pdf', encoding='latin1', encoding_errors='ignore')
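To see what these two settings mean in practice, here is a sketch using plain Python byte decoding. It assumes the encoding and encoding_errors values behave like the two arguments of bytes.decode(); the helper decode_bytes is hypothetical and only illustrates the difference between the strict and ignore error handlers.

```python
def decode_bytes(raw, encoding, encoding_errors="strict"):
    """Decode `raw` bytes using the given encoding and error handler."""
    return raw.decode(encoding, encoding_errors)


# "café" encoded as latin1: the last byte is not valid UTF-8 on its own.
RAW = b"caf\xe9"

# latin1 decodes every byte, so this succeeds under the strict handler.
AS_LATIN1 = decode_bytes(RAW, "latin1")

# With the wrong encoding, "ignore" silently drops the undecodable byte.
AS_UTF8_IGNORED = decode_bytes(RAW, "utf-8", "ignore")

# With the wrong encoding, "strict" raises UnicodeDecodeError instead.
try:
    decode_bytes(RAW, "utf-8", "strict")
    STRICT_FAILED = False
except UnicodeDecodeError:
    STRICT_FAILED = True
```

The ignore handler trades correctness for robustness: extraction keeps going, but characters can be lost without any warning.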

Custom backends

To write a new backend, you need to do two things. First, create a Python module containing a Backend class that implements the interface Fulltext expects. Second, register the new backend with fulltext.

import fulltext
from fulltext.util import BaseBackend


fulltext.register_backend(
    'application/x-rar-compressed',
    'path.to.this.module',
    ['.rar'])


class Backend(BaseBackend):

    def check(self, title):
        # This is invoked before `handle_` functions. In here you can
        # import third party deps or raise an exception if a CLI tool
        # is missing. Both conditions will be turned into a warning
        # on `get()` and bin backend will be used as fallback.
        pass

    def setup(self):
        # This is called before `handle_` functions.
        pass

    def teardown(self):
        # This is called after `handle_` functions, also in case of error.
        pass

    def handle_fobj(self, f, **kwargs):
        # Extract text from a file-like object. This should be defined when
        # possible.

        # These instance attributes hold the values passed to the `get()`
        # function:
        #   self.mime
        #   self.encoding
        #   self.encoding_errors
        #   self.kwargs
        pass

    def handle_path(self, path, **kwargs):
        # Extract text from a path. This should only be defined if it can be
        # done more efficiently than having Python open() and read() the file,
        # passing it to handle_fobj().
        pass

    def handle_title(self, file_or_path):
        # Extract title
        pass

If you only implement handle_fobj(), Fulltext will open any paths and pass the resulting file objects to that function, so define at least this method if possible. If working with file-like objects is not possible and you only define handle_path(), Fulltext will save any file-like objects to a temporary file and pass its path to that function. Defining both is worthwhile when each can be done efficiently.
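The fallback rules above can be pictured with a small dispatcher. This is an illustrative sketch only, not Fulltext's actual implementation. The function name dispatch is hypothetical, and it assumes a path is given as a str and anything else is a binary file-like object.

```python
import os
import shutil
import tempfile


def dispatch(backend, path_or_file):
    """Call the handler `backend` defines, converting the input if needed."""
    has_fobj = hasattr(backend, "handle_fobj")
    has_path = hasattr(backend, "handle_path")
    if not (has_fobj or has_path):
        raise NotImplementedError("backend defines no handle_ function")

    if isinstance(path_or_file, str):
        if has_path:
            return backend.handle_path(path_or_file)
        # Only handle_fobj exists: open the path and pass the file object.
        with open(path_or_file, "rb") as f:
            return backend.handle_fobj(f)

    if has_fobj:
        return backend.handle_fobj(path_or_file)

    # Only handle_path exists: copy the file object to a temporary file.
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            shutil.copyfileobj(path_or_file, tmp)
        return backend.handle_path(tmp_path)
    finally:
        os.remove(tmp_path)
```

The temporary-file branch is the costly one, which is why handle_fobj() is the method to implement first.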

If you have questions about writing a backend, see the ./backends/ directory for some examples.
