  • Language
    HTML
  • License
    MIT License
  • Created over 11 years ago
  • Updated about 2 years ago


node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

textract

A text extraction node module.


Currently Extracts...

  • HTML, HTM
  • ATOM, RSS
  • Markdown
  • EPUB
  • XML, XSL
  • PDF
  • DOC, DOCX
  • ODT, OTT (experimental, feedback needed!)
  • RTF
  • XLS, XLSX, XLSB, XLSM, XLTX
  • CSV
  • ODS, OTS
  • PPTX, POTX
  • ODP, OTP
  • ODG, OTG
  • PNG, JPG, GIF
  • DXF
  • application/javascript
  • All text/* mime-types.

In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
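As a rough illustration of deciding by mime type rather than by extension, the sketch below checks a mime type against a small, deliberately incomplete subset of the list above. The `SUPPORTED` set and the `looksExtractable` helper are hypothetical names used only for this example; they are not part of textract's API.

```javascript
// Hypothetical helper: NOT part of textract. It only mirrors the rule
// described above, that support is decided by mime type, not extension.
const SUPPORTED = new Set([
  'application/pdf',          // .pdf
  'application/vnd.ms-excel', // .xls and the other types sharing this mime type
  'application/javascript'
]);

function looksExtractable(mimeType) {
  // Drop any parameters such as "; charset=utf-8" before comparing.
  const type = String(mimeType).toLowerCase().split(';')[0].trim();
  // Every text/* mime type is listed as supported.
  return type.startsWith('text/') || SUPPORTED.has(type);
}

console.log(looksExtractable('text/html'));                // true
console.log(looksExtractable('application/vnd.ms-excel')); // true
console.log(looksExtractable('video/mp4'));                // false
```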

Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you are interested in.

Install

npm install textract

Extraction Requirements

Note: if any of the requirements below are missing, textract will still run and extract every file type it can. Missing a requirement does not prevent you from using textract; it only prevents extraction of the file types that depend on it.

  • PDF extraction requires pdftotext to be installed.
  • DOC extraction requires antiword to be installed, unless on OSX, in which case textutil (installed by default) is used.
  • RTF extraction requires unrtf to be installed, unless on OSX, in which case textutil (installed by default) is used.
  • PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high DPI and made up almost entirely of text for tesseract to extract the text accurately.
  • DXF extraction requires drawingtotext to be available.
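To see which of these external tools are present before relying on them, something like the following sketch could be used. It simply looks for each binary name in the directories on `PATH`. The `isOnPath` and `checkTools` helpers are hypothetical and not provided by textract, and the check says nothing about whether an installed tool actually works.

```javascript
const fs = require('fs');
const path = require('path');

// Binaries named in the requirements list above.
const REQUIRED_TOOLS = ['pdftotext', 'antiword', 'unrtf', 'tesseract', 'drawingtotext'];

// Hypothetical helper: true if a file called `name` exists in a PATH directory.
function isOnPath(name) {
  const dirs = (process.env.PATH || '').split(path.delimiter).filter(Boolean);
  // On Windows, executables usually carry an extension such as .exe.
  const exts = process.platform === 'win32' ? ['.exe', '.cmd', '.bat', ''] : [''];
  return dirs.some(dir => exts.some(ext => fs.existsSync(path.join(dir, name + ext))));
}

// Builds a { toolName: boolean } report for every required tool.
function checkTools() {
  const report = {};
  for (const tool of REQUIRED_TOOLS) report[tool] = isOnPath(tool);
  return report;
}

console.log(checkTools());
```

On OSX a missing antiword or unrtf is not a problem, since textutil is used there instead.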

Configuration

Configuration can be passed into textract. The following configuration options are available:

  • preserveLineBreaks: When using the command line this is set to true to preserve stdout readability. When using the library via node this is set to false. Pass this in as true and textract will not strip any line breaks.
  • preserveOnlyMultipleLineBreaks: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default false) is set to true, any single line break is removed but multiple line breaks are preserved. Check your output with this option, though: it does not preserve paragraphs unless they are separated by multiple breaks.
  • exec: Some extractors (dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
  • [ext].exec: Each extractor can take its own exec config. Keep in mind that many extractors are responsible for multiple types, so, for instance, the odt extractor is the one you would configure for odt, odg and the other types it handles. Check the extractors to see which one you want to configure. At the bottom of each is a list of the types for which the extractor is responsible.
  • tesseract.lang: A pass-through to tesseract allowing for setting of language for extraction. ex: { tesseract: { lang:"chi_sim" } }
  • tesseract.cmd: tesseract.lang allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass cmd. cmd is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and psm, you would pass { tesseract: { cmd:"-l chi_sim -psm 10" } }
  • pdftotextOptions: This is a proxy options object to the library textract uses for pdf extraction: pdf-text-extract. Options include ownerPassword, userPassword if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract layout default so that, instead of layout: layout, it uses layout:raw. It is not suggested you modify this without understanding what trouble that might get you in. See this GH issue for why textract overrides that library's default.
  • typeOverride: Used with fromUrl, if set, rather than using the content-type from the URL request, will use the provided typeOverride.
  • includeAltText: When extracting HTML, whether or not to include alt text with the extracted text. By default this is false.
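A minimal sketch of a config object using the option names above follows; the values are illustrative only. The `collapseSingleLineBreaks` function is a hypothetical illustration of the effect described for preserveOnlyMultipleLineBreaks. It is not textract's implementation, and it assumes a removed single break is replaced by a space.

```javascript
// Option names come from the list above; the values are examples only.
const config = {
  preserveLineBreaks: true,
  exec: { maxBuffer: 500000 },   // passed through to node's exec
  tesseract: { lang: 'deu' },    // language for image extraction
  pdftotextOptions: { userPassword: 'example-password' }, // placeholder value
  includeAltText: true
};

// Illustration only, NOT textract's code: single line breaks are collapsed,
// runs of two or more line breaks are left untouched.
function collapseSingleLineBreaks(text) {
  return text.replace(/([^\n])\n(?!\n)/g, '$1 ');
}

console.log(JSON.stringify(collapseSingleLineBreaks('one\ntwo\n\nthree')));
// prints "one two\n\nthree"
```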

To use this configuration at the command line, prefix each option with --.

Ex: textract image.png --tesseract.lang=deu

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console for a file on the file system.

$ textract pathToFile

Flags

Configuration flags can be passed into textract via the command line.

textract pathToFile --preserveLineBreaks false

Parameters like exec.maxBuffer can be passed as you'd expect.

textract pathToFile --exec.maxBuffer 500000

And multiple flags can be used together.

textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
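The dotted flag names line up with the nested config object used from node. The sketch below shows that correspondence with a hypothetical `setDotted` helper; it is not textract's command-line parser, only an illustration of how a name like exec.maxBuffer maps onto nested keys.

```javascript
// Hypothetical helper, not textract's CLI parser: writes `value` into
// `target` at the nested location named by a dotted key.
function setDotted(target, dottedKey, value) {
  const keys = dottedKey.split('.');
  let node = target;
  // Walk (and create) every intermediate object except the last key.
  for (const key of keys.slice(0, -1)) {
    if (typeof node[key] !== 'object' || node[key] === null) node[key] = {};
    node = node[key];
  }
  node[keys[keys.length - 1]] = value;
  return target;
}

const config = {};
setDotted(config, 'preserveLineBreaks', false);
setDotted(config, 'exec.maxBuffer', 500000);
setDotted(config, 'tesseract.lang', 'deu');

console.log(JSON.stringify(config));
// prints {"preserveLineBreaks":false,"exec":{"maxBuffer":500000},"tesseract":{"lang":"deu"}}
```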

Node

Import

var textract = require('textract');

APIs

There are several ways to extract text. For all methods, the extracted text and an error object are passed to a callback.

error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be set on the error object.
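A callback might distinguish an unsupported type from other failures along these lines. The `summarize` helper is hypothetical; only the `error` and `text` arguments and the `typeNotFound` flag come from the description above, and the sketch assumes the error carries its text in a `message` property, falling back to the value itself otherwise.

```javascript
// Hypothetical callback helper: turns the (error, text) pair into a short
// status string.
function summarize(error, text) {
  if (error) {
    // Assumption: the informative text is in error.message.
    const reason = String(error.message || error);
    if (error.typeNotFound) {
      return 'unsupported type: ' + reason;
    }
    return 'extraction failed: ' + reason;
  }
  return 'extracted ' + text.length + ' characters';
}

// Usage with textract (requires `npm install textract`):
// const textract = require('textract');
// textract.fromFileWithPath('./example.pdf', function (error, text) {
//   console.log(summarize(error, text));
// });
```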

File

textract.fromFileWithPath(filePath, function( error, text ) {})
textract.fromFileWithPath(filePath, config, function( error, text ) {})

File + mime type

textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})
textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})

Buffer + mime type

textract.fromBufferWithMime(type, buffer, function( error, text ) {})
textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})

Buffer + file name/path

textract.fromBufferWithName(name, buffer, function( error, text ) {})
textract.fromBufferWithName(name, buffer, config, function( error, text ) {})

URL

When passing a URL, the URL can be either a string or a node.js URL object. Using the URL object allows fine-grained control over the URL being used.

textract.fromUrl(url, function( error, text ) {})
textract.fromUrl(url, config, function( error, text ) {})
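If promises are preferred over callbacks, the callback form can be wrapped by hand. The sketch below is one possible wrapper, not something textract ships: `extractFromUrl` is a hypothetical name, and the textract module is passed in as an argument so the helper can also be exercised with a stub.

```javascript
// Hypothetical wrapper: adapts the callback-style fromUrl call shown above
// to a Promise. `textract` is whatever require('textract') returned.
function extractFromUrl(textract, url, config) {
  return new Promise(function (resolve, reject) {
    textract.fromUrl(url, config || {}, function (error, text) {
      if (error) reject(error);
      else resolve(text);
    });
  });
}

// Usage (requires `npm install textract` and network access):
// const textract = require('textract');
// extractFromUrl(textract, 'https://example.com/', { typeOverride: 'text/html' })
//   .then(console.log, console.error);
```

The commented usage passes typeOverride so that the provided type is used instead of the content-type returned by the request.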

Testing Notes

Running Tests on a Mac?

  • sudo port install tesseract-chi-sim
  • sudo port install tesseract-eng
  • You will also want to disable textract's usage of textutil, as the tests are based on output from antiword.
    • Go into /lib/extractors/{doc|doc-osx|rtf} and modify the code under if ( os.platform() === 'darwin' ) {. Uncomment the commented lines in these sections.
