  • Stars: 1,370
  • Rank: 34,321 (Top 0.7%)
  • Language: Go
  • License: MIT License
  • Created about 11 years ago
  • Updated about 1 year ago

Repository Details

Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Note for returning users: the Go import path for this package changed to code.sajari.com/docconv.

Installation

If you haven't set up Go before, you first need to install Go.

To fetch and build the code:

$ go install code.sajari.com/docconv/docd@latest

See go help install for where the docd executable is placed. Make sure that the full path to the executable is in your PATH environment variable.

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example install of dependencies (not all systems):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...

This may complain on macOS, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool runs as either:

  1. a service on port 8888 (by default)

    Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.

  2. a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.

    Optionally you can build it yourself:

    cd docd
    docker build -t docd .
    
  3. via the command line.

    Documents can be sent as an argument, e.g.

    $ docd -input document.pdf
    

Optional flags

  • addr - the bind address for the HTTP server, default is ":8888"
  • log-level
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • readability-length-low - sets the readability length low if the ?readability=1 parameter is set
  • readability-length-high - sets the readability length high if the ?readability=1 parameter is set
  • readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
  • readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
  • readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
  • readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
  • readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set
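
The readability flags above only apply to requests that set the ?readability=1 parameter. As a rough sketch, a client could build such a request URL as follows. The convertURL helper is hypothetical, and the /convert path and the default address are taken from the curl example in this README.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// convertURL returns the convert endpoint of a docd service running at
// base, optionally adding the readability=1 query parameter.
// This helper is illustrative and not part of docconv.
func convertURL(base string, readability bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/convert"
	if readability {
		q := u.Query()
		q.Set("readability", "1")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	u, err := convertURL("http://localhost:8888", true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(u) // http://localhost:8888/convert?readability=1
}
```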

How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0

$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Alternatively, via curl (the @ prefix tells curl to upload the file's contents rather than the literal string):

curl -s -F input=@your-file.pdf http://localhost:8888/convert
