  • Stars
    435
  • Rank 99,233 (Top 2 %)
  • Language
    Go
  • License
    MIT License
  • Created about 9 years ago
  • Updated 6 months ago

A multilingual command line sentence tokenizer in Golang

Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Features

  • Supports multiple languages (English, Czech, Dutch, Estonian, Finnish, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, and Turkish)
  • Zero dependencies
  • Extendable
  • Fast

Install

arch

aur

mac

brew tap neurosnap/sentences
brew install sentences

other

Pre-built binaries are also available on the GitHub releases page.

using golang

go get github.com/neurosnap/sentences
go install github.com/neurosnap/sentences/cmd/sentences

Get it

go get github.com/neurosnap/sentences

Use it

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/neurosnap/sentences"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // download the training data from this repo (./data) and save it somewhere
    b, err := os.ReadFile("./path/to/english.json")
    if err != nil {
        log.Fatal(err)
    }

    // load the training data
    training, err := sentences.LoadTraining(b)
    if err != nil {
        log.Fatal(err)
    }

    // create the default sentence tokenizer
    tokenizer := sentences.NewSentenceTokenizer(training)

    // named "sents" so the variable does not shadow the sentences package
    sents := tokenizer.Tokenize(text)

    for _, s := range sents {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some problems I noticed with English.

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}

Contributing

I need help maintaining this library. If you are interested in contributing, please start with the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.

Pick a failing test, then open an issue or submit a PR for it.

I'm happy to help anyone willing to contribute.

Customize

sentences was built around composability; most major components of this package can be extended.

Eager to make ad-hoc changes but don't know how to start? Have a look at github.com/neurosnap/sentences/english for a solid example.

Notice

I have not tested this tokenizer in any language other than English. By default, the command line utility loads English. I welcome anyone willing to test the other languages and submit updates as needed.

A primary goal for this package is to be multilingual, so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

sentences is an unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.
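
To make the idea of learning abbreviation likelihoods from raw text concrete, here is a minimal sketch that uses only the standard library. It is not the Punkt algorithm and not this package's API: the function name periodRatio and the plain ratio it computes are assumptions chosen for illustration, whereas Punkt uses a more careful likelihood measure. The sketch counts how often each word type appears with a trailing period.

```go
package main

import (
	"fmt"
	"strings"
)

// periodRatio is a hypothetical helper (not part of the sentences package).
// For each lowercased word type it returns the fraction of occurrences
// that are immediately followed by a period.
func periodRatio(text string) map[string]float64 {
	withPeriod := map[string]int{}
	total := map[string]int{}
	for _, tok := range strings.Fields(text) {
		// Strip trailing punctuation to get the bare word type.
		word := strings.ToLower(strings.TrimRight(tok, ".,;:!?"))
		if word == "" {
			continue
		}
		total[word]++
		if strings.HasSuffix(tok, ".") {
			withPeriod[word]++
		}
	}
	ratios := make(map[string]float64, len(total))
	for w, n := range total {
		ratios[w] = float64(withPeriod[w]) / float64(n)
	}
	return ratios
}

func main() {
	corpus := "Dr. Smith met Dr. Jones. The doctor left. A doctor stayed behind."
	r := periodRatio(corpus)
	fmt.Printf("dr=%.2f doctor=%.2f\n", r["dr"], r["doctor"])
}
```

In this tiny corpus "dr" is always followed by a period while "doctor" never is. Words that merely happen to end a sentence (such as "jones" here) also score high, which is why a raw ratio alone is not enough and a real system needs more evidence.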

There are many problems that arise when tokenizing text into sentences, the primary issue being abbreviations. The punkt system attempts to determine whether a word is an abbreviation, an end to a sentence, or even both through training the system with text in the given language. The punkt system incorporates both token- and type-based analysis on the text through two different phases of annotation.
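
The abbreviation problem can be shown with a toy splitter. This is a sketch under stated assumptions, not how this package works internally: splitSentences and the hand-written abbreviation set are hypothetical, and a trained Punkt model would supply that knowledge instead.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences is a hypothetical toy splitter (not the sentences API).
// It ends a sentence after any token that ends in '.', '!' or '?', unless
// the token, lowercased and without its final period, is in abbrevs.
func splitSentences(text string, abbrevs map[string]bool) []string {
	var out, cur []string
	for _, tok := range strings.Fields(text) {
		cur = append(cur, tok)
		ends := strings.HasSuffix(tok, ".") ||
			strings.HasSuffix(tok, "!") ||
			strings.HasSuffix(tok, "?")
		if !ends {
			continue
		}
		if abbrevs[strings.ToLower(strings.TrimSuffix(tok, "."))] {
			continue // known abbreviation: do not break here
		}
		out = append(out, strings.Join(cur, " "))
		cur = nil
	}
	if len(cur) > 0 {
		out = append(out, strings.Join(cur, " "))
	}
	return out
}

func main() {
	text := "Former U.S. Rep. Carolyn Cheeks Kilpatrick filed. Stallings won."

	naive := splitSentences(text, nil)
	aware := splitSentences(text, map[string]bool{"u.s": true, "rep": true})

	fmt.Println(len(naive), "sentences without abbreviation knowledge")
	fmt.Println(len(aware), "sentences with it")
	for _, s := range aware {
		fmt.Println(s)
	}
}
```

Without the abbreviation set the text is cut after "U.S." and "Rep."; with it, the two real sentences come out intact. The toy version never breaks after a listed abbreviation, so it cannot handle the "both" case mentioned above, where an abbreviation also ends a sentence.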

Unsupervised multilingual sentence boundary detection

Performance

Using the Brown Corpus, which is annotated American English text, we compare this package with other libraries across multiple programming languages.

Library     Avg Speed (s, 10 runs)   Accuracy (%)
Sentences   1.96                     98.95
NLTK        5.22                     99.21
