  • Stars: 151
  • Rank: 246,057 (Top 5 %)
  • Language: Go
  • License: MIT License
  • Created over 7 years ago
  • Updated over 1 year ago

lingo

package lingo provides the data structures and algorithms required for natural language processing.

Specifically, it provides a POS tagger (lingo/pos), a dependency parser (lingo/dep), and a basic tokenizer (lingo/lexer) for English. It also provides data structures for holding corpora (lingo/corpus) and treebanks (lingo/treebank).

The aim of this package is to provide a production-quality pipeline for natural language processing.

Install

The package is go-gettable: go get -u github.com/chewxy/lingo

This package and its subpackages depend on very few external packages. Here they are:

| Package | Used For | Vitality | Notes | Licence |
|---|---|---|---|---|
| gorgonia | Machine learning | Vital. It won't be hard to rewrite them, but why? | Same author | Gorgonia Licence (Apache 2.0-like) |
| gographviz | Visualization of annotations, and other graph-related visualizations | Vital for visualizations, which are a nice-to-have feature | API last changed 12th April 2017 | gographviz licence (Apache 2.0) |
| errors | Errors | The package won't die without it, but it's very nice to have | Stable API for the past year | errors licence (MIT/BSD-like) |
| set | Set operations | Can be easily replaced | Stable API for the past year | set licence (MIT/BSD-like) |

Usage

See the individual packages for usage. There are also a number of executables in the cmd directory. They are meant as examples of how a natural language processing pipeline can be set up.

A natural language pipeline built with this package is heavily channel-driven. Here is an example for dependency parsing:

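package main

import (
	"log"
	"strings"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/lexer"
	"github.com/chewxy/lingo/pos"
)

// posModel and depModel are assumed to be trained models loaded elsewhere;
// loading them is not shown here.

func main() {
	inputString := `The cat sat on the mat`
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
	pt := pos.New(pos.WithModel(posModel))                   // POS tagger - required to tag the words with a part of speech tag.
	dp := dep.New(depModel)                                  // creates a new dependency parser

	// set up a pipeline: each stage reads from the previous stage's output channel
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all stages concurrently
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive either a parse or an error
	for {
		select {
		case d := <-dp.Output:
			log.Printf("parse: %v", d) // do something with the parse
		case err := <-dp.Error:
			log.Fatal(err) // handle error
		}
	}
}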
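
To make the channel wiring concrete, here is a minimal, standard-library-only sketch of the same pattern. The stage type, newStage, and runPipeline are hypothetical names invented for this illustration; they are not part of lingo, and lingo's real stages may differ (for example, in whether they close their output channels).

```go
package main

import (
	"fmt"
	"strings"
)

// stage is a hypothetical pipeline step: it reads from Input, applies fn,
// and writes to Output. It is not a lingo type.
type stage struct {
	Input  chan string
	Output chan string
	fn     func(string) string
}

func newStage(fn func(string) string) *stage {
	return &stage{Output: make(chan string), fn: fn}
}

// Run processes Input until it is closed, then closes Output so that
// downstream stages (and the final consumer) know when to stop.
func (s *stage) Run() {
	defer close(s.Output)
	for w := range s.Input {
		s.Output <- s.fn(w)
	}
}

// runPipeline feeds words through two chained stages and collects the results.
func runPipeline(words []string) []string {
	src := make(chan string)
	upper := newStage(strings.ToUpper)
	bang := newStage(func(s string) string { return s + "!" })

	// Wire the stages together: each Input is the previous stage's Output.
	upper.Input = src
	bang.Input = upper.Output

	go upper.Run()
	go bang.Run()
	go func() {
		defer close(src)
		for _, w := range words {
			src <- w
		}
	}()

	var out []string
	for w := range bang.Output {
		out = append(out, w)
	}
	return out
}

func main() {
	fmt.Println(runPipeline(strings.Fields("The cat sat on the mat")))
}
```

In this sketch, closing each Output when the stage finishes lets the consumer use a plain range loop instead of an endless select.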

How It Works

For specific tasks (POS tagging, parsing, named entity recognition, etc.), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages use.

Perhaps the most important data structure is the *Annotation structure. It holds a word and the metadata associated with that word.

For dependency parses, the graph takes three forms: *Dependency, *DependencyTree and *Annotation. All three forms are convertible from one to another. TODO: explain rationale behind each data type.
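
As a rough illustration of why a parse can move between forms, the sketch below converts a flat "each word points at its head" representation into a tree and back. The token and node types, and the toTree and toFlat functions, are hypothetical and deliberately simplified; they are not lingo's *Annotation, *Dependency or *DependencyTree types.

```go
package main

import "fmt"

// token is a hypothetical flat form: each word records the index of its head.
type token struct {
	Word string
	Head int    // index of the governing word; -1 marks the root
	Rel  string // relation label to the head
}

// node is a hypothetical tree form of the same parse.
type node struct {
	Index    int
	Word     string
	Rel      string
	Children []*node
}

// toTree builds the tree form from the flat form and returns the root.
func toTree(toks []token) *node {
	nodes := make([]*node, len(toks))
	for i, t := range toks {
		nodes[i] = &node{Index: i, Word: t.Word, Rel: t.Rel}
	}
	var root *node
	for i, t := range toks {
		if t.Head < 0 {
			root = nodes[i]
			continue
		}
		nodes[t.Head].Children = append(nodes[t.Head].Children, nodes[i])
	}
	return root
}

// toFlat walks the tree and rebuilds the flat form for a sentence of n words.
func toFlat(root *node, n int) []token {
	toks := make([]token, n)
	var walk func(nd *node, head int)
	walk = func(nd *node, head int) {
		toks[nd.Index] = token{Word: nd.Word, Head: head, Rel: nd.Rel}
		for _, c := range nd.Children {
			walk(c, nd.Index)
		}
	}
	walk(root, -1)
	return toks
}

func main() {
	flat := []token{{"The", 1, "det"}, {"cat", 2, "nsubj"}, {"sat", -1, "root"}}
	tree := toTree(flat)
	fmt.Println("root:", tree.Word)
	fmt.Println("round trip:", toFlat(tree, len(flat)))
}
```

Both forms carry the same information; the flat form is convenient for per-word processing, while the tree form is convenient for traversal.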

Quirks

Very Oddly Specific POS Tags and Dependency Rel Types

A particular quirk you may have noticed is that POSTag and DependencyType are hard-coded as constants. This package does in fact provide two variations of each: one from Stanford/Penn Treebank and one from Universal Dependencies.

These are hard-coded mainly for performance: knowing ahead of time how much to allocate saves the program a lot of additional work. It also reduces the chances of mutating a global variable.

Of course, this comes with a tradeoff: programs are limited to these two options. Thankfully, there are only a limited number of POS tag and dependency relation sets, and two of the most popular ones (Stanford/PTB and Universal Dependencies) have been implemented.
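
The allocation argument can be sketched as follows. When the tag set is a block of constants ending in a sentinel, the number of tags is a compile-time constant, so per-tag tables can be fixed-size arrays rather than maps or growing slices. The tag names, the maxTag sentinel, and the tagCounts type below are illustrative assumptions, not lingo's actual declarations.

```go
package main

import "fmt"

// POSTag here is a hypothetical tag type; the names below are illustrative only.
type POSTag byte

const (
	Noun POSTag = iota
	Verb
	Det
	Adp
	maxTag // sentinel: the number of tags, known at compile time
)

// tagCounts is a fixed-size array sized by the sentinel, so it needs
// no map lookups and never has to grow.
type tagCounts [maxTag]int

// count tallies how often each tag occurs.
func count(tags []POSTag) tagCounts {
	var c tagCounts
	for _, t := range tags {
		c[t]++
	}
	return c
}

func main() {
	c := count([]POSTag{Det, Noun, Verb, Adp, Det, Noun})
	fmt.Println("nouns:", c[Noun], "determiners:", c[Det], "table size:", len(c))
}
```

Because the tags are constants rather than package-level variables, no caller can reassign them at run time.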

The following build tags are supported:

  • stanfordtags
  • universaltags
  • stanfordrel
  • universalrel

To use a specific tagset or relset, build your program thusly: go build -tags='stanfordtags'.

By default, the Universal Dependencies versions of the tag and dependency relation types are used.
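
For readers unfamiliar with Go build tags, the sketch below shows the general mechanism with a single file that is compiled only when the stanfordtags tag is absent. The tag name comes from the list above; the file layout and the tagsetName constant are assumptions made for illustration and do not describe how lingo organises its sources.

```go
//go:build !stanfordtags

// This file is compiled unless the stanfordtags build tag is set.
// A sibling file (not shown) guarded by "//go:build stanfordtags" would
// declare the same identifiers with the alternative values.
package main

import "fmt"

// tagsetName is a hypothetical constant used only for this illustration.
const tagsetName = "universal"

func main() {
	fmt.Println("compiled with tagset:", tagsetName)
}
```

Building this sketch with plain go build selects the file; building with -tags='stanfordtags' excludes it, which is how a default and an alternative can coexist in one package.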

Lexer

You should also note that the tokenizer, lingo/lexer, is not your usual run-of-the-mill NLP tokenizer. It tokenizes by space, with some specific rules for English. It was inspired by Rob Pike's talk on lexers. I thought it'd be cool to write something like that for NLP.

The test cases in package lingo/lexer showcase how it handles Unicode and other pathological English.
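
The state-function style from that talk can be sketched in a few lines. This is a minimal illustration under stated assumptions, not lingo/lexer's implementation: every name below is hypothetical, and the only rules are "split on whitespace" and "emit each punctuation mark as its own token". English-specific rules such as contractions are not handled.

```go
package main

import (
	"fmt"
	"unicode"
)

// stateFn is a lexer state; it does some work and returns the next state.
type stateFn func(*lexer) stateFn

// lexer is a hypothetical, simplified lexer that emits tokens on a channel.
type lexer struct {
	input  []rune
	start  int // start of the pending token
	pos    int // current read position
	Output chan string
}

// emit sends the pending token, if any, and starts a new one.
func (l *lexer) emit() {
	if l.pos > l.start {
		l.Output <- string(l.input[l.start:l.pos])
	}
	l.start = l.pos
}

// lexText scans a word until whitespace, punctuation, or end of input.
func lexText(l *lexer) stateFn {
	for l.pos < len(l.input) {
		r := l.input[l.pos]
		switch {
		case unicode.IsSpace(r):
			l.emit()
			return lexSpace
		case unicode.IsPunct(r):
			l.emit()
			return lexPunct
		}
		l.pos++
	}
	l.emit()
	return nil
}

// lexSpace skips a run of whitespace without emitting anything.
func lexSpace(l *lexer) stateFn {
	for l.pos < len(l.input) && unicode.IsSpace(l.input[l.pos]) {
		l.pos++
	}
	l.start = l.pos
	return lexText
}

// lexPunct emits a single punctuation mark as its own token.
func lexPunct(l *lexer) stateFn {
	l.pos++
	l.emit()
	return lexText
}

// Run drives the state machine and closes Output when the input is exhausted.
func (l *lexer) Run() {
	defer close(l.Output)
	for state := stateFn(lexText); state != nil; {
		state = state(l)
	}
}

// tokenize runs the lexer in a goroutine and collects its tokens.
func tokenize(s string) []string {
	l := &lexer{input: []rune(s), Output: make(chan string)}
	go l.Run()
	var toks []string
	for t := range l.Output {
		toks = append(toks, t)
	}
	return toks
}

func main() {
	fmt.Printf("%q\n", tokenize("The cat sat on the mat."))
}
```

Working on runes rather than bytes keeps multi-byte Unicode characters intact, and emitting on a channel lets a downstream stage start before lexing has finished.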

Contributing

See CONTRIBUTING.md for more info.

Licence

This package is licenced under the MIT licence.
