Natural Language Processing in Clojure (opennlp)

Clojure library interface to OpenNLP - https://opennlp.apache.org/

A library to interface with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.

Additional information/documentation:

Read the source from Marginalia

Known Issues

  • When using the treebank-chunker on a sentence, make sure the sentence ends with a period. Without one, the chunker gets confused and drops the last word. Besides, your sentences should all be grammatically correct anyway, right?

Usage from Leiningen:

[clojure-opennlp "0.5.0"] ;; uses OpenNLP 1.9.0

clojure-opennlp works with Clojure 1.5+

Basic Example usage (from a REPL):

(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here

You will need to make the processing functions using the model files. These assume you're running from the root project directory. You can also download the model files from the opennlp project at http://opennlp.sourceforge.net/models-1.5

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

The tool-creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):

(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
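
As a rough illustration of how one creator function can accept either a filename or a model, here is a minimal sketch of a multimethod that dispatches on the class of its argument. The name make-tool and the map standing in for a model are hypothetical and not part of clojure-opennlp; the library's real dispatch may differ.

```clojure
;; Hypothetical sketch: a tool-creator that accepts either a filename
;; (a String) or an already-loaded "model" (here just a map).
(defmulti make-tool class)

;; Given a String, pretend to load a model from that path, then delegate
;; to the method that handles models.
(defmethod make-tool String
  [path]
  (make-tool {:source path}))

;; Given a "model" map, return the processing function.
(defmethod make-tool clojure.lang.IPersistentMap
  [model]
  (fn [text] (str text " (model from " (:source model) ")")))

((make-tool "models/x.bin") "hi")
;; => "hi (model from models/x.bin)"
```

Both call styles end up in the same method, so the returned function behaves identically however the model was supplied.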

Then, use the functions you've created to perform operations on text:

Detecting sentences:

(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
 "And so on and so forth - you get the idea..."]

Tokenizing:

(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]

Detokenizing:

(detokenize ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"])
"Mr. Smith gave a car to his son on Friday."

Ideally, s == (detokenize (tokenize s)). The detokenization model XML file is a work in progress, so please let me know if you run into something that doesn't detokenize correctly in English.

Part-of-speech tagging:

(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])

Name finding:

(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")

Treebank-chunking splits and tags phrases from a pos-tagged sentence. A notable difference is that it returns a list of structs with the :phrase and :tag keys, as seen below:

(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})

For just the phrases:

(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])

And with just strings:

(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
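
Because the chunker output is plain Clojure data, you can also reshape it yourself. The following is a minimal sketch that runs without the library or any model files: the chunks are copied from the output above, and tagged-phrase-strings is a hypothetical helper, not a library function.

```clojure
(require '[clojure.string :as str])

;; Chunker output copied from the example above (first three chunks).
(def chunks
  [{:phrase ["The" "override" "system"], :tag "NP"}
   {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
   {:phrase ["the" "accelerator"], :tag "NP"}])

;; Hypothetical helper: pair each tag with its phrase joined into one string.
(defn tagged-phrase-strings [chunks]
  (map (fn [{:keys [phrase tag]}] [tag (str/join " " phrase)]) chunks))

(tagged-phrase-strings chunks)
;; => (["NP" "The override system"] ["VP" "is meant to deactivate"] ["NP" "the accelerator"])
```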

Document Categorization:

See opennlp.test.tools.train for better usage examples.

(def doccat (make-document-categorizer "my-doccat-model"))

(doccat "This is some good text")
"Happy"

Probabilities of confidence

The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:

(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}

(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}

(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}

(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}

(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
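
One possible use of this metadata is to flag low-confidence elements. The sketch below runs without the library: it builds a stand-in result with with-meta, using the tokenizer probabilities shown above, and assumes that probabilities line up positionally with the elements of the result. The helper names are hypothetical.

```clojure
;; Stand-in for a tokenizer result: a vector carrying :probabilities metadata,
;; using the values shown above for "This is a sentence."
(def tokens
  (with-meta ["This" "is" "a" "sentence" "."]
    {:probabilities '(1.0 1.0 1.0 0.9956236737394807 1.0)}))

;; Hypothetical helper: pair each element with its probability.
;; Assumes the probabilities are in the same order as the elements.
(defn with-probabilities [result]
  (map vector result (:probabilities (meta result))))

;; Hypothetical helper: keep only elements below a confidence threshold.
(defn low-confidence [result threshold]
  (filter (fn [[_ p]] (< p threshold)) (with-probabilities result)))

(low-confidence tokens 0.999)
;; => (["sentence" 0.9956236737394807])
```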

Beam Size

You can rebind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:

(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))

Advance Percentage

You can rebind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:

(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))

Treebank-parsing

Note: Treebank parsing is very memory intensive. Make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m), or you will run out of heap space when using a treebank parser.

Treebank parsing gets its own section due to how complex it is.

Note: the treebank-parser models are not included in the git repo. You will have to download them separately from the OpenNLP project.

Creating it:

(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))

To use the treebank-parser, pass an array of sentences with their tokens separated by whitespace (preferably using tokenize):

(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]

To transform the treebank-parser string into something a little easier for Clojure to work with, use the (make-tree ...) function:

(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}

Here's the datastructure split into a little more readable format:

{:tag TOP
 :chunk {:tag S
         :chunk ({:tag NP
                  :chunk {:tag DT
                          :chunk "This"}}
                 {:tag VP
                  :chunk ({:tag VBZ
                           :chunk "is"}
                          {:tag NP
                           :chunk ({:tag DT
                                    :chunk "a"}
                                   {:tag NN
                                    :chunk "sentence"})})}
                 {:tag .
                  :chunk "."})}}

Hopefully that makes it a little bit clearer: it is a nested map. If anyone has any suggestions for better ways to represent this information, feel free to send me an email or a patch.
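
To show how such a nested map can be traversed, here is a minimal sketch that collects the leaf words in order. It runs without the library: the tree is the structure shown above, quoted on the assumption that tags are symbols (as the unquoted printing suggests), and leaves is a hypothetical helper rather than a library function.

```clojure
;; The parse tree from above, quoted so that the tags stay symbols.
(def tree
  '{:tag TOP
    :chunk {:tag S
            :chunk ({:tag NP :chunk {:tag DT :chunk "This"}}
                    {:tag VP
                     :chunk ({:tag VBZ :chunk "is"}
                             {:tag NP
                              :chunk ({:tag DT :chunk "a"}
                                      {:tag NN :chunk "sentence"})})}
                    {:tag . :chunk "."})}})

;; Hypothetical helper: a :chunk is either a string (a leaf), a single
;; child map, or a sequence of child maps. Collect the leaves in order.
(defn leaves [node]
  (let [chunk (:chunk node)]
    (cond
      (string? chunk) [chunk]
      (map? chunk)    (leaves chunk)
      :else           (mapcat leaves chunk))))

(leaves tree)
;; => ("This" "is" "a" "sentence" ".")
```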

Treebank parsing is considered beta at this point.

Filters

Filtering pos-tagged sequences

(use 'opennlp.tools.filters)

(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday" "NNP"])

(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])

Filtering treebank-chunks

(use 'opennlp.tools.filters)

(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})

Creating your own filters:

(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
  Given a list of pos-tagged elements, return only the determiners in a list.

(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
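
Conceptually, a pos filter keeps the [word tag] pairs whose tag matches a regular expression. The sketch below expresses that idea in plain Clojure so it runs without the library. It illustrates the concept and is not the pos-filter macro's actual implementation; filter-by-tag is a hypothetical name, and the tagged data is copied from the part-of-speech example earlier.

```clojure
;; Tagged output copied from the part-of-speech tagging example.
(def tagged
  [["Mr." "NNP"] ["Smith" "NNP"] ["gave" "VBD"] ["a" "DT"] ["car" "NN"]
   ["to" "TO"] ["his" "PRP$"] ["son" "NN"] ["on" "IN"] ["Friday." "NNP"]])

;; Hypothetical helper: keep the [word tag] pairs whose tag matches re.
(defn filter-by-tag [re tagged-seq]
  (filter (fn [[_ tag]] (re-find re tag)) tagged-seq))

(filter-by-tag #"^DT" tagged)
;; => (["a" "DT"])

(filter-by-tag #"^VB" tagged)
;; => (["gave" "VBD"])
```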

You can also create treebank-chunk filters using (chunk-filter ...)

(chunk-filter fragments #"^FRAG$")

(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
  Given a list of treebank-chunked elements, return only the fragments in a list.
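
The anchors in a pattern such as #"^FRAG$" matter because tags can contain one another. The following sketch uses the chunker output from earlier and a hypothetical filter-chunks-by-tag helper (not the chunk-filter macro itself) to show the difference between an anchored and an unanchored pattern.

```clojure
;; Chunker output copied from the treebank-chunking example.
(def chunks
  [{:phrase ["The" "override" "system"], :tag "NP"}
   {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
   {:phrase ["the" "accelerator"], :tag "NP"}
   {:phrase ["when"], :tag "ADVP"}
   {:phrase ["the" "brake" "pedal"], :tag "NP"}
   {:phrase ["is" "pressed"], :tag "VP"}])

;; Hypothetical helper: keep the chunks whose :tag matches re.
(defn filter-chunks-by-tag [re chunks]
  (filter #(re-find re (:tag %)) chunks))

;; Anchored: only verb phrases.
(map :phrase (filter-chunks-by-tag #"^VP$" chunks))
;; => (["is" "meant" "to" "deactivate"] ["is" "pressed"])

;; Unanchored: "VP" also matches inside "ADVP".
(map :tag (filter-chunks-by-tag #"VP" chunks))
;; => ("VP" "ADVP" "VP")
```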

Being Lazy

There are some functions to help you process text lazily. Depending on the operation desired, use the corresponding function:

#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq

Here's how to use them:

(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences. " "This is the first. " "This is the third."] ["This body has only two. " "Here's the last one."])

(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])

(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))

(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))

Feel free to use the lazy functions, but I'm still not 100% set on the layout, so they may change in the future. (Maybe chaining them so instead of a sequence of sentences it looks like (lazy-chunk (lazy-tag (lazy-tokenize (lazy-get-sentences ...))))).
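
The chaining idea can be sketched with ordinary lazy sequence functions. The example below deliberately does not use the library: naive-sentences and naive-tokenize are crude stand-ins for get-sentences and tokenize, and lazy-pipeline is a hypothetical name, so treat this as an illustration of the shape only.

```clojure
(require '[clojure.string :as str])

;; Stand-ins for the real tools (assumptions for illustration only):
;; split after sentence-ending punctuation, and tokenize on whitespace.
(defn naive-sentences [text] (str/split text #"(?<=[.?!])\s+"))
(defn naive-tokenize [sentence] (str/split sentence #"\s+"))

;; Each stage is a lazy transformation of the previous stage's output.
(defn lazy-pipeline [texts]
  (->> texts
       (mapcat naive-sentences)
       (map naive-tokenize)))

(first (lazy-pipeline ["This body has only two. Here's the last one."]))
;; => ["This" "body" "has" "only" "two."]
```

Because every stage is lazy, the pipeline can be applied to an unbounded sequence of texts and only the part that is consumed gets processed.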

Generating a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))
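
Since the sequence is lazy, anything you want to keep has to be realized before with-open closes the reader. The sketch below shows the pattern using line-seq over an in-memory string as a stand-in for sentence-seq over a file. first-lines is a hypothetical helper, and the same doall idea should apply to a lazy sequence of sentences.

```clojure
(require '[clojure.java.io :as io])
(import 'java.io.StringReader)

;; Hypothetical helper: return the first n lines of some text.
;; doall forces the lazy seq while the reader is still open, so the
;; result remains usable after with-open has closed it.
(defn first-lines [n text]
  (with-open [rdr (io/reader (StringReader. text))]
    (doall (take n (line-seq rdr)))))

(first-lines 2 "one\ntwo\nthree")
;; => ("one" "two")
```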

Training

There is code to allow for training models for each of the tools. Please see the documentation in TRAINING.markdown.

License

Copyright (C) 2010 Matthew Lee Hinman

Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.

Contributors

  • Rob Zinkov - zaxtax
  • Alexandre Patry - apatry

TODO

  • add method to generate lazy sequence of sentences from a file (done!)
  • Detokenizer (still more work to do, but it works for now)
  • Do something with parse-num for treebank parsing
  • Split up treebank stuff into its own namespace (done!)
  • Treebank chunker (done!)
  • Treebank parser (done!)
  • Laziness (done! for now.)
  • Treebank linker (WIP)
  • Phrase helpers for chunker (done!)
  • Figure out what license to use. (done!)
  • Filters for treebank-parser
  • Return multiple probability results for treebank-parser
  • Explore including probability numbers (probability numbers added as metadata)
  • Model training/trainer (done!)
  • Revisit datastructure format for tagged sentences
  • Document beam-size functionality
  • Document advance-percentage functionality
  • Build a full test suite: core tools (done), filters (done), laziness (done), training (pretty much done except for tagging)
