  • Stars: 11,371
  • Rank: 2,917 (Top 0.06 %)
  • Language: JavaScript
  • License: MIT License
  • Created over 13 years ago
  • Updated 4 months ago

Repository Details

modest natural-language processing
compromise
npm install compromise

don't you find it strange,
    how easy text is to make,

     ᔐᖜ   and how hard it is to actually parse and use?

compromise tries its best to turn text into data.
it makes limited and sensible decisions.
it's not as smart as you'd think.

import nlp from 'compromise'

let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
// 'she sold seashells by the seashore.'

don't be fancy, at all:
if (doc.has('simon says #Verb')) {
  return true
}

grab parts of the text:
let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
// "the blurst of times?"

and get data:

import plg from 'compromise-speech'
nlp.extend(plg)

let doc = nlp('Milwaukee has certainly had its share of visitors..')
doc.compute('syllables')
doc.places().json()
/*
[{
  "text": "Milwaukee",
  "terms": [{ 
    "normal": "milwaukee",
    "syllables": ["mil", "wau", "kee"]
  }]
}]
*/

avoid the problems of brittle parsers:

let doc = nlp("we're not gonna take it..")

doc.has('gonna') // true
doc.has('going to') // true (implicit)

// transform
doc.contractions().expand()
doc.text()
// 'we are not going to take it..'

and whip stuff around like it's data:

let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(20)
doc.text()
// 'ninety five thousand and seventy two'

-because it actually is-

let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
// 'the purple dinosaurs'

Use it on the client-side:

<script src="https://unpkg.com/compromise"></script>
<script>
  var doc = nlp('two bottles of beer')
  doc.numbers().minus(1)
  document.body.innerHTML = doc.text()
  // 'one bottle of beer'
</script>

or likewise:

import nlp from 'compromise'

var doc = nlp('London is calling')
doc.verbs().toNegative()
// 'London is not calling'

compromise is ~250kb (minified).

it's pretty fast. It can run on keypress.

it works mainly by conjugating all forms of a basic word list.

The final lexicon is ~14,000 words.

you can read more about how it works, here. it's weird.
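
As a rough illustration of the "conjugate a basic word list" idea, here is a toy sketch. The suffix rules, the irregulars table, and the function names are all made up for this example, and are far cruder than whatever compromise actually does internally.

```javascript
// Toy illustration only: hypothetical rules, not compromise's real conjugator.
const irregulars = {
  sell: { PastTense: 'sold' },
}

// conjugate one infinitive into a few tagged forms
function conjugate(inf) {
  // drop a trailing 'e' before '-ing' and '-ed' (bake → baking, baked)
  const stem = inf.endsWith('e') ? inf.slice(0, -1) : inf
  const forms = {
    Infinitive: inf,
    PresentTense: inf + 's',
    Gerund: stem + 'ing',
    PastTense: stem + 'ed',
  }
  // irregular forms override the suffix guesses
  return Object.assign(forms, irregulars[inf] || {})
}

// expand a small list of base verbs into a word → tag lookup
function buildLexicon(verbs) {
  const lexicon = {}
  for (const verb of verbs) {
    for (const [tag, word] of Object.entries(conjugate(verb))) {
      lexicon[word] = tag
    }
  }
  return lexicon
}

const lexicon = buildLexicon(['walk', 'bake', 'sell'])
console.log(Object.keys(lexicon).length) // 12
console.log(lexicon.sold) // 'PastTense'
console.log(lexicon.baking) // 'Gerund'
```

Three base words become twelve lexicon entries here, which gives a feel for how a short word list can grow into a much larger lookup.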

okay -

compromise/one

A tokenizer of words, sentences, and punctuation.

import nlp from 'compromise/one'

let doc = nlp("Wayne's World, party time")
let data = doc.json()
/* [{ 
  normal:"wayne's world party time",
    terms:[{ text: "Wayne's", normal: "wayne" }, 
      ...
      ] 
  }]
*/

compromise/one splits your text up, wraps it in a handy API,

    and does nothing else -

/one is quick - most sentences take a 10th of a millisecond.

It can do ~1mb of text a second - or 10 wikipedia pages.

Infinite Jest takes 3s.

You can also parallelize, or stream text to it with compromise-speed.
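
To check speed figures like these on your own machine, a small timing harness can help. This is a minimal sketch: `throughput` is a hypothetical helper, not part of compromise, and it times a stand-in parser so it runs with no dependencies. With compromise installed, you would presumably pass the `nlp` import in place of `fakeParse`.

```javascript
// `parse` is any function that takes a string.
// Assumption: this sketch does not load compromise itself.
function throughput(parse, text, runs = 5) {
  const start = performance.now()
  for (let i = 0; i < runs; i += 1) {
    parse(text)
  }
  const msPerRun = (performance.now() - start) / runs
  // string length is only a rough stand-in for size in bytes
  const megabytes = text.length / 1e6
  const mbPerSecond = msPerRun > 0 ? megabytes / (msPerRun / 1000) : Infinity
  return { msPerRun, mbPerSecond }
}

// stand-in parser, so the sketch runs on its own
const fakeParse = (str) => str.split(/\s+/)
const sample = 'she sells seashells by the seashore. '.repeat(5000)
console.log(throughput(fakeParse, sample))
```

Results vary a lot between machines and runtimes, so treat any single number as a ballpark.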

compromise/two

A part-of-speech tagger, and grammar-interpreter.

import nlp from 'compromise/two'

let doc = nlp("Wayne's World, party time")
let str = doc.match('#Possessive #Noun').text()
// "Wayne's World"

compromise/two automatically calculates the very basic grammar of each word.

this is more useful than people sometimes realize.

Light grammar helps you write cleaner templates, and get closer to the information.

compromise has 83 tags, arranged in a handsome graph.

#FirstName → #Person → #ProperNoun → #Noun
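
One way to picture that chain of tags is as a tree where each tag points to its parent. The sketch below is a hypothetical mini-version with only the four tags named above. The `parents` table and the `lineage` and `isA` helpers are illustrative names, not compromise internals.

```javascript
// Hypothetical mini tag-graph: each tag points to its parent tag.
const parents = {
  FirstName: 'Person',
  Person: 'ProperNoun',
  ProperNoun: 'Noun',
  Noun: null,
}

// walk upward from a tag, collecting every tag it implies
function lineage(tag) {
  const chain = []
  for (let t = tag; t; t = parents[t]) {
    chain.push(t)
  }
  return chain
}

// does `tag` imply `ancestor`?
function isA(tag, ancestor) {
  return lineage(tag).includes(ancestor)
}

console.log(lineage('FirstName').map((t) => '#' + t).join(' → '))
// '#FirstName → #Person → #ProperNoun → #Noun'
console.log(isA('FirstName', 'Noun')) // true
console.log(isA('Noun', 'Person')) // false
```

This is why a match for #Noun can also pick up a word tagged #FirstName: the more specific tag carries its ancestors with it.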

you can see the grammar of each word by running doc.debug()

you can see the reasoning for each tag with nlp.verbose('tagger').

if you prefer Penn tags, you can derive them with:

let doc = nlp('welcome thrillho')
doc.compute('penn')
doc.json()

compromise/three

Phrase and sentence tooling.

import nlp from 'compromise/three'

let doc = nlp("Wayne's World, party time")
let str = doc.people().normalize().text()
// "wayne"

compromise/three is a set of tooling to zoom into and operate on parts of a text.

.numbers() grabs all the numbers in a document, for example - and extends it with new methods, like .subtract().

When you have a phrase, or group of words, you can see additional metadata about it with .json()

let doc = nlp("four out of five dentists")
console.log(doc.fractions().json())
/*[{
    text: 'four out of five',
    terms: [ [Object], [Object], [Object], [Object] ],
    fraction: { numerator: 4, denominator: 5, decimal: 0.8 }
  }
]*/

let doc = nlp("$4.09CAD")
doc.money().json()
/*[{
    text: '$4.09CAD',
    terms: [ [Object] ],
    number: { prefix: '$', num: 4.09, suffix: 'cad'}
  }
]*/

API

Compromise/one

Output
  • .text() - return the document as text
  • .json() - return the document as data
  • .debug() - pretty-print the interpreted document
  • .out() - a named or custom output
  • .html({}) - output custom html tags for matches
  • .wrap({}) - produce custom output for document matches
Utils
  • .found [getter] - is this document empty?
  • .docs [getter] - get term objects as json
  • .length [getter] - count the # of characters in the document (string length)
  • .isView [getter] - identify a compromise object
  • .compute() - run a named analysis on the document
  • .clone() - deep-copy the document, so that no references remain
  • .termList() - return a flat list of all Term objects in match
  • .cache({}) - freeze the current state of the document, for speed-purposes
  • .uncache() - un-freezes the current state of the document, so it may be transformed
Accessors
Match

(match methods use the match-syntax.)

  • .match('') - return a new Doc, with this one as a parent
  • .not('') - return all results except for this
  • .matchOne('') - return only the first match
  • .if('') - return each current phrase, only if it contains this match ('only')
  • .ifNo('') - Filter-out any current phrases that have this match ('notIf')
  • .has('') - Return a boolean if this match exists
  • .before('') - return all terms before a match, in each phrase
  • .after('') - return all terms after a match, in each phrase
  • .union() - return combined matches without duplicates
  • .intersection() - return only duplicate matches
  • .complement() - get everything not in another match
  • .settle() - remove overlaps from matches
  • .growRight('') - add any matching terms immediately after each match
  • .growLeft('') - add any matching terms immediately before each match
  • .grow('') - add any matching terms before or after each match
  • .sweep(net) - apply a series of match objects to the document
  • .splitOn('') - return a Document with three parts for every match ('splitOn')
  • .splitBefore('') - partition a phrase before each matching segment
  • .splitAfter('') - partition a phrase after each matching segment
  • .lookup([]) - quick find for an array of string matches
  • .autoFill() - create type-ahead assumptions on the document
Tag
  • .tag('') - Give all terms the given tag
  • .tagSafe('') - Only apply tag to terms if it is consistent with current tags
  • .unTag('') - Remove this tag from the given terms
  • .canBe('') - return only the terms that can be this tag
Case
Whitespace
  • .pre('') - add this punctuation or whitespace before each match
  • .post('') - add this punctuation or whitespace after each match
  • .trim() - remove start and end whitespace
  • .hyphenate() - connect words with hyphen, and remove whitespace
  • .dehyphenate() - remove hyphens between words, and set whitespace
  • .toQuotations() - add quotation marks around these matches
  • .toParentheses() - add brackets around these matches
Loops
  • .map(fn) - run each phrase through a function, and create a new document
  • .forEach(fn) - run a function on each phrase, as an individual document
  • .filter(fn) - return only the phrases that return true
  • .find(fn) - return a document with only the first phrase that matches
  • .some(fn) - return true or false if there is one matching phrase
  • .random(fn) - sample a subset of the results
Insert
Transform
Lib

(these methods are on the main nlp object)

compromise/two:

Contractions

compromise/three:

Nouns
Verbs
Numbers
Sentences
Adjectives
Misc selections

.extend():

This library comes with a considerate, common-sense baseline for english grammar.

You're free to change, or lay-waste to any settings - which is the fun part actually.

the easiest part is just to suggest tags for any given words:

let myWords = {
  kermit: 'FirstName',
  fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)

or make heavier changes with a compromise-plugin.

import nlp from 'compromise'
nlp.extend({
  // add new tags
  tags: {
    Character: {
      isA: 'Person',
      notA: 'Adjective',
    },
  },
  // add or change words in the lexicon
  words: {
    kermit: 'Character',
    gonzo: 'Character',
  },
  // change inflections
  irregulars: {
    get: {
      pastTense: 'gotten',
      gerund: 'gettin'
    },
  },
  // add new methods to compromise
  api: (View) => {
    View.prototype.kermitVoice = function () {
      this.sentences().prepend('well,')
      this.match('i [(am|was)]').prepend('um,')
      return this
    }
  }
})

Docs:

gentle introduction:
Documentation:
Concepts          API                   Plugins
Accuracy          Accessors             Adjectives
Caching           Constructor-methods   Dates
Case              Contractions          Export
Filesize          Insert                Hash
Internals         Json                  Html
Justification     Lists                 Keypress
Lexicon           Loops                 Ngrams
Match-syntax      Match                 Numbers
Performance       Nouns                 Paragraphs
Plugins           Output                Scan
Projects          Selections            Sentences
Tagger            Sorting               Syllables
Tags              Split                 Pronounce
Tokenization      Text                  Strict
Named-Entities    Utils                 Penn-tags
Whitespace        Verbs                 Typeahead
World data        Normalization         Sweep
Fuzzy-matching    Typescript            Mutation
Root-forms        Character Offsets
Talks:
Articles:
Some fun Applications:
Comparisons

Plugins:

These are some helpful extensions:

Dates

npm install compromise-dates

Stats

npm install compromise-stats

Speech

npm install compromise-speech

Wikipedia

npm install compromise-wikipedia


Typescript

we're committed to typescript/deno support, both in main and in the official-plugins:

import nlp from 'compromise'
import stats from 'compromise-stats'

const nlpEx = nlp.extend(stats)

nlpEx('This is type safe!').ngrams({ min: 1 })

Limitations:

  • slash-support: We currently split slashes up as different words, like we do for hyphens. so things like this don't work: nlp('the koala eats/shoots/leaves').has('koala leaves') //false

  • inter-sentence match: By default, sentences are the top-level abstraction. Inter-sentence, or multi-sentence matches aren't supported without a plugin: nlp("that's it. Back to Winnipeg!").has('it back')//false

  • nested match syntax: the danger (and beauty) of regex is that you can recurse indefinitely. Our match syntax is much weaker. Things like this are not (yet) possible: doc.match('(modern (major|minor))? general'). Complex matches must be achieved with successive .match() statements.

  • dependency parsing: Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do. We should! Help wanted with this.
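
For the slash limitation, one possible workaround is to expand the slash-separated alternatives into separate strings before parsing. This is a sketch under assumptions: `expandSlashes` is a hypothetical helper, not a compromise feature, and its letters-only pattern will also split things like URL paths, so it suits plain prose only.

```javascript
// Hypothetical pre-processing helper: 'a/b c' → ['a c', 'b c']
function expandSlashes(text) {
  // find the first run of letters joined by slashes, like 'eats/shoots/leaves'
  const match = text.match(/[a-z]+(?:\/[a-z]+)+/i)
  if (!match) {
    return [text]
  }
  const [group] = match
  const before = text.slice(0, match.index)
  const after = text.slice(match.index + group.length)
  // swap in each alternative, then recurse to handle any later slash groups
  return group.split('/').flatMap((word) => expandSlashes(before + word + after))
}

console.log(expandSlashes('the koala eats/shoots/leaves'))
// [ 'the koala eats', 'the koala shoots', 'the koala leaves' ]
```

Each variant could then be parsed on its own, for example by testing `nlp(str).has('koala leaves')` across the returned strings, at the cost of parsing the text several times.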

FAQ

    ☂️ Isn't javascript too...

      yeah it is!
      it wasn't built to compete with NLTK, and may not fit every project.
      string processing is synchronous too, and parallelizing node processes is weird.
      See here for information about speed & performance, and here for project motivations

    💃 Can it run on my arduino-watch?

      Only if it's water-proof!
      Read quick start for running compromise in workers, mobile apps, and all sorts of funny environments.

    🌎 Compromise in other Languages?

    Partial builds?

      we do offer a tokenize-only build, which has the POS-tagger pulled-out.
      but otherwise, compromise isn't easily tree-shaken.
      the tagging methods are competitive, and greedy, so it's not recommended to pull things out.
      Note that without a full POS-tagging, the contraction-parser won't work perfectly. ((spencer's cool) vs. (spencer's house))
      It's recommended to run the library fully.
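
The (spencer's cool) vs. (spencer's house) case can be sketched with a toy rule. Everything here is illustrative: the tags are supplied by hand, and `expandApostropheS` is a hypothetical function, not the library's contraction-parser. It only shows why the tag of the next word matters.

```javascript
// Toy sketch: decide what "'s" means from the tag of the following word.
// The tags are typed in by hand; normally they would come from a tagger.
function expandApostropheS(terms) {
  return terms
    .map((term, i) => {
      if (!term.text.endsWith("'s")) {
        return term.text
      }
      const next = terms[i + 1]
      // followed by a noun: probably a possessive, so leave it alone
      if (next && next.tag === 'Noun') {
        return term.text
      }
      // otherwise guess that it is a contraction of "is"
      return term.text.slice(0, -2) + ' is'
    })
    .join(' ')
}

console.log(expandApostropheS([
  { text: "spencer's", tag: 'Noun' },
  { text: 'cool', tag: 'Adjective' },
])) // 'spencer is cool'

console.log(expandApostropheS([
  { text: "spencer's", tag: 'Noun' },
  { text: 'house', tag: 'Noun' },
])) // "spencer's house"
```

Without tags on 'cool' and 'house', the two inputs look identical to a tokenizer, which is the ambiguity described above.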

MIT
