salient

Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js

salient is a natural language processing and machine learning toolkit. It covers many common tasks, including sentiment analysis, part-of-speech tagging, tokenization, neural networks, regression analysis, Wiktionary parsing, logistic regression, language modeling, minimal perfect hashing (mphf), radix trees, vocabulary building, and the potential for more awesomeness. It can be used for many classification, categorization, and common text processing tasks, all in Node.js :D

npm install salient

Features

  • Sentiment Analysis (includes negation semantics, amplifiers, sarcasm detection)
  • Various Machine Learning Algorithms with regularization, cross validation, learning curves, normalization
  • Neural Networks (Feedforward N-Layers with Y-Units)
  • Logistic Regression
  • Linear Regression
  • Language Modeling (minimal perfect hashing, gaddag, radix trees, tries)
  • Hidden Markov Models
  • Tri-gram HMM Part of Speech Tagger
  • Wiktionary Parsing
  • Tokenization (Tweets, URLs, Word Punctuation, RegExp, Emoticons, Wiki Articles, HTML)
  • Corpus Parsing / Building
  • Brown Corpus Parsing / Analyzer
  • Penn TreeBank Corpus Parsing / Analyzer
  • Twitter TreeBank Corpus Parsing / Analyzer
  • IULA Spanish LSP TreeBank Parsing / Analyzer
  • Universal Part of Speech Tagset Mappings
  • Tree Tagger Part of Speech Tagger
  • Text Analyzer (concepts, relations (concept-relationship-concept), filtering, stop-words, tagging)
  • Corpus Coverage Testing
  • Vocabulary Merging
  • Sentiment Terms
  • Web Crawling
  • Document Search (backed by Redis)

Tokenization

There are plenty of libraries that perform tokenization, and this one is no different, except that it also performs the tokenization steps necessary to clean up random HTML, XML, Wiki, Twitter and other sources. More examples are in the specs directory. Tokenizers in salient are built on top of each other and include the following:

  • Tokenizer (lib/salient/tokenizers/tokenizer.js) abstract
  • RegExpTokenizer (lib/salient/tokenizers/regexp_tokenizer.js) extends: Tokenizer
var salient = require('salient');
var tokenizer = new salient.tokenizers.RegExpTokenizer({ pattern: /\W+/ });
tokenizer.tokenize('these are things');
> ['these', 'are', 'things']
  
  • UrlTokenizer (lib/salient/tokenizers/url_tokenizer.js) extends: Tokenizer
var salient = require('salient');
var tokenizer = new salient.tokenizers.UrlTokenizer();
tokenizer.tokenize('Some text a http://en.wikipedia.org/wiki/Liverpool_F.C._(disambiguation) wikipedia url in it.');
> ['http://en.wikipedia.org/wiki/Liverpool_F.C._(disambiguation)']
  
  • WordPunctTokenizer (lib/salient/tokenizers/wordpunct_tokenizer.js) extends: RegExpTokenizer
    • Handles time, numerics (including 1st, 2nd, etc.), numerics with commas/decimals/percents, $, words with hyphenation, words with and without accents, with and without apostrophes, punctuation and optional emoticon preservation.
var salient = require('salient');
var tokenizer = new salient.tokenizers.WordPunctTokenizer();
tokenizer.tokenize('From 12:00 am-11:59 pm. on Nov. 12th, you can make donation online to support Wylie Center.')
> [ 'From', '12:00 am', '-', '11:59 pm.', 'on', 'Nov', '.', '12th', ',', 'you', 'can', 'make', 'donation', 'online', 'to', 'support', 'Wylie', 'Center', '.' ]

// preserve emoticons
tokenizer = new salient.tokenizers.WordPunctTokenizer({ preserveEmoticons: true })
tokenizer.tokenize('data, here to clean.Wouldn\'t you say so? :D I have $20.45, are you ok? >:(');
> [ 'data', ',', 'here', 'to', 'clean', '.', 'Wouldn', '\'', 't', 'you', 'say', 'so', '? ', ':D', 'I', 'have', ' $', '20.45', ',', 'are', 'you', 'ok', '?', '>:(' ]

  • EmoticonTokenizer (lib/salient/tokenizers/emoticon_tokenizer.js) extends: RegExpTokenizer
    • Handles the basic emoticon patterns shown in the given file
var salient = require('salient');
var tokenizer = new salient.tokenizers.EmoticonTokenizer();
tokenizer.tokenize('RT @sampleusername: this is typical :)')
> [ ':)' ]
  • ArticleTokenizer (lib/salient/tokenizers/article_tokenizer.js) extends: Tokenizer
    • Includes: UrlTokenizer, WordPunctTokenizer, RegExpTokenizer
    • Leverages: underscore
var salient = require('salient');
var tokenizer = new salient.tokenizers.ArticleTokenizer({ compressWhitespace: true });
var text = "In 1927, working with this diagrammatic form of knots, [[J.W. Alexander]] and G. B. Briggs, and independently [[Kurt Reidemeister]], demonstrated that two knot diagrams belonging to the same knot can be related by a sequence of three kinds of moves on the diagram, shown below. These operations, now called the '''Reidemeister moves''', are:" + 
'<ol style="list-style-type:upper-Roman">' + 
'\r\n        <li> Twist and untwist in either direction.</li>' + 
'\r\n        <li> Move one strand completely over another.</li>' + 
'\r\n        <li> Move a strand completely over or under a crossing.</li>' + 
'        </ol>' + 
'' + 
'    {| align="center" style="text-align:center"' + 
"        |+ '''Reidemeister moves'''" + 
'        |- style="padding:1em"' + 
"        | [[Image:Reidemeister move 1.png|130px|]] [[File:Frame left.png]] || [[Image:Reidemeister move 2.png|210px]]" + 
"        |-" + 
"        | <u>Type I</u> || <u>Type II</u>" + 
'        |- style="padding:1em"' + 
'        | colspan="2" | [[Image:Reidemeister move 3.png|360px]]' + 
'        |-' + 
'        | colspan="2" | <u>Type III</u>' + 
"        |}";
tokenizer.clean(text);
> "In 1927, working with this diagrammatic form of knots, J.W. Alexander and G. B. Briggs, and independently Kurt Reidemeister, demonstrated that two knot diagrams belonging to the same knot can be related by a sequence of three kinds of moves on the diagram, shown below. These operations, now called the '''Reidemeister moves''', are: Twist and untwist in either direction. Move one strand completely over another. Move a strand completely over or under a crossing."
  • TweetTokenizer (lib/salient/tokenizers/tweet_tokenizer.js) extends: ArticleTokenizer
var salient = require('salient');
var tokenizer = new salient.tokenizers.TweetTokenizer();
tokenizer.tokenize('RT @sampleusername: this is typical :)')
> [ 'RT', '@sampleusername', ':', 'this', 'is', 'typical', ':)' ]

Part of Speech Tagging

Part of speech tagging is done primarily through the use of a trigram hidden Markov model. While many other methods have been developed since, the trigram HMM seems to be the easiest to implement while maintaining effective accuracy. The model was built with the help of several online resources, including bootstrapping the vocabulary using Wiktionary (https://www.wiktionary.org/). Seeding the model with an existing dictionary of sorts is a common alternative to fully unsupervised learning, as it gives the model a bit of an edge. In some cases, the dictionary can be generated from a part of speech corpus (tagged manually or automatically).

On top of Wiktionary, I am using several corpora to build the English language model: the Brown Corpus, the Penn TreeBank, and the Twitter TreeBank. These treebanks provide a resource for calculating and training the model in supervised learning cases. The actual tagging is done using the Viterbi path-finding algorithm, implemented for all standard models. The Spanish model is trained using the IULA Spanish LSP TreeBank. You will notice both models are stored in the bin directory.

More information on the building of this model is provided in Trigram Hidden Markov Model Part of Speech Tagger.

var salient = require('salient');
var hmmTagger = new salient.tagging.HmmTagger();
hmmTagger.tag([ 'How', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'looks', 'beautiful', 'today', '!' ]);
> [ 'ADV', 'VERB', 'PRON', 'VERB', 'NOUN', '.', 'DET', 'NOUN', 'VERB', 'ADJ', 'NOUN', '.' ]
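
For intuition, here is a minimal sketch of the Viterbi recurrence, simplified to bigram transitions for brevity (the actual tagger works over trigrams, which simply index pairs of previous tags). The trans and emit log-probability tables and the '<s>' start state here are hypothetical stand-ins, not salient's internal API:

var lp = function (table, a, b) {
    // assumed log-probability lookup; missing entries count as impossible events
    return table[a] && table[a][b] !== undefined ? table[a][b] : -Infinity;
};

function viterbi(words, tags, trans, emit) {
    var V = [{}], back = [{}];
    tags.forEach(function (t) {
        V[0][t] = lp(trans, '<s>', t) + lp(emit, t, words[0]);
    });
    for (var i = 1; i < words.length; i++) {
        V[i] = {}; back[i] = {};
        tags.forEach(function (t) {
            var best = -Infinity, bestPrev = null;
            tags.forEach(function (p) {
                var s = V[i - 1][p] + lp(trans, p, t) + lp(emit, t, words[i]);
                if (s > best) { best = s; bestPrev = p; }
            });
            V[i][t] = best; back[i][t] = bestPrev;
        });
    }
    // pick the best final tag, then follow the back-pointers in reverse
    var last = tags.reduce(function (a, b) {
        return V[words.length - 1][a] >= V[words.length - 1][b] ? a : b;
    });
    var path = [last];
    for (var j = words.length - 1; j > 0; j--) path.unshift(back[j][path[0]]);
    return path;
}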

Glossary

Glossary is sometimes used for looking up concepts, terms or relationships between terms. It is far from perfect, but it makes for a good use case in information retrieval. You may find other libraries that do this better, but I'm currently using this part of the library to build towards sentiment analysis use cases.

Glossary catalogues the things I found useful when building the sentiment analysis portion: copular verbs, linking verbs, terms that are often filtered (i.e. stop terms), question terms, time-sensitive nouns, amplifiers, clauses, coordinating conjunctions, negations, conditionals (ORs), and contractions. All of these proved very useful in the sentiment analysis stage, where the particular algorithm I implemented is described in more detail below.

As you can see in the code sample below, I have done a bit of chunking for terms, and with some common filtering rules I can combine filtered terms, determiners with their nouns, etc. Additionally, the output of glossary helps me debug when something gets parsed oddly. With this utility I can follow the flow of logic in a given sentence and more easily look at relationships between terms.

var salient = require('salient');
var glossary = new salient.glossary.Glossary();
glossary.parse("This is going to be an awesome test");
glossary.toJSON();

{ term: 'This is going to be',
  distinct: 'is going to be',
  tag: 'VERB',
  children:
   [ { term: 'is',
       tag: 'VERB',
       position: 1,
       isVerb: true,
       isQTerm: true,
       isCop: true,
       isLink: true,
       isFiltered: true },
     { term: 'going', tag: 'VERB', position: 2, isVerb: true },
     { term: 'to',
       tag: 'PRT',
       position: 3,
       isPrt: true,
       isTo: true,
       isFiltered: true },
     { term: 'be',
       tag: 'VERB',
       position: 4,
       isVerb: true,
       isFiltered: true } ],
  beginsDet: true,
  isVerb: true,
  orig:
   { term: 'This',
     distinct: 'this',
     tag: 'DET',
     isDet: true,
     isFiltered: true },
  termMap: [ 'This', 'is', 'going', 'to', 'be' ] }
{ term: 'an awesome test',
  tag: 'NOUN',
  position: 5,
  children:
   [ { term: 'awesome',
       tag: 'ADJ',
       position: 6,
       isAdj: true,
       isAdjCard: true },
     { term: 'test',
       tag: 'NOUN',
       position: 7,
       isNoun: true,
       isCop: true } ],
  beginsDet: true,
  isNoun: true,
  orig:
   { term: 'an',
     tag: 'DET',
     position: 5,
     isDet: true,
     isFiltered: true },
  termMap: [ 'an', 'awesome', 'test' ],
  distinct: 'awesome test' }

The flow of logic is further helped when I include negations of any sort, as shown below. The inclusion of a negation breaks the flow of our logic and instead negates the rest of the sentence. This is explored further in the sentiment analysis stage of the library.

var salient = require('salient');
var glossary = new salient.glossary.Glossary();
glossary.parse("This is never going to be an awesome test");
glossary.toJSON();

{ term: 'This is',
  distinct: 'is',
  tag: 'VERB',
  children:
   [ { term: 'is',
       tag: 'VERB',
       position: 1,
       isVerb: true,
       isQTerm: true,
       isCop: true,
       isLink: true,
       isFiltered: true } ],
  beginsDet: true,
  isVerb: true,
  isFiltered: true,
  orig:
   { term: 'This',
     distinct: 'this',
     tag: 'DET',
     isDet: true,
     isFiltered: true },
  termMap: [ 'This', 'is' ] }
{ term: 'never',
  tag: 'ADV',
  position: 2,
  isAdv: true,
  isNeg: true }
{ term: 'going to be',
  tag: 'VERB',
  position: 3,
  children:
   [ { term: 'to',
       tag: 'PRT',
       position: 4,
       isPrt: true,
       isTo: true,
       isFiltered: true },
     { term: 'be',
       tag: 'VERB',
       position: 5,
       isVerb: true,
       isFiltered: true } ],
  isVerb: true,
  isLink: true,
  orig: { term: 'going', tag: 'VERB', position: 3, isVerb: true },
  termMap: [ 'going', 'to', 'be' ] }
{ term: 'an awesome test',
  tag: 'NOUN',
  position: 6,
  children:
   [ { term: 'awesome',
       tag: 'ADJ',
       position: 7,
       isAdj: true,
       isAdjCard: true },
     { term: 'test',
       tag: 'NOUN',
       position: 8,
       isNoun: true,
       isCop: true } ],
  beginsDet: true,
  isNoun: true,
  orig:
   { term: 'an',
     tag: 'DET',
     position: 6,
     isDet: true,
     isFiltered: true },
  termMap: [ 'an', 'awesome', 'test' ],
  distinct: 'awesome test' }

You can additionally use the output of the glossary.parse function to retrieve a simple concepts map. This effectively looks for noun terms within the parsed output of the glossary. The above example would look like so:

var salient = require('salient');
var glossary = new salient.glossary.Glossary();
glossary.parse("This is never going to be an awesome test");
glossary.concepts();

[ 'awesome test' ]

Sentiment Analysis

The approach I took to sentiment analysis builds on top of a rather simple naive Bayes approach: we have a set of buckets classified between varying degrees of positive, negative and neutral, where the terms in each bucket may be up to n-grams. On top of this, the sentiment algorithm combines the use of amplifiers (terms that tend to amplify or show additional excitement on an existing term; i.e. 'crazy good').
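
As a rough illustration of the bucket idea (the tiny tables here are hypothetical, not the shipped model), scoring prefers the longest matching n-gram at each position:

// Hypothetical sketch of bucket-based term scoring; illustrative only.
var buckets = {
    positive: { 'good': 1, 'crazy good': 2, 'love': 1.5 },
    negative: { 'bad': -1, 'screwed up': -1.5 }
};

function lookup(term) {
    for (var k in buckets) {
        if (buckets[k][term] !== undefined) return buckets[k][term];
    }
    return null;
}

function scoreTerms(tokens) {
    var score = 0;
    for (var i = 0; i < tokens.length; i++) {
        var bigram = tokens.slice(i, i + 2).join(' ');
        var w = lookup(bigram);        // prefer the longer n-gram match
        if (w !== null) { score += w; i++; continue; }
        w = lookup(tokens[i]);
        if (w !== null) score += w;
    }
    return score;
}

scoreTerms(['crazy', 'good', 'coffee']);
> 2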

It makes use of most of the features obtained from the glossary above (including negations). The analyzer looks for scorable terms that our Bayes buckets specify, using LCS (longest common substring) matching to determine whether terms are even scorable under the filtering rules. Once we've scored all the possible terms in our text, we go through the text once again, node by node (using a finite state machine), to detect things like conditionals, negations, semantic clauses, amplifiers, inclusions (AND), or final orientation terms such as hashtags that may negate the entire text (common in the case of sarcastic tweets); a simplified sketch of this pass follows below.
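
A heavily simplified sketch of that node-by-node pass (a hypothetical helper, not salient's internals): once a negation is seen, the polarity of subsequent scorable terms flips until a clause boundary resets the state:

// Hypothetical sketch: a tiny state machine that flips polarity after a
// negation until a clause boundary; the real pass also handles conditionals,
// amplifiers, inclusions and hashtag re-orientation.
var NEGATIONS = { 'not': true, 'never': true, 'no': true };

function scoreWithNegation(terms) {
    var sign = 1, total = 0;
    terms.forEach(function (t) {
        if (NEGATIONS[t.term]) { sign = -1; return; }               // enter negated state
        if (t.term === ',' || t.term === '.') { sign = 1; return; } // clause boundary resets
        if (typeof t.score === 'number') total += sign * t.score;
    });
    return total;
}

scoreWithNegation([{ term: 'never' }, { term: 'going' }, { term: 'good', score: 1 }]);
> -1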

Finally, once we've determined the final polarity of our text, we give it a cumulative score. This process also identifies semantic orientation on a per-term basis, which means we can go back and actually see the orientation of individual terms.

var salient = require('salient');
var analyser = new salient.sentiment.BayesSentimentAnalyser();
analyser.classify("That product will never be good at disappointing their users")
> 1.5
analyser.classify("I love how Wall Street screwed things up")
> -1.5
analyser.classify("I'm dying to have a snickers bar")
> 1
analyser.classify("I am loving the 10-second page loads on this site #not")
> -2

While the above test cases are cool, this doesn't mean it is some super magical system that gets it right all the time. Language is complex and finicky, and sentiment analysis requires inductive reasoning along with a load of other AI problems. That being said, a reasonable number of cases is covered in bin/tests.

Notes

It should be noted that most machine learning algorithms are better suited to environments that can take advantage of many cores, as in the case of GPU-accelerated machine learning. Such parallelism is necessary to speed up the learning rate as well as the task at hand, given that many complex linear algebra operations can be performed efficiently in parallel. As a result, this project is more of an example test-case implementation for a wide variety of machine learning and artificial intelligence problems. For more robust implementations, it is recommended that you glean from my implementations and others (e.g. by reference to Andrew Ng's Machine Learning course) and use that within the scope of your projects. That said, other techniques such as map-reduce may improve the performance of running some of these operations within this package on multiple cores and multiple systems in parallel; a sketch of that idea follows below.
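
As a sketch of that fan-out idea in modern Node.js (worker_threads did not exist when this library was written, and none of this is part of salient itself), a map-reduce style classification across cores might look like:

// Hypothetical sketch: map-reduce style classification across CPU cores
// using Node's worker_threads module (not part of salient itself).
var { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
    var docs = ['great product, love it', 'worst support ever'];
    var shards = 2, size = Math.ceil(docs.length / shards);
    var pending = shards, scores = [];
    for (var i = 0; i < shards; i++) {
        // map: each worker classifies one shard of the documents
        var w = new Worker(__filename, { workerData: docs.slice(i * size, (i + 1) * size) });
        w.on('message', function (partial) {
            scores = scores.concat(partial);   // reduce: collect partial scores
            if (--pending === 0) console.log(scores);
        });
    }
} else {
    var salient = require('salient');
    var analyser = new salient.sentiment.BayesSentimentAnalyser();
    parentPort.postMessage(workerData.map(function (d) { return analyser.classify(d); }));
}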

License

Licensed under GPLv2

Copyright (c) 2015 Thomas Holloway (@nyxtom)
