


text-miner

text mining utilities for node.js

Introduction

The text-miner package can be easily installed via npm:

npm install text-miner

To require the module in a project, we can use the expression

var tm = require( 'text-miner' );

Corpus

The fundamental data type in the text-miner module is the Corpus. An instance of this class wraps a collection of documents and provides several methods to interact with this collection and to perform post-processing tasks such as stemming and stopword removal.

A new corpus is created by calling the constructor

var my_corpus = new tm.Corpus([]);

where [] is an array of text documents which form the data of the corpus. The class supports method chaining, such that multiple methods can be invoked one after another, e.g.

my_corpus
	.trim()
	.toLower()
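
Chaining works when every transforming method returns the object it was called on. The following is a minimal sketch of that idea in plain JavaScript; the MiniCorpus class and its documents field are hypothetical names chosen for illustration and are not the text-miner source.

```javascript
// Hypothetical MiniCorpus class, for illustration only; not the text-miner source.
class MiniCorpus {
  constructor(docs) {
    this.documents = docs.slice();
  }
  // Each transforming method updates the documents and returns `this`,
  // which is what makes chained calls possible.
  trim() {
    this.documents = this.documents.map((doc) => doc.trim());
    return this;
  }
  toLower() {
    this.documents = this.documents.map((doc) => doc.toLowerCase());
    return this;
  }
}

const chained = new MiniCorpus(['  Hello World  ', ' Text MINING '])
  .trim()
  .toLower();
console.log(chained.documents); // [ 'hello world', 'text mining' ]
```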

The following methods and properties are part of the Corpus class:

Methods

.addDoc(doc)

Adds a single document to the corpus. The document has to be a string.

.addDocs(docs)

Adds a collection of documents (in form of an array of strings) to the corpus.

.clean()

Strips extra whitespace from all documents, leaving at most one whitespace character between any two other characters.

.map(fun)

Applies the function supplied to fun to each document in the corpus and maps each document to the result of its respective function call.

.removeInterpunctuation()

Removes punctuation characters (! ? . , ; -) from all documents.

.removeNewlines()

Removes newline characters (\n) from all documents.

.removeWords(words[, case_insensitive])

Removes all words in the supplied words array from all documents. This function is usually invoked to remove stopwords. For convenience, the text-miner package ships with a list of stopwords for different languages. These are stored in the STOPWORDS object of the module.

Currently, stopwords for the following languages are included:

STOPWORDS.DE
STOPWORDS.EN
STOPWORDS.ES
STOPWORDS.IT

As a concrete example, we could remove all English stopwords from the corpus my_corpus as follows:

my_corpus.removeWords( tm.STOPWORDS.EN )

The second (optional) parameter of the function, case_insensitive, expects a Boolean indicating whether to ignore case or not. The default value is false.
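
To make the effect of the case_insensitive flag concrete, here is a rough sketch of whole-word removal on a single string. The helper removeWordsFrom is a hypothetical name, and the regular-expression approach is an assumption about how such a function could work, not a description of the library internals.

```javascript
// Hypothetical helper, not part of text-miner: removes whole words from one string.
function removeWordsFrom(doc, words, caseInsensitive = false) {
  // Escape regex metacharacters so that special characters in a word match literally.
  const escaped = words.map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  // \b restricts matches to whole words, so "the" does not match inside "then".
  const flags = caseInsensitive ? 'gi' : 'g';
  const pattern = new RegExp('\\b(?:' + escaped.join('|') + ')\\b', flags);
  return doc.replace(pattern, '');
}

// Collapse the whitespace left behind, similar in spirit to clean() and trim().
const tidy = (s) => s.replace(/\s+/g, ' ').trim();

console.log(tidy(removeWordsFrom('The cat and the hat', ['the', 'and'])));       // The cat hat
console.log(tidy(removeWordsFrom('The cat and the hat', ['the', 'and'], true))); // cat hat
```

With the default setting, the capitalized "The" survives because the comparison is case-sensitive.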

.removeDigits()

Removes any digits occurring in the texts.

.removeInvalidCharacters()

Removes all characters which are unknown or unrepresentable in Unicode.

.stem(type)

Performs stemming of the words in each document. Two stemmers are supported: Porter and Lancaster. The former is the default option. Passing "Lancaster" as the type parameter ensures that the latter is used.

.toLower()

Converts all characters in the documents to lower-case.

.toUpper()

Converts all characters in the documents to upper-case.

.trim()

Strips off whitespace at the beginning and end of each document.

DocumentTermMatrix / TermDocumentMatrix

We can pass a corpus to the constructor DocumentTermMatrix or TermDocumentMatrix in order to create a document-term matrix or a term-document matrix, respectively. Objects derived from either share the same methods, but differ in how the underlying matrix is represented: a DocumentTermMatrix has rows corresponding to documents and columns corresponding to words, whereas a TermDocumentMatrix has rows corresponding to words and columns corresponding to documents.

var terms = new tm.DocumentTermMatrix( my_corpus );

An instance of either DocumentTermMatrix or TermDocumentMatrix has the following properties:

Properties

.vocabulary

An array holding all the words occurring in the corpus, in the order corresponding to the column entries of the document-term matrix.

.data

The document-term or term-document matrix, implemented as a nested array in JavaScript. For a DocumentTermMatrix, rows correspond to individual documents, while each column index corresponds to the respective word in vocabulary. Each entry of data holds the number of times the word appears in the respective document. The array is sparse: each entry which is undefined corresponds to a value of zero.
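
The layout described above can be sketched in a few lines of plain JavaScript. The function buildDocumentTermMatrix is a hypothetical helper, and splitting on whitespace is an assumed tokenization; the library may tokenize differently.

```javascript
// Hypothetical helper, for illustration only; whitespace tokenization is an assumption.
function buildDocumentTermMatrix(docs) {
  const vocabulary = [];
  const index = new Map(); // word -> column index
  const data = docs.map((doc) => {
    const row = []; // sparse: columns that are never assigned stay undefined
    for (const word of doc.split(/\s+/).filter(Boolean)) {
      if (!index.has(word)) {
        index.set(word, vocabulary.length);
        vocabulary.push(word);
      }
      const col = index.get(word);
      row[col] = (row[col] || 0) + 1;
    }
    return row;
  });
  return { vocabulary, data, nDocs: docs.length, nTerms: vocabulary.length };
}

const sketchMatrix = buildDocumentTermMatrix(['apple banana apple', 'banana cherry']);
console.log(sketchMatrix.vocabulary); // [ 'apple', 'banana', 'cherry' ]
console.log(sketchMatrix.data[0][0]); // 2 ("apple" occurs twice in the first document)
console.log(sketchMatrix.data[1][0]); // undefined ("apple" is absent, read as zero)
```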

.nDocs

The number of documents in the term matrix.

.nTerms

The number of distinct words appearing in the documents.

Methods

.findFreqTerms( n )

Returns all terms, in alphabetical order, which appear n or more times in the corpus. The return value is an array of objects of the form {word: "<word>", count: <number>}.
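
As a sketch of what this computation amounts to, the snippet below sums each column of a small matrix and keeps the terms that reach the threshold. The frequentTerms function and the toy counts are hypothetical and only mirror the documented return shape.

```javascript
// Hypothetical stand-in for a document-term matrix; counts are made up for illustration.
const freqExample = {
  vocabulary: ['pear', 'apple', 'fig'],
  data: [[1, 2], [2, 1, 1]] // row 0 has no entry for "fig"
};

function frequentTerms(matrix, n) {
  const result = [];
  matrix.vocabulary.forEach((word, col) => {
    // Sum the column over all documents; missing entries count as zero.
    const count = matrix.data.reduce((sum, row) => sum + (row[col] || 0), 0);
    if (count >= n) result.push({ word, count });
  });
  // Sort alphabetically by word.
  return result.sort((a, b) => a.word.localeCompare(b.word));
}

console.log(frequentTerms(freqExample, 3));
// [ { word: 'apple', count: 3 }, { word: 'pear', count: 3 } ]
```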

.removeSparseTerms( percent )

Removes all words from the document-term matrix which appear in fewer than percent of the documents.
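
The filtering rule can be illustrated as follows. This sketch assumes that percent is given as a fraction between 0 and 1, which the description above does not specify, and it returns a new dense object rather than modifying its input. The name dropSparseTerms is hypothetical.

```javascript
// Hypothetical helper; assumes `percent` is a fraction in [0, 1] (not stated in the docs).
function dropSparseTerms(matrix, percent) {
  const nDocs = matrix.data.length;
  // Collect the columns whose term occurs in at least `percent` of the documents.
  const keep = matrix.vocabulary
    .map((word, col) => col)
    .filter((col) => {
      const docFreq = matrix.data.filter((row) => (row[col] || 0) > 0).length;
      return docFreq / nDocs >= percent;
    });
  return {
    vocabulary: keep.map((col) => matrix.vocabulary[col]),
    data: matrix.data.map((row) => keep.map((col) => row[col] || 0))
  };
}

const pruned = dropSparseTerms(
  { vocabulary: ['a', 'b', 'c'], data: [[1, 1], [1], [1, 0, 1]] },
  0.5
);
console.log(pruned.vocabulary); // [ 'a' ]
```

Here "b" and "c" each occur in one of three documents, which is below the assumed threshold of one half, so only "a" remains.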

.weighting( fun )

Apply a weighting scheme to the entries of the document-term matrix. The weighting method expects a function as its argument, which is then applied to each entry of the document-term matrix. Currently, the function weightTfIdf, which calculates the term-frequency inverse-document-frequency (TfIdf) for each word, is the only built-in weighting function.

.fill_zeros()

Turns the document-term matrix into a non-sparse matrix by replacing each undefined value with zero, and saves the result.
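
A small sketch of the same idea on a plain nested array is shown below. The fillZeros function is a hypothetical helper and returns a new array instead of saving the result on a matrix object.

```javascript
// Hypothetical helper; mirrors the idea of fill_zeros on a plain nested array.
function fillZeros(data, nTerms) {
  return data.map((row) =>
    // Array.from visits every index up to nTerms, including holes,
    // whereas Array.prototype.map would skip them.
    Array.from({ length: nTerms }, (_, col) => (row[col] === undefined ? 0 : row[col]))
  );
}

const sparseRows = [[2, 1], [, 1, 1]];
console.log(fillZeros(sparseRows, 3)); // [ [ 2, 1, 0 ], [ 0, 1, 1 ] ]
```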

Utils

The module exports several other utility functions.

.expandContractions( str )

Replaces all occurring English contractions with their expanded equivalents, e.g. "don't" is changed to "do not". The resulting string is returned.

.weightTfIdf( terms )

Weights document-term or term-document matrix terms by term frequency - inverse document frequency. Mutates the input DocumentTermMatrix or TermDocumentMatrix object.
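
For readers unfamiliar with the weighting, the sketch below applies one common TF-IDF variant, count * ln(nDocs / docFreq), to a document-term layout. The exact formula used by the library (logarithm base, normalization) is not stated here and may differ. The tfIdf function is a hypothetical helper that returns a new dense array instead of mutating its input.

```javascript
// Hypothetical helper using one common TF-IDF variant: count * ln(nDocs / docFreq).
// The formula used by text-miner itself may differ (log base, normalization).
function tfIdf(data, nTerms) {
  const nDocs = data.length;
  // Document frequency: the number of documents containing each term.
  const docFreq = Array.from({ length: nTerms }, (_, col) =>
    data.filter((row) => (row[col] || 0) > 0).length
  );
  return data.map((row) =>
    Array.from({ length: nTerms }, (_, col) => {
      const tf = row[col] || 0;
      return tf === 0 ? 0 : tf * Math.log(nDocs / docFreq[col]);
    })
  );
}

const weighted = tfIdf([[2, 1], [0, 1, 1]], 3);
// The second term occurs in both documents, so its weight is 0 everywhere.
console.log(weighted[0][1], weighted[1][1]); // 0 0
```

Terms that occur in every document carry no discriminating information under this variant, which is why their weight drops to zero.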

Data

.STOPWORDS

An object with four keys: DE, EN, ES and IT, each of which is an array of stopwords for the German, English, Spanish and Italian language, respectively.

{
	"EN": [
		"a",
		"a's",
		"able",
		"about",
		"above",
		// (...)  
	],
	"DE": [
		// (...)
	],
	// (...)
}

.CONTRACTIONS

The keys of the CONTRACTIONS object are the contracted expressions and the corresponding values are arrays of the possible expansions.

{
	"ain't": ["am not", "are not", "is not", "has not","have not"],
	"aren't": ["are not", "am not"],
	"can't": ["cannot"],
	// (...)
}
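
A lookup table of this shape could be used as in the following sketch. The expandWith function and the sampleContractions table are hypothetical, and always taking the first listed expansion is an assumption; how the library chooses among several expansions is not documented above.

```javascript
// Hypothetical table in the same shape as CONTRACTIONS (contraction -> possible expansions).
const sampleContractions = {
  "can't": ['cannot'],
  "don't": ['do not']
};

// Assumption: always pick the first listed expansion.
function expandWith(str, table) {
  return str
    .split(' ')
    .map((token) => {
      const key = token.toLowerCase();
      // Own-property check avoids accidental hits on inherited keys such as "constructor".
      return Object.prototype.hasOwnProperty.call(table, key) ? table[key][0] : token;
    })
    .join(' ');
}

console.log(expandWith("I don't know and can't guess", sampleContractions));
// I do not know and cannot guess
```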

Unit Tests

Run tests via the command npm test


License

MIT license.
