dclassify

dclassify is an optimized Naive Bayesian classifier for Node.js that goes one step further than an ordinary binary classifier by introducing a unique "probability of absence" feature. In some test cases this has led to a ~10% improvement over conventional binary classifiers. It is designed for classifying items based on a limited set of characteristics (aka "tokens") rather than for general language processing.

Intro to Machine Learning with Node.JS

View slides from a talk presented at OttawaJS.

Optimization

The "probability of absence" optimization can be enabled with the applyInverse option. When this option is set to true, dclassify will calculate probabilities based on the present tokens as well as the inverse - a probability of absence for tokens that are not present. This is unconventional but can produce better results when classifying items based on a limited set of characteristics, especially when some of those characteristics are nearly always present in one of the categories.

Most binary classifiers work by looking for specific tokens to be present. For example, an email spam filter might categorize emails by looking for words that are considered spam-related. It cares about the words that are present, but not about words that are absent, because there are just too many of them.

In other cases we care if important key ingredients are missing. For example, an animal without wings is most likely not a bird. And a mobile app without internet connectivity is most likely not malware. Such "prevalent negatives" can be quite effective if the total set of tokens is fairly small (say, a few hundred items) and includes such key ingredients. Using the applyInverse option to look for prevalent negatives can significantly improve the results in such cases.
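
The idea can be sketched in a few lines. This is a simplified illustration, not dclassify's actual source: the `score` function, the `tokenProbs` table and its numbers are all hypothetical. It multiplies the probabilities of the tokens that are present and, when `applyInverse` is set, also multiplies by `1 - p` for every known token that is absent.

```javascript
// Hypothetical token probabilities per category (illustrative numbers only).
const tokenProbs = {
    bird:   { wings: 0.95, tail: 0.5 },
    mammal: { wings: 0.05, tail: 0.5 }
};

// Multiply the probabilities of the tokens that are present. With applyInverse,
// also multiply by (1 - p) for every known token that is absent.
function score(tokens, probs, applyInverse) {
    let result = 1;
    for (const [token, p] of Object.entries(probs)) {
        if (tokens.includes(token)) {
            result *= p;
        } else if (applyInverse) {
            result *= 1 - p;
        }
    }
    return result;
}

const animal = ['tail']; // an animal with a tail but no wings

for (const applyInverse of [false, true]) {
    const scores = {};
    for (const category of Object.keys(tokenProbs)) {
        scores[category] = score(animal, tokenProbs[category], applyInverse);
    }
    console.log('applyInverse =', applyInverse, scores);
}
```

Without the inverse, both categories score the same because only "tail" is considered. With it, the missing "wings" token pulls the bird score down to roughly 0.025 while the mammal score stays near 0.475.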

Installation

npm install dclassify

Usage

Require the classifier and reference its utilities.
Create Document instances with names and an array of tokens representing the document's characteristics.
Add document instances to a DataSet using appropriate categories.
Create and train a classifier using the DataSet.
Test the classifier using a test Document.

    // module dependencies
    var dclassify = require('dclassify');

    // Utilities provided by dclassify
    var Classifier = dclassify.Classifier;
    var DataSet    = dclassify.DataSet;
    var Document   = dclassify.Document;
    
    // create some 'bad' training items (name, array of characteristics)
    var item1 = new Document('item1', ['a','b','c']);
    var item2 = new Document('item2', ['a','b','c']);
    var item3 = new Document('item3', ['a','d','e']);

    // create some 'good' items (name, characteristics)
    var itemA = new Document('itemA', ['c', 'd']);
    var itemB = new Document('itemB', ['e']);
    var itemC = new Document('itemC', ['b','d','e']);

    // create a DataSet and add the items to their appropriate categories
    // this is 'curated' data for training
    var data = new DataSet();
    data.add('bad',  [item1, item2, item3]);    
    data.add('good', [itemA, itemB, itemC]);
    
    // an optimisation for working with small vocabularies
    var options = {
        applyInverse: true
    };
    
    // create a classifier
    var classifier = new Classifier(options);
    
    // train the classifier
    classifier.train(data);
    console.log('Classifier trained.');
    console.log(JSON.stringify(classifier.probabilities, null, 4));
    
    // test the classifier on a new test item
    var testDoc = new Document('testDoc', ['b','d', 'e']);    
    var result1 = classifier.classify(testDoc);
    console.log(result1);

The classification result printed by the final console.log will be:

    {
        "category": "good",
        "probability": 1,
        "timesMoreLikely": "Infinity",
        "secondCategory": "bad",
        "probabilities": [
            { "category": "good", "probability": 0.09876543209876543 },
            { "category": "bad", "probability": 0 }
        ]
    }

Probabilities

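After training on the data above, classifier.probabilities holds, for each category, the fraction of that category's documents that contain each token: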

    {
        "bad": {
            "a": 1,
            "b": 0.6666666666666666,
            "c": 0.6666666666666666,
            "d": 0.3333333333333333,
            "e": 0.3333333333333333
        },
        "good": {
            "a": 0,
            "b": 0.3333333333333333,
            "c": 0.3333333333333333,
            "d": 0.6666666666666666,
            "e": 0.6666666666666666
        }
    }
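
The numbers in this table are consistent with a simple per-category document count. The sketch below reproduces them from the training data in the usage example. It is an assumption about how the values arise, not a copy of dclassify's code, and `tokenProbabilities` is a hypothetical helper name.

```javascript
// Training documents from the usage example, as plain token arrays per category.
const training = {
    bad:  [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'd', 'e']],
    good: [['c', 'd'], ['e'], ['b', 'd', 'e']]
};

// For each category, compute the fraction of its documents containing each token.
function tokenProbabilities(data) {
    // The vocabulary is every distinct token seen in any category.
    const vocabulary = [...new Set(Object.values(data).flat(2))].sort();
    const result = {};
    for (const [category, docs] of Object.entries(data)) {
        result[category] = {};
        for (const token of vocabulary) {
            const containing = docs.filter(doc => doc.includes(token)).length;
            result[category][token] = containing / docs.length;
        }
    }
    return result;
}

const probs = tokenProbabilities(training);
console.log(JSON.stringify(probs, null, 4));
```

For instance, "a" appears in all three bad documents and in no good document, which gives the 1 and 0 entries that drive the applyInverse result.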

Output

Standard results (without the applyInverse: true option) would look like this:

    {
        "category": "good",
        "probability": 0.6666666666666666,
        "timesMoreLikely": 2,
        "secondCategory": "bad",
        "probabilities": [
            { "category": "good", "probability": 0.14814814814814814},
            { "category": "bad", "probability": 0.07407407407407407}
        ]
    }

If you use the applyInverse: true option, the result is much more emphatic: in training, every bad item contained the "a" token, so a test document that lacks "a" gets a probability of 0 for the bad category.

    {
        "category": "good",
        "probability": 1,
        "timesMoreLikely": "Infinity",
        "secondCategory": "bad",
        "probabilities": [
            { "category": "good", "probability": 0.09876543209876543 },
            { "category": "bad", "probability": 0 }
        ]
    }
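
The summary fields in both results can be derived from the raw per-category scores. The sketch below is inferred from the numbers shown here rather than taken from dclassify's source, and `summarize` is a hypothetical helper. It assumes `probability` is the top score divided by the sum of all scores and `timesMoreLikely` is the ratio of the top two scores.

```javascript
// Build a result object from raw per-category scores (two or more categories).
function summarize(scores) {
    const ranked = Object.entries(scores)
        .map(([category, probability]) => ({ category, probability }))
        .sort((x, y) => y.probability - x.probability);
    const total = ranked.reduce((sum, r) => sum + r.probability, 0);
    return {
        category: ranked[0].category,
        probability: ranked[0].probability / total,
        timesMoreLikely: ranked[0].probability / ranked[1].probability,
        secondCategory: ranked[1].category,
        probabilities: ranked
    };
}

// Raw scores for the test document ['b', 'd', 'e'].
// Standard: good = 1/3 * 2/3 * 2/3 = 4/27, bad = 2/3 * 1/3 * 1/3 = 2/27.
const standard = summarize({ good: 4 / 27, bad: 2 / 27 });

// applyInverse also multiplies by (1 - p) for the absent tokens 'a' and 'c':
// good = 4/27 * (1 - 0) * (1 - 1/3) = 8/81, bad = 2/27 * (1 - 1) * (1 - 2/3) = 0.
const inverse = summarize({ good: 8 / 81, bad: 0 });

console.log(standard);
console.log(inverse);
```

In this sketch timesMoreLikely comes out as the JavaScript number Infinity for the second case, whereas the output above shows it as the string "Infinity".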
