• This repository has been archived on 17/Jan/2018
  • Language
    Ruby
  • License
    MIT License
  • Created almost 13 years ago
  • Updated almost 7 years ago

Repository Details

Simple text classifier(s) implementation in Ruby

stuff-classifier

No longer maintained

This repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that a link to it can be placed here.

Description

A library for classifying text into multiple categories.

Currently provided classifiers:

  • Naive Bayes (StuffClassifier::Bayes)
  • Tf-Idf (StuffClassifier::TfIdf)

I ran a benchmark on 1345 items that I had previously classified by hand, each with multiple categories. Here is the rate at which each of the 2 algorithms correctly detected one of those categories:

  • Bayes: 79.26%
  • Tf-Idf: 81.34%
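
As a rough illustration of how such a rate can be computed, here is a minimal sketch. It assumes each benchmark item carries a set of hand-assigned categories and counts a prediction as correct when it matches any of them. The `hit_rate` helper and the sample data are hypothetical; they are not part of the gem or of the original benchmark.

```ruby
require 'set'

# Hypothetical helper: fraction of items whose predicted category
# is among the categories assigned by hand.
def hit_rate(items)
  return 0.0 if items.empty?
  hits = items.count { |item| item[:expected].include?(item[:predicted]) }
  hits.to_f / items.size
end

# Made-up sample data, for illustration only.
items = [
  { expected: Set[:sports, :activities], predicted: :sports },
  { expected: Set[:food],                predicted: :travel },
  { expected: Set[:travel, :activities], predicted: :activities },
  { expected: Set[:electronics],         predicted: :electronics }
]

puts format('%.2f%%', hit_rate(items) * 100)   # prints 75.00%
```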

I prefer the Naive Bayes approach because, despite scoring lower on this benchmark, it seems to make better decisions than I did in many cases. For example, an item titled "Paintball Session, 100 Balls and Equipment" was classified as "Activities" by me, but the Bayes classifier identified it as "Sports", at which point I had an intellectual orgasm. The Tf-Idf classifier seems to do better on clear-cut cases, but doesn't seem to handle uncertainty as well. Of course, these are just quick tests I made, and I have no idea which one is really better.
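
For readers unfamiliar with the approach, here is a minimal sketch of a naive Bayes text classifier with add-one (Laplace) smoothing. It is a textbook illustration and not the gem's implementation; the `TinyBayes` class, its tokenizer and the training sentences are all hypothetical.

```ruby
# Textbook multinomial naive Bayes with add-one smoothing.
# NOT the gem's code; TinyBayes is a hypothetical name.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                            # category => trained texts
    @vocabulary  = {}                                     # every word seen in training
  end

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # Log-probability score of the text for one category:
  # log prior plus the sum of smoothed log word likelihoods.
  def score(category, text)
    total_docs  = @doc_counts.values.sum
    total_words = @word_counts[category].values.sum
    log_prior   = Math.log(@doc_counts[category].to_f / total_docs)
    tokenize(text).sum(log_prior) do |word|
      Math.log((@word_counts[category][word] + 1.0) / (total_words + @vocabulary.size))
    end
  end

  def classify(text)
    @doc_counts.keys.max_by { |category| score(category, text) }
  end
end

nb = TinyBayes.new
nb.train(:dog, "dogs bark and fetch sticks")
nb.train(:cat, "cats purr and chase mice")
puts nb.classify("my pet likes to purr").inspect   # prints :cat
```

Working with log-probabilities avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps unseen words from zeroing out a category.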

Install

gem install stuff-classifier

Usage

You either instantiate one class or the other. Both have the same signature:

require 'stuff-classifier'

# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")

# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")

# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)

# also by default, the parsing phase filters out stop words; to
# disable this or to supply your own list of stop words, you can
# do this on a classifier instance:
cls.ignore_words = [ 'the', 'my', 'i', 'dont' ]
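
To make the effect of a stop-word list concrete, here is a minimal sketch of a parsing step that drops ignored words. It shows the general technique only and is not the gem's actual tokenizer; `tokens_for` is a hypothetical helper, and stemming is left out.

```ruby
# Hypothetical helper, not part of the gem: split text into lowercase
# words and drop the ones in the ignore list. No stemming is done here.
def tokens_for(text, ignore_words = [])
  text.downcase
      .gsub(/[^a-z\s]/, '')   # strip punctuation, so "don't" becomes "dont"
      .split
      .reject { |word| ignore_words.include?(word) }
end

ignore = ['the', 'my', 'i', 'dont']
p tokens_for("I don't like the neighbor's cat", ignore)
# prints ["like", "neighbors", "cat"]
```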

Training the classifier:

cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

And finally, classifying stuff:

cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat

cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?") 
#=> :dog
cls.classify("What pet will I love more?")    
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.") 
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
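
For intuition about how a Tf-Idf based classifier might score a text against trained categories, here is a sketch of one common variant: each category is treated as a single document, and a text is scored by summing tf × idf over its words. The `TinyTfIdf` class and the smoothed idf formula are assumptions chosen for illustration; the gem's own weighting may differ.

```ruby
# One common tf-idf variant, treating each category as one document.
# NOT the gem's code; TinyTfIdf is a hypothetical name.
class TinyTfIdf
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
  end

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  def train(category, text)
    tokenize(text).each { |word| @counts[category][word] += 1 }
  end

  # Term frequency: share of the category's words that are this word.
  def tf(category, word)
    total = @counts[category].values.sum
    total.zero? ? 0.0 : @counts[category][word].to_f / total
  end

  # Smoothed inverse document frequency: words found in fewer
  # categories get a higher weight.
  def idf(word)
    containing = @counts.count { |_, words| words.key?(word) }
    Math.log((1.0 + @counts.size) / (1.0 + containing)) + 1.0
  end

  def score(category, text)
    tokenize(text).sum(0.0) { |word| tf(category, word) * idf(word) }
  end

  def classify(text)
    @counts.keys.max_by { |category| score(category, text) }
  end
end

ti = TinyTfIdf.new
ti.train(:dog, "dogs bark and fetch sticks")
ti.train(:cat, "cats purr and chase mice")
puts ti.classify("purr and chase").inspect   # prints :cat
```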

Persistency

The following layers for saving the training data between sessions are implemented:

  • in memory (by default)
  • on disk
  • Redis
  • (coming soon) in an RDBMS

To persist the data in Redis, you can do this:

# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)

# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})

To persist the data on disk, you can do this:

store = StuffClassifier::FileStorage.new(@storage_path)

# global setting
StuffClassifier::Base.storage = store

# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)

# after training is done, to persist the data ...
cls.save_state

# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
  # when done, save_state is called on END
end

# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
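
The block form of `open` follows a familiar Ruby pattern: yield the object, then save when the block finishes. Below is a minimal sketch of that pattern using a hypothetical `TinyClassifier` class. It uses `ensure`, so the state is saved even if the block raises; whether the gem behaves the same way on errors is not stated here.

```ruby
# Hypothetical class illustrating the "open with a block" pattern;
# it is not the gem's implementation.
class TinyClassifier
  attr_reader :name, :trained

  def initialize(name)
    @name    = name
    @trained = []
    @saved   = false
  end

  def train(category, text)
    @trained << [category, text]
  end

  def save_state
    @saved = true   # a real classifier would write to its storage here
  end

  def saved?
    @saved
  end

  # Yields a new instance and always calls save_state afterwards.
  def self.open(name)
    instance = new(name)
    yield instance
    instance
  ensure
    instance.save_state if instance
  end
end

cls = TinyClassifier.open("Cats or Dogs") do |c|
  c.train(:dog, "I love my dog")
end
puts cls.saved?   # prints true
```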

The name you give your classifier is important, because the data is loaded and saved based on it. For instance, the following 3 classifiers are stored in different buckets and are independent of each other.

cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")	
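
To illustrate the idea of name-keyed buckets, here is a minimal sketch of an on-disk store that keeps one entry per classifier name in a single file. `TinyFileStore` is a hypothetical class built on Ruby's `Marshal`; it is not the gem's `FileStorage`, and the stored hashes are made-up stand-ins for training data.

```ruby
require 'tmpdir'

# Hypothetical store, not the gem's FileStorage: keeps one bucket of
# data per classifier name in a single Marshal-encoded file.
class TinyFileStore
  def initialize(path)
    @path = path
  end

  def write(name, data)
    buckets = read_all
    buckets[name] = data
    File.binwrite(@path, Marshal.dump(buckets))
  end

  def read(name)
    read_all.fetch(name, {})
  end

  def purge(name)
    buckets = read_all
    buckets.delete(name)
    File.binwrite(@path, Marshal.dump(buckets))
  end

  private

  # Marshal.load should only ever be used on files you trust.
  def read_all
    File.exist?(@path) ? Marshal.load(File.binread(@path)) : {}
  end
end

Dir.mktmpdir do |dir|
  store = TinyFileStore.new(File.join(dir, 'classifiers.db'))
  store.write("Cats or Dogs", { dog: 5, cat: 4 })
  store.write("Spam or Ham",  { spam: 10 })

  p store.read("Cats or Dogs")   # only this classifier's bucket
  store.purge("Cats or Dogs")
  p store.read("Cats or Dogs")   # empty again after purging
  p store.read("Spam or Ham")    # untouched by the purge
end
```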

License

MIT Licensed. See LICENSE.txt for details.
