• Stars
    star
    100
  • Rank 338,647 (Top 7 %)
  • Language
    Ruby
  • License
    GNU General Publi...
  • Created almost 12 years ago
  • Updated almost 9 years ago

Repository Details

Text classifier in Ruby that uses Hadoop/HBase, Mongo, or Cassandra for storage. New location for http://github.com/livingsocial/ankusa

ankusa

Ankusa is a text classifier in Ruby that can use Hadoop’s HBase, Mongo, or Cassandra for storage. Because it uses HBase/Mongo/Cassandra as a backend, the training corpus can be many terabytes in size (in-memory and single-file storage are also available for smaller corpora).

Ankusa currently provides both a Naive Bayes classifier and a Kullback-Leibler divergence classifier. It ignores common words (a.k.a. stop words) and stems all others. Additionally, it uses additive smoothing in both classification methods.
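
As a rough illustration of what additive smoothing does, here is a minimal sketch in plain Ruby. It is not Ankusa's internal code: the constants, the method name, the word counts, and the smoothing constant alpha = 1 are all assumptions made up for the example.

```ruby
# Minimal sketch of additive (Laplace) smoothing for word likelihoods.
# All names and counts are illustrative; this is not Ankusa's own code.

# Stemmed word counts per class, as a training step might accumulate them.
COUNTS = {
  spam: { "spammi" => 3, "text" => 1 },
  good: { "bad" => 1, "stuff" => 2, "text" => 1 }
}.freeze

# Every distinct word seen in any class.
VOCABULARY = COUNTS.values.flat_map(&:keys).uniq.freeze

# P(word | klass) with additive smoothing: alpha is added to every count,
# so a word never seen in a class still gets a non-zero probability.
def smoothed_likelihood(word, klass, alpha = 1.0)
  class_counts = COUNTS.fetch(klass)
  total = class_counts.values.sum
  (class_counts.fetch(word, 0) + alpha) / (total + alpha * VOCABULARY.size)
end

puts smoothed_likelihood("spammi", :spam) # word seen in :spam
puts smoothed_likelihood("stuff", :spam)  # word unseen in :spam, still > 0
```

Without the added alpha, a single unseen word would force the whole product of likelihoods for that class to zero.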

Installation

Add this line to your application’s Gemfile:

gem 'ankusa'

If you’re using the HBase, Cassandra, or MongoDB backend, also add the corresponding dependency gem to your Gemfile:

gem 'hbaserb'
# or
gem 'cassandra'
# or
gem 'mongo'

If you’re using HBase, make sure the HBase Thrift interface has been started as well.

Basic Usage

Using the naive Bayes classifier:

require 'ankusa'
require 'ankusa/hbase_storage'

# connect to HBase.  Alternatively, just for this test, use in-memory storage with
# storage = Ankusa::MemoryStorage.new
storage = Ankusa::HBaseStorage.new 'localhost'
c = Ankusa::NaiveBayesClassifier.new storage

# Each of these calls will return a bag-of-words
# hash with stemmed words as keys and counts as values
c.train :spam, "This is some spammy text"
c.train :good, "This is not the bad stuff"

# This will return the most likely class (as symbol)
puts c.classify "This is some spammy text"

# This will return a Hash with classes as keys and
# membership probabilities as values
puts c.classifications "This is some spammy text"

# If you have a large corpus, the probabilities will
# likely all be 0.  In that case, you must use log
# likelihood values
puts c.log_likelihoods "This is some spammy text"

# get a list of all classes
puts c.classnames

# close connection
storage.close
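
The comment above about probabilities collapsing to 0 on a large corpus comes down to floating-point underflow. The sketch below shows the effect in plain Ruby. The method names and the numbers (500 words, each with likelihood 0.001) are invented for illustration and are not part of Ankusa's API.

```ruby
# Sketch of why raw probabilities collapse to 0 for long inputs and how
# working in log space avoids it. Names and numbers are made up.

# Multiply per-word likelihoods directly, as a naive implementation would.
def raw_probability(likelihoods)
  likelihoods.reduce(1.0) { |acc, prob| acc * prob }
end

# Sum the logarithms instead; the ordering between classes is preserved.
def log_likelihood(likelihoods)
  likelihoods.sum { |prob| Math.log(prob) }
end

likelihoods = Array.new(500, 1.0e-3) # 500 words, each P(word | class) = 0.001

puts raw_probability(likelihoods) # 0.0, since 1e-1500 is below the smallest Float
puts log_likelihood(likelihoods)  # a finite negative number
```

Because the logarithm is monotonic, the class with the largest log likelihood is also the class with the largest probability, so comparisons between classes still work.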

KL Divergence Classifier

There is a Kullback–Leibler divergence classifier as well. KL divergence is a distance measure (though not a true metric, because it is not symmetric and does not satisfy the triangle inequality). The KL classifier simply measures the relative entropy between the text you want to classify and each of the classes. The class with the shortest “distance” is the best class. You may find that for an especially large corpus it is slightly faster to use this classifier (since prior probabilities are never calculated, only likelihoods).

The API is the same as the NaiveBayesClassifier, except that if you want actual numbers you call “distances” rather than “classifications”.

require 'ankusa'
require 'ankusa/hbase_storage'

# connect to HBase
storage = Ankusa::HBaseStorage.new 'localhost'
c = Ankusa::KLDivergenceClassifier.new storage

# Each of these calls will return a bag-of-words
# hash with stemmed words as keys and counts as values
c.train :spam, "This is some spammy text"
c.train :good, "This is not the bad stuff"

# This will return the most likely class (as symbol)
puts c.classify "This is some spammy text"

# This will return a Hash with classes as keys and
# distances >= 0 as values
puts c.distances "This is some spammy text"

# get a list of all classes
puts c.classnames

# close connection
storage.close
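
To make the “distance” idea concrete, here is a minimal plain-Ruby sketch of relative entropy between a document and two classes. It is not Ankusa's implementation: the method names, the tiny vocabulary, the word counts, and the choice of alpha = 1 smoothing are assumptions for the example.

```ruby
# Sketch of KL divergence D(P || Q) = sum over words of P(w) * log(P(w) / Q(w)).
# Illustrative only; names, counts, and smoothing are assumptions.

# Turn raw word counts into a smoothed probability distribution over a
# shared vocabulary, so that no word has probability 0.
def distribution(counts, vocabulary, alpha = 1.0)
  total = counts.values.sum + alpha * vocabulary.size
  vocabulary.to_h { |word| [word, (counts.fetch(word, 0) + alpha) / total] }
end

# Relative entropy of p_dist with respect to q_dist. It is never negative
# and is 0 when the two distributions are identical.
def kl_divergence(p_dist, q_dist)
  p_dist.sum { |word, prob| prob * Math.log(prob / q_dist.fetch(word)) }
end

vocabulary = %w[spammi text bad stuff]
document   = distribution({ "spammi" => 1, "text" => 1 }, vocabulary)
classes    = {
  spam: distribution({ "spammi" => 3, "text" => 1 }, vocabulary),
  good: distribution({ "bad" => 1, "stuff" => 2, "text" => 1 }, vocabulary)
}

distances = classes.transform_values { |dist| kl_divergence(document, dist) }
puts distances
puts distances.min_by { |_klass, distance| distance }.first # class with the shortest distance
```

Swapping the arguments of kl_divergence generally gives a different number, which is one reason it is not a true metric.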

Storage Methods

Ankusa has a generalized storage interface that has been implemented for HBase, Cassandra, Mongo, single file, and in-memory storage.

Memory storage can be used when you have a very small corpus:

require 'ankusa/memory_storage'
storage = Ankusa::MemoryStorage.new

FileSystem storage can be used when you have a very small corpus and want to persist the classification results:

require 'ankusa/file_system_storage'
storage = Ankusa::FileSystemStorage.new '/path/to/file'
# Do classification ...
storage.save

The FileSystem storage does NOT save to the filesystem automatically; the #save method must be invoked to persist the results.

HBase storage:

require 'ankusa/hbase_storage'
# defaults: host='localhost', port=9090, frequency_tablename="ankusa_word_frequencies", summary_tablename="ankusa_summary"
storage = Ankusa::HBaseStorage.new host, port, frequency_tablename, summary_tablename

For Cassandra storage:

  • You will need Cassandra version 0.7.0-rc2 or greater.

  • You will need to set a maximum number of classes, since the current implementation of the Ruby Cassandra client doesn’t support table scans.

  • Prior to using the Cassandra storage you will need to run the following command from the cassandra-cli: “create keyspace ankusa with replication_factor = 1”. This should be fixed with a new release candidate for Cassandra.

To use the Cassandra storage class:

require 'ankusa/cassandra_storage'
# defaults: host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100
storage = Ankusa::CassandraStorage.new host, port, keyspace, max_classes

For MongoDB storage:

require 'ankusa/mongo_db_storage'
storage = Ankusa::MongoDbStorage.new :host => "localhost", :port => 27017, :db => "ankusa"
# defaults: :host => "localhost", :port => 27017, :db => "ankusa"
# no default username or password
# you can also use frequency_tablename and summary_tablename options

Running Tests

You can run the tests for any of the storage methods. For instance, for memory storage:

rake test_memory

For the other methods you will need to edit the file test/config.yml and set the configuration params. Then:

rake test_hbase
# or
rake test_cassandra
# or
rake test_filesystem
# or
rake test_mongo_db
