  • Stars: 169
  • Rank: 224,453 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created over 10 years ago
  • Updated about 5 years ago



Naive Bayesian Classifier

Yet another general-purpose naive Bayesian classifier.

## Installation

You can install this package using the following pip command:

$ sudo pip install naiveBayesClassifier

## Example

"""
Suppose you have a collection of news texts whose categories are known.
You want to train a system with these pre-categorized (pre-classified)
texts. This data is your training set.
"""
from naiveBayesClassifier import tokenizer
from naiveBayesClassifier.trainer import Trainer
from naiveBayesClassifier.classifier import Classifier

newsTrainer = Trainer(tokenizer.Tokenizer(stop_words = [], signs_to_remove = ["?!#%&"]))

# Train the system by passing each text, one by one, to the trainer.
newsSet = [
    {'text': 'not to eat too much is not enough to lose weight', 'category': 'health'},
    {'text': 'Russia is trying to invade Ukraine', 'category': 'politics'},
    {'text': 'do not neglect exercise', 'category': 'health'},
    {'text': 'Syria is the main issue, Obama says', 'category': 'politics'},
    {'text': 'eat to lose weight', 'category': 'health'},
    {'text': 'you should not eat much', 'category': 'health'}
]

for news in newsSet:
    newsTrainer.train(news['text'], news['category'])

# Once you have enough training data, you are almost done and can create
# a classifier.
newsClassifier = Classifier(newsTrainer.data, tokenizer.Tokenizer(stop_words = [], signs_to_remove = ["?!#%&"]))

# Now you have a classifier that can try to classify news text whose
# category is not yet known.
unknownInstance = "Even if I eat too much, is not it possible to lose some weight"
classification = newsClassifier.classify(unknownInstance)

# The classification variable holds the possible categories sorted by
# their probability value.
print(classification)

Note: You will need far more training data than the example above uses. A few lines of text like these are nowhere near a sufficient training set.

## What is the Naive Bayes Theorem and Classifier

There is no need to explain everything again here. Instead, one of the most eloquent explanations is quoted below.

The following explanation is quoted from another Bayes classifier which is written in Go.

BAYESIAN CLASSIFICATION REFRESHER: suppose you have a set of classes (e.g. categories) C := {C_1, ..., C_n}, and a document D consisting of words D := {W_1, ..., W_k}. We wish to ascertain the probability that the document belongs to some class C_j given some set of training data associating documents and classes.

By Bayes' Theorem, we have that

P(C_j|D) = P(D|C_j)*P(C_j)/P(D).

The LHS is the probability that the document belongs to class C_j given the document itself (by which is meant, in practice, the word frequencies occurring in this document), and our program will calculate this probability for each j and spit out the most likely class for this document.

P(C_j) is referred to as the "prior" probability, or the probability that a document belongs to C_j in general, without seeing the document first. P(D|C_j) is the probability of seeing such a document, given that it belongs to C_j. Here, by assuming that words appear independently in documents (this being the "naive" assumption), we can estimate

P(D|C_j) ~= P(W_1|C_j)*...*P(W_k|C_j)

where P(W_i|C_j) is the probability of seeing the given word in a document of the given class. Finally, P(D) can be seen as merely a scaling factor and is not strictly relevant to classification, unless you want to normalize the resulting scores and actually see probabilities. In this case, note that

P(D) = SUM_j(P(D|C_j)*P(C_j))
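As a rough illustration of the formulas above, the following sketch computes normalized posteriors for a toy two-class problem. The priors and word probabilities are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper, not part of this package.

```python
# Toy numbers, invented for illustration only.
priors = {"health": 0.6, "politics": 0.4}          # P(C_j)
word_probs = {                                      # P(W_i|C_j)
    "health": {"eat": 0.20, "weight": 0.10},
    "politics": {"eat": 0.01, "weight": 0.02},
}


def posterior(words, priors, word_probs):
    """Return P(C_j|D) for every class, normalized by P(D)."""
    joint = {}
    for cls, prior in priors.items():
        likelihood = 1.0
        for w in words:
            likelihood *= word_probs[cls][w]        # naive independence assumption
        joint[cls] = likelihood * prior             # P(D|C_j)*P(C_j)
    evidence = sum(joint.values())                  # P(D) = SUM_j(P(D|C_j)*P(C_j))
    return {cls: p / evidence for cls, p in joint.items()}


scores = posterior(["eat", "weight"], priors, word_probs)
best = max(scores, key=scores.get)
```

Because every score is divided by the same P(D), the normalized values sum to one, and the ranking of the classes is the same as for the unnormalized products.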

One practical issue with performing these calculations is the possibility of float64 underflow when calculating P(D|C_j), as individual word probabilities can be arbitrarily small, and a document can have an arbitrarily large number of them. A typical method for dealing with this case is to transform the probability to the log domain and perform additions instead of multiplications:

log P(C_j|D) ~ log(P(C_j)) + SUM_i(log P(W_i|C_j))

where i = 1, ..., k. Note that by doing this, we are discarding the scaling factor P(D) and our scores are no longer probabilities; however, the monotonic relationship of the scores is preserved by the log function.
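The underflow problem and the log-domain workaround described above can be sketched as follows. The probabilities are again made-up toy values, and `log_scores` is a hypothetical helper rather than this package's API. The sketch assumes every word has a non-zero probability in every class, since the logarithm of zero is undefined.

```python
import math


def log_scores(words, priors, word_probs):
    """Return log(P(C_j)) + SUM_i(log P(W_i|C_j)) for every class."""
    return {
        cls: math.log(prior) + sum(math.log(word_probs[cls][w]) for w in words)
        for cls, prior in priors.items()
    }


# Toy numbers, invented for illustration only.
priors = {"health": 0.6, "politics": 0.4}
word_probs = {"health": {"eat": 0.20}, "politics": {"eat": 0.01}}

# An artificially long "document" of 500 identical words.
doc = ["eat"] * 500

# Multiplying 500 small probabilities underflows to 0.0 in floating point...
naive_product = priors["health"] * (word_probs["health"]["eat"] ** len(doc))

# ...while the log-domain sums stay finite and can still be compared.
scores = log_scores(doc, priors, word_probs)
best = max(scores, key=scores.get)
```

The resulting scores are negative log values rather than probabilities, but the class with the largest score is still the most likely one.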

If you are very curious about Naive Bayes Theorem, you may find the following list helpful:

## Improvements

This classifier uses a very simple tokenizer, which is just a module that splits sentences into words. If your training set is large, you can rely on the available tokenizer; otherwise, you need a better tokenizer specialized for the language of your training texts.
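As a sketch of what a slightly more careful tokenizer might look like, the following standalone function lowercases the text, extracts word-like runs with a regular expression, and drops stop words. The name `simple_tokenize` and the stop-word list are hypothetical. This is not the package's `Tokenizer` class, and how a custom tokenizer would plug into the trainer depends on the package's interface, which is not shown here.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"is", "to", "the", "not"}

# \w+ matches runs of letters, digits, and underscores, so punctuation
# such as commas and question marks never ends up inside a token.
WORD_RE = re.compile(r"\w+")


def simple_tokenize(text, stop_words=STOP_WORDS):
    """Split text into lowercase word tokens, skipping stop words."""
    tokens = WORD_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]


tokens = simple_tokenize("Syria is the main issue, Obama says")
```

A language-specific tokenizer would typically go further, for example with stemming or a curated stop-word list for that language.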

TODO

  • inline docs
  • unit-tests

AUTHORS

  • Mustafa Atik @muatik
  • Nejdet Yucesoy @nejdetckenobi
