Tuesday, June 26, 2012

Machine Learning: Naive Bayes Classification with Ruby

Maybe you've wondered, "Where are all the Ruby libraries for Machine Learning and NLP?".  Despite Ruby's growing user base and ability to quickly manipulate data and text, there seems to be a dearth of tools for NLP and Machine Learning.  One statistical tool that finds itself in the intersection of NLP and Machine Learning is Naive Bayes.  [For those already familiar with Naive Bayes, you may wish to skip ahead to the Quickstart section below (although I promise not to be long-winded in my intro).]    

This peculiarly-named approach to classification tasks is based on the well-known Bayes' Rule and is used to calculate the probability an instance belongs in a particular class (or category) based on the components of the instance.   It is called "Naive" Bayes because of the manner in which it calculates the probabilities.  It treats each component as if it is independent from other components, even though this is usually not the case.  Suprisingly, Naive Bayes does quite well and is optimal in the case in which the components of an instance actually are independent. 

Enough generalities...Naive Bayes is used widely for text classification problems and the "components" that I referenced above are actually tokens -- typically words.  Even if you have no desire to understand the probabilistic engine beneath the hood, Naive Bayes is easy to use, high performance, and accurate relative to other classifiers.  It requires a 2 step process:

1) Train the classifier by providing it with sets of tokens (e.g., words) accompanied by a class (e.g., 'SPAM')
2) Run the trained classifier on un-classified (e.g., unlabelled) tokens and it will predict a class

Quickstart

gem install nbayes

After that, it's time to begin training the classifier:
# create new classifier instance
nbayes = NBayes::Base.new
# train it - notice split method used to tokenize text (more on that below)
nbayes.train( "You need to buy some Viagra".split(/\s+/), 'SPAM' )
nbayes.train( "This is not spam, just a letter to Bob.".split(/\s+/), 'HAM' )
nbayes.train( "Hey Oasic, Do you offer consulting?".split(/\s+/), 'HAM' )
nbayes.train( "You should buy this stock".split(/\s+/), 'SPAM' )


Finally, let's use it to classify a document:

# tokenize message
tokens = "Now is the time to buy Viagra cheaply and discreetly".split(/\s+/)
result = @nbayes.classify(tokens)
# print likely class (SPAM or HAM)
p result.max_class
# print probability of message being SPAM
p result['SPAM']
# print probability of message being HAM
p result['HAM']


But that's not all!  I'm claiming that this is a full-featured Naive Bayes implementation, so I better back that up with information about all the goodies.  Here we go:

Features
  • Works with all types of tokens, not just text.  Of course, because of this, we leave tokenization up to you.
  • Disk based persistence
  • Allows prior distribution on classes to be assumed uniform (optional)
  • Outputs probabilities, instead of just class w/max probability
  • Customizable constant value for Laplacian smoothing
  • Optional and customizable purging of low-frequency tokens (for performance)
  • Optional binarized mode to reduce the impact of repeated words
  • Uses log probabilities to avoid underflow

I hope to post examples of these features in action in a future post.  Until then, view nbayes_spec.rb for usage.



7 comments:

  1. Do you have to train before classification every time? Does/Can a trained classifier store it's data in Yaml or Memory.. or?

    ReplyDelete
    Replies
    1. No, an instance that is trained can be persisted to a YAML file using the #dump(yml_file) method and then instantiated using the from(yml_file) class method. See nbayes_spec.rb in the source code for example usage.

      Delete
  2. @J: You can save your trained model using the #dump(yml_file), and load using the from(yml_file). Only YAML is supported at this time, but adding other formats will be easy.

    ReplyDelete
  3. nice work, thanks

    ReplyDelete
  4. I guess you can serialize an object in a database to keep the training in memory

    ReplyDelete