Tuesday, June 26, 2012

Machine Learning: Naive Bayes Classification with Ruby

Maybe you've wondered, "Where are all the Ruby libraries for Machine Learning and NLP?".  Despite Ruby's growing user base and ability to quickly manipulate data and text, there seems to be a dearth of tools for NLP and Machine Learning.  One statistical tool that finds itself in the intersection of NLP and Machine Learning is Naive Bayes.  [For those already familiar with Naive Bayes, you may wish to skip ahead to the Quickstart section below (although I promise not to be long-winded in my intro).]    

This peculiarly-named approach to classification tasks is based on the well-known Bayes' Rule and is used to calculate the probability that an instance belongs to a particular class (or category) based on the components of the instance.  It is called "Naive" Bayes because of the way it calculates those probabilities: it treats each component as if it were independent of the others, even though this is usually not the case.  Surprisingly, Naive Bayes performs quite well in practice, and it is optimal in the case where the components of an instance actually are independent.
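To make the "naive" calculation concrete, here is a minimal from-scratch sketch of the math in Ruby. This is purely illustrative (the nbayes gem handles all of this for you), and the token counts below are an invented toy training set, not output from the gem:

```ruby
# Toy per-class token counts from a hypothetical training set (invented data)
counts = {
  'SPAM' => { 'buy' => 3, 'viagra' => 2, 'stock' => 1 },
  'HAM'  => { 'letter' => 2, 'bob' => 1, 'consulting' => 1 }
}
class_totals = { 'SPAM' => 2, 'HAM' => 2 } # documents seen per class

# log P(class) + sum of log P(token|class), with add-one (Laplace) smoothing
def score(tokens, klass, counts, class_totals)
  vocab        = counts.values.flat_map(&:keys).uniq.size
  total_docs   = class_totals.values.sum
  total_tokens = counts[klass].values.sum
  log_prob = Math.log(class_totals[klass].to_f / total_docs)
  tokens.each do |t|
    log_prob += Math.log((counts[klass].fetch(t, 0) + 1.0) / (total_tokens + vocab))
  end
  log_prob
end

tokens = %w[buy viagra]
# The class with the higher (less negative) log-score wins:
puts score(tokens, 'SPAM', counts, class_totals) >
     score(tokens, 'HAM', counts, class_totals)
```

Each token contributes its conditional probability independently, which is exactly the "naive" independence assumption described above.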

Enough generalities...Naive Bayes is widely used for text classification problems, and the "components" I referenced above are actually tokens -- typically words.  Even if you have no desire to understand the probabilistic engine under the hood, Naive Bayes is easy to use, fast, and accurate relative to other classifiers.  It requires a two-step process:

1) Train the classifier by providing it with sets of tokens (e.g., words), each accompanied by a class (e.g., 'SPAM')
2) Run the trained classifier on unclassified (i.e., unlabeled) tokens, and it will predict a class


Quickstart

First, install the gem:

gem install nbayes

After that, it's time to begin training the classifier:
# create new classifier instance
nbayes = NBayes::Base.new
# train it - notice split method used to tokenize text (more on that below)
nbayes.train( "You need to buy some Viagra".split(/\s+/), 'SPAM' )
nbayes.train( "This is not spam, just a letter to Bob.".split(/\s+/), 'HAM' )
nbayes.train( "Hey Oasic, Do you offer consulting?".split(/\s+/), 'HAM' )
nbayes.train( "You should buy this stock".split(/\s+/), 'SPAM' )

Finally, let's use it to classify a document:

# tokenize message
tokens = "Now is the time to buy Viagra cheaply and discreetly".split(/\s+/)
result = nbayes.classify(tokens)
# print likely class (SPAM or HAM)
p result.max_class
# print probability of message being SPAM
p result['SPAM']
# print probability of message being HAM
p result['HAM']

But that's not all!  I'm claiming that this is a full-featured Naive Bayes implementation, so I better back that up with information about all the goodies.  Here we go:

  • Works with all types of tokens, not just text.  Of course, because of this, we leave tokenization up to you.
  • Disk based persistence
  • Allows prior distribution on classes to be assumed uniform (optional)
  • Outputs probabilities, instead of just class w/max probability
  • Customizable constant value for Laplacian smoothing
  • Optional and customizable purging of low-frequency tokens (for performance)
  • Optional binarized mode to reduce the impact of repeated words
  • Uses log probabilities to avoid underflow
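That last point is worth a quick illustration.  Multiplying many small per-token probabilities underflows to zero in 64-bit floats, which would make every class score indistinguishable; summing logs instead keeps the scores comparable.  This is a generic sketch of the problem, not the gem's internals:

```ruby
# A thousand per-token probabilities of ~0.001 multiply down to 0.0
# in floating point, but their logs sum without any trouble.
probs = [0.001] * 1000

product = probs.reduce(1.0) { |acc, p| acc * p }
log_sum = probs.sum { |p| Math.log(p) }

puts product  # 0.0 -- underflow; the score carries no information
puts log_sum  # roughly -6907.76; still usable for comparing classes
```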

I hope to post examples of these features in action in a future post.  Until then, view nbayes_spec.rb for usage.