Wednesday, September 12, 2012

Quick Solutions: Rails 3, send_data, and garbled PDF output

This is just a quick blog entry to help anyone experiencing garbled PDF output (in Safari) when using Rails 3's send_data to output dynamically-generated PDF files. 

If you are following the documentation, then you are probably outputting something like this:

    send_data data, :filename => "myfile.pdf",
                    :type => 'application/pdf'

I don't know if this is specific to Rails 3, but the issue is that the Content-Type header is not being set to 'application/pdf', so setting it explicitly in the response should fix this issue:

    response.headers['Content-Type'] = 'application/pdf'
    send_data data, :filename => "myfile.pdf",
                    :type => 'application/pdf'

Tuesday, June 26, 2012

Machine Learning: Naive Bayes Classification with Ruby

Maybe you've wondered, "Where are all the Ruby libraries for Machine Learning and NLP?".  Despite Ruby's growing user base and ability to quickly manipulate data and text, there seems to be a dearth of tools for NLP and Machine Learning.  One statistical tool that finds itself in the intersection of NLP and Machine Learning is Naive Bayes.  [For those already familiar with Naive Bayes, you may wish to skip ahead to the Quickstart section below (although I promise not to be long-winded in my intro).]    

This peculiarly-named approach to classification tasks is based on the well-known Bayes' Rule and is used to calculate the probability that an instance belongs to a particular class (or category) based on the components of the instance.  It is called "Naive" Bayes because of the manner in which it calculates the probabilities: it treats each component as if it were independent of the other components, even though this is usually not the case.  Surprisingly, Naive Bayes does quite well in practice, and it is optimal in the case where the components of an instance actually are independent.

Enough generalities...Naive Bayes is used widely for text classification problems, and the "components" that I referenced above are actually tokens -- typically words.  Even if you have no desire to understand the probabilistic engine beneath the hood, Naive Bayes is easy to use, high performance, and accurate relative to other classifiers.  It requires a two-step process:

1) Train the classifier by providing it with sets of tokens (e.g., words) accompanied by a class (e.g., 'SPAM')
2) Run the trained classifier on unclassified (i.e., unlabeled) tokens and it will predict a class


Quickstart

First, install the gem:

gem install nbayes

After that, it's time to begin training the classifier:
require 'nbayes'

# create new classifier instance
nbayes =
# train it - notice split method used to tokenize text (more on that below)
nbayes.train( "You need to buy some Viagra".split(/\s+/), 'SPAM' )
nbayes.train( "This is not spam, just a letter to Bob.".split(/\s+/), 'HAM' )
nbayes.train( "Hey Oasic, Do you offer consulting?".split(/\s+/), 'HAM' )
nbayes.train( "You should buy this stock".split(/\s+/), 'SPAM' )

Finally, let's use it to classify a document:

# tokenize message
tokens = "Now is the time to buy Viagra cheaply and discreetly".split(/\s+/)
result = nbayes.classify(tokens)
# print likely class (SPAM or HAM)
p result.max_class
# print probability of message being SPAM
p result['SPAM']
# print probability of message being HAM
p result['HAM']

But that's not all!  I'm claiming that this is a full-featured Naive Bayes implementation, so I better back that up with information about all the goodies.  Here we go:

  • Works with all types of tokens, not just text.  Of course, because of this, we leave tokenization up to you.
  • Disk based persistence
  • Allows prior distribution on classes to be assumed uniform (optional)
  • Outputs probabilities, instead of just class w/max probability
  • Customizable constant value for Laplacian smoothing
  • Optional and customizable purging of low-frequency tokens (for performance)
  • Optional binarized mode to reduce the impact of repeated words
  • Uses log probabilities to avoid underflow

I hope to post examples of these features in action in a future post.  Until then, view nbayes_spec.rb for usage.

Monday, April 16, 2012

Solutions to invalid byte sequence in UTF-8 (ArgumentError)

Let's keep this short and simple. You're reading this because you too are tired of all the new character set issues in Ruby 1.9. This page lists some possible solutions and links to other pages with helpful info on solving "invalid byte sequence in UTF-8 (ArgumentError)".  

The Cause
This problem occurs when your application is expecting one character encoding and gets a different one.  
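A quick sketch of how the mismatch manifests: the byte 0xE9 is 'é' in Windows-1252, but on its own it is not valid UTF-8, so a UTF-8-tagged string containing it blows up as soon as Ruby has to inspect its characters.

```ruby
# A string tagged as UTF-8 that actually contains a lone Windows-1252 byte:
str = "caf\xE9"
p str.valid_encoding?   # => false
# Any character-aware operation (regex match, split, etc.) now raises:
begin
  str.split(/\s+/)
rescue ArgumentError => e
  puts e.message        # => "invalid byte sequence in UTF-8"
end
```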

Possible Solutions:

1) Explicitly specify the encodings that you expect in the top-level Ruby Encoding class. Put this in a file that will be loaded (e.g., boot.rb for Rails):
Encoding.default_external = 'UTF-8'
Encoding.default_internal = 'UTF-8'

2) Convert the character encoding at a lower level within your application, e.g., with String#encode.
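Ruby 1.9's built-in String#encode can handle the conversion; here is a small sketch (assuming the incoming bytes are really Windows-1252) that substitutes a placeholder for anything invalid or untranslatable:

```ruby
# Bytes that arrived mislabeled but are really Windows-1252:
dirty = "caf\xE9".force_encoding('Windows-1252')
# Transcode to UTF-8, substituting '?' for invalid/unconvertible bytes:
clean = dirty.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
puts clean              # => "café"
p clean.valid_encoding? # => true
```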

3) Finally, if the text you are working with has a mix of character encodings (occasionally this happens when data is aggregated improperly), then you may need to decide what is the most common character set and convert from that to the new char set (e.g., UTF-8), while ignoring any other unrecognizable characters.

You can do this using Iconv from within Ruby or iconv from the command line.

Within Ruby:

require 'iconv'
ic ='UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)

From the command line:

iconv -f WINDOWS-1252 -t "UTF-8//IGNORE" some_text.txt > some_text.utf8.txt

4) Catch the error and handle it more gracefully:
begin
  # your code
rescue ArgumentError => e
  print "error: #{e}"
end

Thursday, January 19, 2012

Taking a Random Walk with Processing 1.5

If you've ever dabbled with simulations, you have probably come across Processing, the open source environment for animation, interaction, and much more.  The Java-ish scripting language allows for quick prototyping, and the library of examples allows even beginners (like me) to get moving quickly.

To the right is an image I created from a very simple simulation of a "random walk" along the y-axis as time moved along the x-axis.  Many have noted the similarity between this process and the peaks and valleys of markets, mountain ranges, and other stochastic processes.  If you haven't already, give it a try.

Source for Random Walk:
int x, y, r, middle, randOffset, previousX, previousY;

void setup() {
  size(900, 900);
  background(192, 64, 0);
  x = 0;
  r = 2;
  y = 200;

void draw() {
  previousX = x;
  previousY = y;
  // step one unit along the x-axis each frame
  x = previousX + 1;
  // y takes a random step of up to 10 units, up or down
  randOffset = 10 - (int)random(21);
  y = y + randOffset;
  line(previousX, previousY, x, y);
  if (x < 800) save("random_walk.tif");