Oasic

Search API Notes

2013-11-18T08:41:00.004-08:00

Scenario: You've got a great idea that requires indexing and/or search capabilities well beyond your budget. Where do you go from here?

Thankfully, you have a few options to choose from when deciding how to power your new app. Sadly, you have ONLY a few options to choose from. Indexing and searching the Internet is a monstrous task, which is why this industry is a natural fit for the oligopoly we see today. There are three players in this market that all offer Search APIs, but as of this writing, their products differ considerably.

Yahoo BOSS - http://developer.yahoo.com/boss/search/
If you are looking for something inexpensive, then this is it. They offer a 'limitedweb' search that is slightly smaller and not as fresh, but it's only $0.40/1000 queries, which is half the price of their 'web' offering. Other than the cost savings, this service stinks. Do not use this unless your application allows for a large margin of error and cost is the most important requirement. I've found 3 types of common problems:
- False positives: returning results that do not contain the query. It doesn't matter whether you are using an exact phrase search, boolean operators, etc. Regardless, you will get false positives from time to time.
- False negatives: matching results that are in Yahoo's index sometimes fail to be returned.
- Sporadic errors: the errors mentioned above, as well as other outages, occur frequently and randomly. Developing against this API was very frustrating because it does not return consistent results: the same query will return no results one minute, then many results a minute later.
Bottom line: DO NOT USE ON IMPORTANT WORK

Google - https://developers.google.com/custom-search/

On the other end of the spectrum is the dominant search giant. Their API is high-quality and VERY expensive ($5/1000 queries). Note that this is more than 10X the cost of Yahoo's limitedweb queries. Nevertheless, the Google results are consistent and of the quality you would expect.
Disadvantages: Besides price, Google's API results often do not match their public search results. If you have a high-volume app, the rate limits may be a deal-breaker for you (they were for us).

Microsoft Bing - http://datamarket.azure.com/dataset/8818F55E-2FE5-4CE3-A617-0B8BA8419F65
I'm rarely a fan of anything Microsoft produces, but they are the winner in my evaluation of web search APIs. They have just the right mix of consistency, price, and performance, without the restrictions of Google. They offer unlimited searches at a price that is roughly $1.25/1000 queries. This is 1/4 of Google, but still 3X more than Yahoo's limitedweb. For mission critical apps that can't afford the problems of BOSS, Bing is probably the best choice. Be sure to use the "Web Only" API if you are only using their web search, as it is cheaper than their composite search offering.
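
To put the quoted prices side by side, here is a rough back-of-the-envelope sketch. The per-1000-query prices are the ones mentioned in this post (the BOSS 'web' price is inferred from "half the price"), and the monthly volume of 250,000 queries is a made-up figure, so plug in your own numbers.

```ruby
# Price per 1,000 queries in USD, as quoted in this post.
# The BOSS 'web' price is inferred from "limitedweb is half the price of web".
PRICE_PER_1000 = {
  'Yahoo BOSS limitedweb' => 0.40,
  'Yahoo BOSS web'        => 0.80,
  'Bing'                  => 1.25,
  'Google Custom Search'  => 5.00
}

# Estimated monthly cost in USD for a given query volume.
def monthly_cost(price_per_1000, queries)
  (price_per_1000 * queries / 1000.0).round(2)
end

queries = 250_000 # hypothetical monthly volume
PRICE_PER_1000.each do |name, price|
  puts format('%-24s $%9.2f', name, monthly_cost(price, queries))
end
```

This ignores rate limits and any free tiers, which may matter more than the sticker price for a high-volume app.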

Resources for the Internet of Things

2013-08-29T11:27:00.001-07:00

Whether you refer to it as The Internet of Things, the Sensor Revolution, or the Programmable World, here are some potentially useful resources:

OpenRemote
SmartThings - Developers and Store
MakerSwarm
RaZberry - Z-wave for Raspberry Pi
Nimbits
ioBridge - complete stack
NinjaBlocks - sensors, devices, and programmable hub
Limitless LED - programmable lights
WEMO - Belkin home automation devices
Alyt - Android-based home security and automation hub & devices

Microcontrollers & Systems on a Chip

TinkerForge - Stackable, programmable boards. Appears to be simpler than Arduino in many ways.
Arduino - Everyone's favorite open-source microcontroller
BeagleBone Black
Raspberry Pi
Galago
Intel Galileo - open-source board from Intel

I will try to expand this list over time. If you have more, please add them in the comments. Thanks!

Will a Robot Take My Job?

2013-01-03T12:30:00.000-08:00

With many jobs lost and the economy teetering on recovery, this question couldn't be more relevant. The man vs machine debate has a long history and conjures up images of John Henry racing to his death against a steam hammer. The reality is that machines already do the work of people and will continue to usurp greater and greater responsibilities. If your job is not one of the ones lost, then it is quite easy to see that this expansion should be welcome as it allows for greater societal productivity. After all, do you miss doing dishes by hand? Wish you could wash your laundry in a tub and hang it out to dry? Want to pay higher prices for handmade products? The key is to understand which jobs are likely to be filled by machines. The following questions should help.

Is your work highly repetitive?
Specialized, repetitive tasks are easier to automate. It is the human ability to generalize that makes our human "wetware" different than robotic hardware. Maybe you execute the same movements as part of a manual labor position, or maybe you spend the day downloading the same data and plugging it into the same spreadsheet. Whatever the case may be, if you find yourself doing the same tasks over and over, then your job is more likely to be taken by a machine.

Are there many people in your company in the same role as you?
If so, you are a great target for automation. Developing algorithms and/or hardware to replace one person is often not cost effective, but it may be a no-brainer for a company to invest in machines that replace 100 people.

What collar do you wear? White or blue?
Previous generations of machines have found great success replacing manual labor in factories, but the next generation of machines will be more likely to replace office workers -- white collar folks. The reason for this is that massive data and computing power have reached a critical mass allowing difficult "thinking" tasks to be conquered by machines. We now have machines that can dominate Jeopardy, diagnose cancer from a breast scan, and grade essays. This trend will only increase as more data and computing power become available in the years ahead.

On the other hand, most jobs that involve service work or manual labor will be difficult for robots to replace. The reality is that complex analytical thinking will be easier to duplicate than our nerves and muscles. We take for granted how easily we handle sophisticated movements, but it will be many years before robots are equipped with the sensors and actuators comparable to a human's. Until then, don't expect the moving company to send in robots to pack up your house.

Do you interact with people?
Many jobs require a uniquely human touch, something that machines do not offer -- something that machines may never offer. If meaningful interaction with people is an important part of your job, then you are unlikely to be replaced by a machine any time soon. Don't expect to be seeing a robotic salesperson or a robotic shrink in the next century.

The bottom line is that the use of machines will continue to allow society to be more productive, and productivity often means one person doing the work of many. We should embrace this new potential, but also be cognizant of the skill sets most needed in the 21st century.

Quick Solutions: Rails 3, send_data, and garbled PDF output

2012-09-12T13:12:00.000-07:00

This is just a quick blog entry to help anyone experiencing garbled PDF output (in Safari) when using Rails 3's send_data to output dynamically-generated PDF files.

If you are following the documentation, then you are probably outputting something like this:

    send_data data, :filename => "myfile.pdf",
                          :type => 'application/pdf'

I don't know if this is specific to Rails 3, but the issue is that the Content-Type header is not being set to 'application/pdf', so setting it explicitly in the response should fix this issue:

    response.headers["Content-Type"]='application/pdf'
    send_data data, :filename => "myfile.pdf",
                          :type => 'application/pdf'

Machine Learning: Naive Bayes Classification with Ruby

2012-06-26T20:25:00.001-07:00

Maybe you've wondered, "Where are all the Ruby libraries for Machine Learning and NLP?". Despite Ruby's growing user base and ability to quickly manipulate data and text, there seems to be a dearth of tools for NLP and Machine Learning. One statistical tool that finds itself in the intersection of NLP and Machine Learning is Naive Bayes. [For those already familiar with Naive Bayes, you may wish to skip ahead to the Quickstart section below (although I promise not to be long-winded in my intro).]

This peculiarly-named approach to classification tasks is based on the well-known Bayes' Rule and is used to calculate the probability that an instance belongs in a particular class (or category) based on the components of the instance. It is called "Naive" Bayes because of the manner in which it calculates the probabilities. It treats each component as if it is independent from other components, even though this is usually not the case. Surprisingly, Naive Bayes does quite well and is optimal in the case in which the components of an instance actually are independent.

Enough generalities...Naive Bayes is used widely for text classification problems and the "components" that I referenced above are actually tokens -- typically words. Even if you have no desire to understand the probabilistic engine beneath the hood, Naive Bayes is easy to use, high performance, and accurate relative to other classifiers. It requires a 2 step process:

1) Train the classifier by providing it with sets of tokens (e.g., words) accompanied by a class (e.g., 'SPAM')
2) Run the trained classifier on un-classified (i.e., unlabelled) tokens and it will predict a class
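
For the curious, the two steps above can be sketched in a few lines of plain Ruby. This is a toy written for illustration only, not how the nbayes gem is implemented; the class name TinyBayes and the training sentences are made up, and it assumes add-one (Laplace) smoothing.

```ruby
# Toy Naive Bayes text classifier; NOT the nbayes gem's implementation.
class TinyBayes
  def initialize
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # class => token => count
    @doc_counts   = Hash.new(0)                            # class => number of training examples
    @vocab        = {}                                     # every token seen in training
  end

  # Step 1: record the tokens seen for a labelled example.
  def train(tokens, klass)
    @doc_counts[klass] += 1
    tokens.each do |t|
      @token_counts[klass][t] += 1
      @vocab[t] = true
    end
  end

  # Step 2: score each class as log P(class) + sum of log P(token | class),
  # using add-one (Laplace) smoothing, and return the best-scoring class.
  def classify(tokens)
    total_docs = @doc_counts.values.inject(:+).to_f
    scores = {}
    @doc_counts.each do |klass, n|
      class_total = @token_counts[klass].values.inject(0, :+)
      score = Math.log(n / total_docs)
      tokens.each do |t|
        score += Math.log((@token_counts[klass][t] + 1.0) / (class_total + @vocab.size))
      end
      scores[klass] = score
    end
    scores.max_by { |_, s| s }.first
  end
end

nb = TinyBayes.new
nb.train(%w[buy cheap viagra now], 'SPAM')
nb.train(%w[buy this stock now], 'SPAM')
nb.train(%w[lunch with bob tomorrow], 'HAM')
nb.train(%w[do you offer consulting], 'HAM')
puts nb.classify(%w[buy viagra])      # SPAM
puts nb.classify(%w[lunch tomorrow])  # HAM
```

Notice that each token contributes to the score on its own, with no regard for its neighbours. That is the "naive" independence assumption at work.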

Quickstart

gem install nbayes

After that, it's time to begin training the classifier:

require 'nbayes'
# create new classifier instance
nbayes = NBayes::Base.new
# train it - notice split method used to tokenize text (more on that below)
nbayes.train( "You need to buy some Viagra".split(/\s+/), 'SPAM' )
nbayes.train( "This is not spam, just a letter to Bob.".split(/\s+/), 'HAM' )
nbayes.train( "Hey Oasic, Do you offer consulting?".split(/\s+/), 'HAM' )
nbayes.train( "You should buy this stock".split(/\s+/), 'SPAM' )

Finally, let's use it to classify a document:

# tokenize message
tokens = "Now is the time to buy Viagra cheaply and discreetly".split(/\s+/)
result = nbayes.classify(tokens)
# print likely class (SPAM or HAM)
p result.max_class
# print probability of message being SPAM
p result['SPAM']
# print probability of message being HAM
p result['HAM']

But that's not all! I'm claiming that this is a full-featured Naive Bayes implementation, so I better back that up with information about all the goodies. Here we go:

Features

Works with all types of tokens, not just text. Of course, because of this, we leave tokenization up to you.
Disk based persistence
Allows prior distribution on classes to be assumed uniform (optional)
Outputs probabilities, instead of just class w/max probability
Customizable constant value for Laplacian smoothing
Optional and customizable purging of low-frequency tokens (for performance)
Optional binarized mode to reduce the impact of repeated words
Uses log probabilities to avoid underflow
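
Why bother with log probabilities? Here is a small sketch of the underflow problem in plain Ruby. The 400 tokens and the probability of 0.001 per token are made-up numbers chosen to make the effect obvious.

```ruby
# 400 tokens, each with a (made-up) probability of 0.001 under some class.
probs = Array.new(400, 0.001)

# Multiplying directly underflows: the true product is 1e-1200, far below
# the smallest positive Float, so the result collapses to 0.0.
product = probs.inject(1.0) { |acc, p| acc * p }

# Summing logarithms keeps the same information in a comfortable range.
log_sum = probs.inject(0.0) { |acc, p| acc + Math.log(p) }

puts product  # 0.0
puts log_sum  # roughly -2763.1
```

Once every class scores 0.0 there is nothing left to compare, whereas the log sums can still be ranked against each other.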

I hope to post examples of these features in action in a future post. Until then, view nbayes_spec.rb for usage.

Solutions to invalid byte sequence in UTF-8 (ArgumentError)

2012-04-16T11:06:00.002-07:00

Let's keep this short and simple. You're reading this because you too are tired of all the new character set issues in Ruby 1.9. This page lists some possible solutions and links to other pages with helpful info on solving "invalid byte sequence in UTF-8 (ArgumentError)".

The Cause
This problem occurs when your application is expecting one character encoding and gets a different one.
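
A minimal way to see the mismatch for yourself, assuming a string whose bytes are really Windows-1252 but which Ruby has been told is UTF-8 (the sample bytes are made up for illustration):

```ruby
# The byte 0xE9 is "é" in Windows-1252, but on its own it is not a valid
# UTF-8 sequence.
bytes = [0x63, 0x61, 0x66, 0xE9].pack('C*') # "caf" + 0xE9, tagged as binary

mislabelled = bytes.dup.force_encoding('UTF-8')
puts mislabelled.valid_encoding?  # false: Ruby expects UTF-8, the bytes are not

relabelled = bytes.dup.force_encoding('Windows-1252')
puts relabelled.valid_encoding?   # true: the label now matches the bytes
puts relabelled.encode('UTF-8')   # café
```

String operations that have to interpret the characters of a string like the mislabelled one are the ones that tend to blow up with this ArgumentError.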

Possible Solutions:

1) Explicitly specify the encodings that you expect in the top-level Ruby Encoding class. Put this in a file that will be loaded (e.g., boot.rb for Rails):


Encoding.default_external = 'UTF-8'
Encoding.default_internal = 'UTF-8'

2) Convert the character encoding on a lower level within your application


ascii_str.encode("UTF-8")

More info: http://blog.grayproductions.net/articles/ruby_19s_string

3) Finally, if the text you are working with has a mix of character encodings (occasionally this happens when data is aggregated improperly), then you may need to decide what is the most common character set and convert from that to the new char set (e.g., UTF-8), while ignoring any other unrecognizable characters.

You can do this using iconv from the command line or from within Ruby.

within Ruby:

require 'iconv'  # part of the standard library in Ruby 1.9

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)

(source: http://smyck.net/2011/05/13/files-with-mixed-and-invalid-encodings-in-ruby/)

command line:

iconv -f WINDOWS-1252 -t "UTF-8//IGNORE" some_text.txt > some_text.utf8.txt
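
A note for readers on newer Rubies: Iconv was dropped from the standard library in Ruby 2.0. The sketch below shows roughly the same cleanup using only built-in String methods, assuming Ruby 2.1 or later (for String#scrub). The sample byte strings are made up for illustration.

```ruby
# "abc", then a stray 0xFF byte (never valid in UTF-8), then "def".
untrusted_string = [0x61, 0x62, 0x63, 0xFF, 0x64, 0x65, 0x66].pack('C*').force_encoding('UTF-8')

# Drop invalid byte sequences, similar in spirit to Iconv's //IGNORE.
valid_string = untrusted_string.scrub('')
puts valid_string  # abcdef

# Or convert from a known source encoding to UTF-8, silently dropping
# anything that cannot be converted.
legacy = [0x63, 0x61, 0x66, 0xE9].pack('C*').force_encoding('Windows-1252')
converted = legacy.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
puts converted     # café
```

Passing a visible marker such as '?' instead of an empty replacement makes it easier to spot where data was lost.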

4) Catch the error and handle it more gracefully:


begin
  # your code
rescue ArgumentError => e   
  print "error: #{e}"
end

Taking a Random Walk with Processing 1.5

2012-01-19T13:02:00.000-08:00

If you've ever dabbled with simulations you have probably come across Processing, the open source environment for animation, interaction, and much more. The Java-ish scripting language allows for quick prototyping and the library of examples allows even beginners (me) to get moving quickly.

To the right is an image I created from a very simple simulation of a "random walk" along the y-axis as time moved along the x-axis. Many have noted the similarity between this process and the peaks and valleys of markets, mountain ranges, and other stochastic processes. If you haven't already, give it a try.

Source for Random Walk:

int x, y, r, middle, randOffset, previousX, previousY;

 void setup() {
   size(900, 900);
   stroke(0);
   background(192, 64, 0);
   x = 0;
   r = 2;
   y = 200;
 } 

 void draw() {
   previousX = x;
   previousY = y;
   x++;
   // move y up or down by a random offset between -10 and 10
   randOffset = 10 - (int)random(21);
   y = y + randOffset;
   //line(150, 25, mouseX, mouseY);
   stroke(0);
   line(previousX, previousY, x, y);
   // save the image once, when the walk reaches x = 800
   if (x == 800) save("random_walk.tif");
 }

Rails 3, Ruby 1.9.2, and Character Encoding Nightmares

2011-08-17T13:38:00.000-07:00

Ruby (and Rails) can't take all the blame for this mess. No, it's been present as long as humanity has grasped the social innovation of language. Even if your goals are not quite so high as the Tower of Babel (perhaps just writing a cool web app), it's understandable to feel some sort of divine opposition when dealing with frustrating character set issues. Now that Ruby 1.9 and Rails 3 have become wise to character encodings, the reality of dealing with this messy subject is playing out in web applications and wreaking havoc on many. Having dealt with these problems in many web applications from Java to Ruby over the years, it's easy for me to think that immunity is certain, even deserved, but think again.

The Problem Described
Just recently, a very popular website that I consult on was occasionally throwing the following error that is not uncommon after upgrading to Rails 3 and Ruby 1.9.x:

ActionView::Template::Error: incompatible character encodings: ASCII-8BIT and UTF-8

I've received this type of error (incompatible character encodings) before, but not in a Rails view. Nevertheless, I began by inspecting the usual suspects.

MySQL: SHOW VARIABLES LIKE '%char%';
Ok, all clear there...

application.rb includes config.encoding="UTF-8" -- YES
using 'mysql2' gem -- YES
setting # encoding: utf-8 at the top of any ruby files with UTF-8 string literals -- YES

running 'locale' on my linux system...UTF-8... -- YES
firing up the Rails console and printing out Encoding.default_internal and Encoding.default_external... -- YES

If you reach this point and you are still receiving the problem, then it should be clear that this is probably not going to be solved easily. Fortunately, the stack trace gave me the line in question from my erb file, and Airbrake (formerly Hoptoad) provided the specific input that was causing the problem. Using this, I was able to duplicate the error with a Rails functional test and begin looking for solutions.

A string was being output in the view such that changing it to
<%= mystring.force_encoding("ASCII-8BIT") %>
solved the problem, but not really. This caused the error to go away, but only by asking that Rails treat a UTF-8 string as an ASCII string, which could lead to more serious problems. And why would setting a string to ASCII fix the problem in a UTF-8 Rails app? Shouldn't that cause the problem?

After further tinkering, I'll describe my current understanding of what happened; and if you're reading along, hopefully this will be helpful in allowing you to solve a similar problem that you might be having. As I will get to later, a string that was output higher up in the erb file had caused Erb to switch its encoding from UTF-8 to ASCII-8BIT. Perhaps this functionality was added to allow Erb to gracefully handle any character encoding. The problem in this case arose when 2 strings with different character encodings were being output in the same erb file. So, it was simply a matter of tracking down which other string caused the problem.

As it turns out, it was a string populated from an HTTP request using Net::HTTP, which sets the encoding to "ASCII-8BIT", even when the actual encoding is "UTF-8". Since the string was a UTF-8 string mis-labelled as ASCII, the solution was to call http_string.force_encoding("UTF-8"), which correctly labelled the string. After this change was made, all tests passed. I hope you have the same luck with your related problem. :-)
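
The error itself can be reproduced outside of Rails. This is a minimal sketch in plain Ruby; the sample strings are made up, and http_string here merely stands in for a response body that arrived labelled as ASCII-8BIT.

```ruby
utf8_string = "na\u00EFve"   # a genuine UTF-8 string ("naïve")
http_string = "caf\u00E9".b  # the UTF-8 bytes of "café", labelled ASCII-8BIT

error_message = nil
begin
  utf8_string + http_string  # two non-ASCII strings with different labels
rescue Encoding::CompatibilityError => e
  error_message = e.message
  puts "error: #{error_message}"
end

# Re-label the bytes (no conversion happens), then combine again.
http_string.force_encoding('UTF-8')
combined = utf8_string + http_string
puts combined.encoding  # UTF-8
```

Note that force_encoding only changes the label on the string. It is the right fix here because the bytes were UTF-8 all along.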

FeedTools + Ruby 1.9.2 + Rails 3

2011-08-08T12:51:00.000-07:00

During an upgrade of a RoR app, the following was encountered:

ruby -c /usr/local/rvm/gems/ruby-1.9.2-p0/gems/feedtools-0.2.29/lib/feed_tools/helpers/uri_helper.rb
/usr/local/rvm/gems/ruby-1.9.2-p0/gems/feedtools-0.2.29/lib/feed_tools/helpers/uri_helper.rb:43: invalid multibyte char (US-ASCII)
/usr/local/rvm/gems/ruby-1.9.2-p0/gems/feedtools-0.2.29/lib/feed_tools/helpers/uri_helper.rb:43: invalid multibyte char (US-ASCII)
/usr/local/rvm/gems/ruby-1.9.2-p0/gems/feedtools-0.2.29/lib/feed_tools/helpers/uri_helper.rb:43: syntax error, unexpected $end, expecting ')'
if IDN::Idna.toASCII('http://www.詹姆斯.com/') ==
^

To fix, simply add the following to the first line of uri_helper.rb in the feed_tools gem directory:

# encoding: utf-8

This allows the Ruby interpreter to correctly interpret the file as a UTF-8 file.

Using Mysqldump Without Downtime

2011-04-26T07:14:00.000-07:00

Mysqldump is a great utility for creating easy to use, easy to restore mysql database backups, but it can cause downtime if certain precautions are not taken. If you manage MySQL across multiple Linux flavors, various default configurations can cause mysqldump to act in unpredictable ways. Let's look at a couple common problem areas related to mysqldump.

Using the Right Storage Engine
If you're not certain which MySQL storage engine to use, then consider using InnoDB (instead of MyISAM) if you are not already doing so. Using mysqldump to back up a large MyISAM table can cause the entire table to lock until the backup is complete. Even though the long-running read done by mysqldump does not block other reads of the same table, an update query issued during the dump has to wait for it, and all subsequent reads are then queued behind that update and therefore blocked. So, switching to InnoDB will solve this type of problem due to its usage of row-level locking.

Using --single-transaction
On most of the MySQL installations that I've managed, simply using InnoDB will allow for backups to be created w/o tables locking and queries/updates blocking. However, a recent Ubuntu installation continued to block our Rails application from using the database while mysqldump was running and it was necessary to use the parameter "--single-transaction", which dumps InnoDB tables from a consistent snapshot inside a single transaction instead of locking them, like so:

mysqldump -u myuser -pmypass --single-transaction db_name > output.sql

If you have any tips to share, I welcome them in the comments.

To Freemium or Not to Freemium?

2010-10-13T19:49:00.000-07:00

This is a topic that comes up a lot when dealing with Internet businesses. I would say that the topic is interesting, but I actually find it somewhat nauseating as I reflect on the number of online quarrels I've encountered when reading about the Freemium debate. This is why I found it refreshing to come across this nice article from the folks at MailChimp. A beautifully laid out case study on some of the pros and cons of Freemium, and an encouragement to startups that have been around a little too long to be considered a true startup. Personally, the idea of Freemium appeals to me as it gives users a chance to taste your product for free, and gives you (the business) an opportunity to learn from your users and grow your user base. So, "To Freemium or Not to Freemium?" -- you decide.

PayPal Charset IPN Issue

2010-07-01T14:15:00.000-07:00

This post is for anyone experiencing problems with the PayPal verification step in the IPN process. After pulling my hair out trying to figure this out, I found the solution on another blog.

The Problem:
PayPal by default posts its IPNs using the Windows-1252 character encoding (why would they do that?). If you are like any sensible UTF-8-loving developer, then you will find that your postbacks to PayPal are not receiving the VERIFIED status that ensures they are a legitimate IPN. It seems non-ASCII characters in the IPN parameters are not being interpreted correctly by PayPal (because they are expecting Windows-1252 and you are sending UTF-8).
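
To see why the two sides disagree, compare the bytes of the same text in each encoding. This is just an illustrative sketch; the name below is a made-up IPN field value.

```ruby
name = "Jos\u00E9"  # "José", a hypothetical value from an IPN field

utf8_bytes   = name.encode('UTF-8').bytes         # [74, 111, 115, 195, 169]
cp1252_bytes = name.encode('Windows-1252').bytes  # [74, 111, 115, 233]

puts utf8_bytes.inspect
puts cp1252_bytes.inspect
puts utf8_bytes == cp1252_bytes  # false
```

Plain ASCII values are identical in both encodings, which would explain why the problem only shows up for some IPNs.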

The Fix:
A simple setting in your business PayPal account under "Language Encodings". You will need to select "More Options" to find the screen that allows you to select UTF-8 from the dropdown.

Beware of YSlow and GZip

2010-06-15T13:46:00.000-07:00

Much time can be wasted trying to configure mod_deflate IF it is already properly configured. Although I had previously enabled compression using this simple and powerful Apache module, YSlow was giving me an F for not compressing the page text as well as js and css files. To make a short story even shorter I will simply say this:

YSlow does not accurately detect gzip compression. If you are uncertain, check the response headers for "Content-Encoding: gzip" and "Vary: Accept-Encoding", or use Port80's Compression Check.
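
If you would rather script the check, one header that signals a compressed body is Content-Encoding: gzip. Below is a small sketch of a helper that inspects a set of response headers. The method name and the plain-Hash input format are my own assumptions; adapt them to whatever your HTTP client returns.

```ruby
# Returns true when a response's headers indicate a gzip-compressed body.
# `headers` is a plain Hash of header name => value (a hypothetical input
# format, not tied to any particular HTTP library).
def gzip_compressed?(headers)
  normalized = {}
  headers.each { |k, v| normalized[k.to_s.downcase] = v.to_s.downcase }
  normalized.fetch('content-encoding', '').include?('gzip')
end

puts gzip_compressed?('Content-Encoding' => 'gzip', 'Vary' => 'Accept-Encoding')  # true
puts gzip_compressed?('Content-Type' => 'text/html')                              # false
```

Keep in mind that the server only compresses when the request advertises support for it via an Accept-Encoding: gzip request header.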

Ruby Part-of-Speech Tagger Shootout

2010-06-01T13:31:00.000-07:00

An accurate and efficient Part of Speech Tagger represents a valuable tool for various areas of natural language processing. I use POS Tagging as a means of detecting invalid text, but there are many other possible uses as well. Regardless of how you are using a POS Tagger, you may find this benchmark of two Ruby POS Tagging libraries helpful.

The Players
There are several taggers available, but I settled on testing the following two that are available as gems and seemingly robust:

EngTagger: a corpus-trained, probabilistic tagger (port of Perl Lingua::EN::Tagger)
RubyTagger (rb-brill-tagger): a rule based tagger

Some of the ones excluded that you may be interested in considering:

Mark Watson's Tagger
YamCha - Yet Another Multipurpose CHunk Analyzer

Both EngTagger and RubyTagger provide a simple API and could be easily installed as gems. (NOTE: I believe the RubyTagger gem has a C dependency). In order to get a fairly accurate idea of performance, the gems were tested under 4 scenarios:

1-A: 10 times create an instance of the tagger and tag a short piece of text
1-B: 10 times create an instance of the tagger and tag a long piece of text
2-A: create an instance of the tagger once and tag a short piece of text 10 times
2-B: create an instance of the tagger once and tag a long piece of text 10 times

Before looking at the results, let's examine the main portion of the benchmark.rb file:

# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
   b.report("1-A: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-A: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
   b.report("1-B: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-B: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(LONG_TEXT) } }
end


# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
   b.report("2-A: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-A: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 2-B: load the tagger once, then process LONG text

Benchmark.bmbm do |b|
   b.report("2-B: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-B: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(LONG_TEXT) } }
end

You may notice the extra processing of the EngTagger output:
engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
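
Here is what that conversion does, shown on a stand-in string. The sample below assumes the word/TAG format that get_readable returns; the actual words and tags are made up.

```ruby
# Hypothetical get_readable output; the real tags depend on the tagger.
readable = 'Alice/NNP chased/VBD the/DET cat/NN'

pairs = readable.split(' ').collect { |x| x.split('/') }
p pairs  # [["Alice", "NNP"], ["chased", "VBD"], ["the", "DET"], ["cat", "NN"]]
```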

The Results
The output of the benchmark script is below (and slightly tidied up).

Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger   9.340000   0.040000   9.380000 ( 9.652550)
1-A: rb tagger   22.310000   2.180000 24.490000 ( 25.109431)

Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger 11.880000   0.710000 12.590000 ( 12.940737)
1-B: rb tagger   23.330000   2.350000 25.680000 ( 26.337501)

Scenario 2-A: load the tagger once, then process SHORT text 10 times
                      user     system      total        real
2-A: eng tagger   0.600000   0.000000   0.600000 ( 0.652037)
2-A: rb tagger    1.950000   0.240000   2.190000 ( 2.252128)

Scenario 2-B: load the tagger once, then process LONG text 10 times
                      user     system      total        real
2-B: eng tagger   2.500000   0.260000   2.760000 ( 2.840162)
2-B: rb tagger    2.710000   0.250000   2.960000 ( 3.048174)

It appears that the EngTagger outperforms the RubyTagger in all tests, BUT what happens if we change the number of iterations to 50 (rather than 10)? After all, 10 iterations is a rather small sample:

Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger 45.370000   0.310000 45.680000 ( 46.750664)
1-A: rb tagger 117.960000 11.450000 129.410000 (132.715563)

Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger 57.110000   3.610000 60.720000 ( 62.123540)
1-B: rb tagger 121.770000 11.050000 132.820000 (136.239764)

Scenario 2-A: load the tagger once, then process SHORT text 50 times
                      user     system      total        real
2-A: eng tagger   0.790000   0.060000   0.850000 ( 0.896051)
2-A: rb tagger    2.040000   0.230000   2.270000 ( 2.340134)

Scenario 2-B: load the tagger once, then process LONG text 50 times
                      user     system      total        real
2-B: eng tagger   9.340000   0.850000 10.190000 ( 10.528600)
2-B: rb tagger    6.000000   0.470000   6.470000 ( 6.636378)

With 50 iterations, the EngTagger outperforms RubyTagger on all but the last task. Interestingly, the last scenario is the one that most closely matches the real-world application being built.

Lessons Learned

EngTagger and RubyTagger perform optimally under different conditions
Benchmarks should mimic your application's usage as closely as possible

Statistical Analysis using Ruby

2010-05-25T14:10:00.000-07:00

Having migrated to Ruby from a Java background, I sometimes find myself longing for the vast and robust libraries that exist in the Java ecosystem. While preparing for a natural language processing task, I considered revisiting the Weka software that I had used in the past for machine learning and statistical analysis. But, it sure would be nice if something existed in Ruby for this sort of work. Enter Statsample.

While not a machine learning package, this statistics library utilizes the Gnu Scientific Library to provide me with the two features upon which this task depended: Multiple Regression and Pearson r Correlation Coefficient. Here are some examples of both at work:

    require 'statsample'

    # Calculate correlation coefficient
    a=(1..100).collect { rand(100)}.to_scale
    b=(1..100).collect { rand(100)}.to_scale
    Statsample::Bivariate.pearson(a,b)

    # Multiple Regression
    a=1000.times.collect {rand}.to_scale
    b=1000.times.collect {rand}.to_scale
    c=1000.times.collect {rand}.to_scale
    ds={'a'=>a,'b'=>b,'c'=>c}.to_dataset
    ds['y']=ds.collect{|row| row['a']*5+row['b']*3+row['c']*2+rand()}
    lr=Statsample::Regression.multiple(ds,'y')
    puts lr.summary

    # Output:
    Summary for regression of a,b,c over y
    *************************************************************
    Engine: Statsample::Regression::Multiple::AlglibEngine
    Cases(listwise)=1000(1000)
    r=0.986
    r2=0.973
    Equation=0.504+5.011a + 2.995b + 1.988c

Source: http://ruby-statsample.rubyforge.org/
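
If you are wondering what the Pearson r call computes, here is a plain-Ruby sketch of the textbook formula (covariance divided by the product of the standard deviations). It is written for illustration and is not Statsample's implementation; the method name and sample data are my own.

```ruby
# Pearson correlation coefficient r for two equal-length numeric arrays.
def pearson(xs, ys)
  raise ArgumentError, 'arrays must be the same length' unless xs.size == ys.size
  n = xs.size.to_f
  mean_x = xs.inject(:+) / n
  mean_y = ys.inject(:+) / n
  cov   = xs.zip(ys).inject(0.0) { |s, (x, y)| s + (x - mean_x) * (y - mean_y) }
  var_x = xs.inject(0.0) { |s, x| s + (x - mean_x)**2 }
  var_y = ys.inject(0.0) { |s, y| s + (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

puts pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly correlated: 1.0
puts pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly anti-correlated: -1.0
```

For two arrays of independent random numbers, like the ones in the example above, r should land close to 0.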

Play Video Games and Save the World

2010-03-17T07:45:00.000-07:00

I don't have as much time for video games as I once did, but it's nice to hear that video games will solve the world's problems. This video from TED.com points to additional emerging research that video games do not rot the mind as once thought:

http://www.ted.com/talks/jane_mcgonigal_gaming_can_make_a_better_world.html

Controlling Web-Bots with CurbIt

2010-03-08T11:01:00.000-08:00

Whether you have a content-heavy site or a very application-centric website, bots and harvesters can wreak havoc by eating up CPU cycles, memory, and system resources. These ubiquitous pests will gladly retrieve all of your site's content with little regard for copyright laws or your terms of service.

One helpful tool for limiting many types of harmful bots and crawlers is a Ruby gem or plugin called CurbIt. CurbIt adds application level rate limiting to your Rails app. I recently had the pleasure of utilizing CurbIt on the Paper Rater website to limit the number of submissions. This helps us to ensure that humans are submitting documents, but without bothering our users with a CAPTCHA.

Example usage from the CurbIt github page:

class InvitesController < ApplicationController
  def invite
    # invite logic...
  end

  rate_limit :invite, :max_calls => 2, :time_limit => 30.seconds, :wait_time => 1.minute
end
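
To give a feel for what application-level rate limiting involves, here is a toy in-memory limiter in plain Ruby. It is not CurbIt's implementation: the class name is made up, it only models the max_calls and time_limit ideas (not wait_time), and it keeps its state in a Hash that would not survive across multiple server processes.

```ruby
# Toy in-memory rate limiter; NOT CurbIt's implementation.
class TinyRateLimiter
  def initialize(max_calls, time_limit)
    @max_calls  = max_calls   # allowed calls per window
    @time_limit = time_limit  # window length in seconds
    @calls      = Hash.new { |h, k| h[k] = [] } # client key => call timestamps
  end

  # Returns true if the call is allowed, false if the client is over the limit.
  # `now` is injectable so the behaviour can be checked without sleeping.
  def allow?(key, now = Time.now.to_f)
    recent = @calls[key].select { |t| now - t < @time_limit }
    allowed = recent.size < @max_calls
    recent << now if allowed
    @calls[key] = recent
    allowed
  end
end

limiter = TinyRateLimiter.new(2, 30)
puts limiter.allow?('1.2.3.4', 0.0)   # true
puts limiter.allow?('1.2.3.4', 1.0)   # true
puts limiter.allow?('1.2.3.4', 2.0)   # false, third call within 30 seconds
puts limiter.allow?('1.2.3.4', 31.5)  # true, the earlier calls have aged out
```

Keying on the client's IP address, as in the usage above, is one common choice, but a session or user id works just as well.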

Should You Move to the Cloud?

2010-02-18T06:54:00.000-08:00

At Oasic we host a lot of websites and one of the most boring tasks involves setting up new servers, either for a new website or as part of a migration/expansion. It's one of my least favorite parts of the job. With many of the cloud offerings out there (Heroku, Google AppEngine, EngineYard, etc.), I've begun to dip my toes in the water. But why not just dive in?

Most Apps are NOT Designed for the Cloud

The concept of cloud hosting, along with its advantages and restrictions, is a relatively new phenomenon and most application frameworks, libraries, processes, and tools are not developed with the cloud in mind. Heroku's simple restriction of a read-only filesystem for all but a couple of directories means that you have to work around this. Restrictions on background processes mean that you need to work with their related add-on, or come up with another solution. It begins to feel like deployment is a series of workarounds, and I'm not certain that's any better than setting up a new server.

Middle Ground

VPS providers like Amazon EC2, Slicehost, and WebbyNode provide a middle ground between a full cloud solution and hosting on dedicated servers. They allow instances to be set up more easily, scaled up and down, and backed up, all at nice prices and with good uptime. And, they allow full access to the server. You can write to the filesystems, access databases directly, run background tasks, poke around on the server via SSH, create shell scripts, etc.

Further Reading

Moving to a full cloud stack requires a loss of flexibility that many businesses can't allow. Here is a good case study about GitHub's move out of the cloud:

http://github.com/blog/493-github-is-moving-to-rackspace

And, of course, there are many businesses that benefit from the advantages of moving to the cloud: inherent load-balancing, fault-tolerance, pay-per-usage, ease of deployment, etc.

So, review your options, consider your needs, and make the best choice for your application's hosting.

Blogger Beats Typo

2010-02-16T07:58:00.000-08:00

So I set out to create a blog for my website and spent some considerable time looking at possible options. Most of my development is done in Ruby these days, so I wanted to stick with a Ruby package if possible. These were my finalists:

Radiant CMS: not a pure blogging solution, but a full-featured, extensible CMS with blogging modules. It would allow me to add new functionality to my programmer-heart's delight.
Typo: the oldest, tried-and-true Ruby blogging system out there. This heavyweight is a no-nonsense blogging machine.
Mephisto: a leaner blogging app than Typo, but still with nice features.

And the winner is...

Typo -- at least it seemed. I used the typo gem to install my blog. Added database permissions and began configuring things. I edited the default Hello World post that was auto-created. I was on my way until I created my first new post. No visual errors appeared, but the posts were not being saved. What could be going on?

I often enjoy a good troubleshooting session, but not this time around. I had 3 blogs needing to be set up (for various sites) and I didn't want to deal with these issues. I decided to try out Google's hosted Blogger and the rest is history!

Although there is no Markdown, Textile, or other markup support, it does include a WYSIWYG editor with full html tags. Here are some other benefits:

use your own domain for free (or subdomain as I have done with this blog)
fully customizable layout
API for programmatic access
3 minute setup of a new blog
Google handles everything for you (1 less component to manage)
did I mention it was free?