Wednesday, October 13, 2010

To Freemium or Not to Freemium?

This is a topic that comes up a lot when dealing with Internet businesses.  I would say that the topic is interesting, but I actually find it somewhat nauseating as I reflect on the number of online quarrels I've encountered when reading about the Freemium debate.  This is why I found it refreshing to come across this nice article from the folks at MailChimp.  It is a beautifully laid-out case study on some of the pros and cons of Freemium, and an encouragement to startups that have been around a little too long to be considered true startups.  Personally, the idea of Freemium appeals to me because it gives users a chance to taste your product for free, and gives you (the business) an opportunity to learn from your users and grow your user base.  So, "To Freemium or Not to Freemium?" -- you decide.

Thursday, July 1, 2010

PayPal Charset IPN Issue

This post is for anyone experiencing problems with the PayPal verification step in the IPN process.  After pulling my hair out trying to figure this out, I found the solution on another blog.

The Problem:
PayPal by default posts its IPNs using the Windows-1252 character encoding (why would they do that?).  If you are like any sensible UTF-8-loving developer, then you will find that your postbacks to PayPal are not receiving the VERIFIED status that ensures they are a legitimate IPN.  It seems non-ASCII characters in the IPN parameters are not being interpreted correctly by PayPal (because they are expecting Windows-1252 and you are sending UTF-8).
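
To see why the two encodings clash, it helps to look at the raw bytes.  The snippet below is only a minimal sketch and not PayPal's verification logic: the sample name is a made-up value, and percent_encode is a hypothetical stand-in for whatever your HTTP library does when it builds the postback body.

```ruby
# A made-up IPN parameter value containing one non-ASCII character (e with an acute accent).
name_utf8 = "Jos\u00E9"
name_cp1252 = name_utf8.encode("Windows-1252")

# Bytes that may appear unescaped in a form-encoded body (letters, digits, - _ . ~).
UNRESERVED_BYTES = (("A".."Z").to_a + ("a".."z").to_a + ("0".."9").to_a + %w[- _ . ~]).map(&:ord)

# Hypothetical helper: percent-encode every byte outside the unreserved set.
def percent_encode(str)
  str.bytes.map { |b| UNRESERVED_BYTES.include?(b) ? b.chr : format("%%%02X", b) }.join
end

puts percent_encode(name_utf8)    # Jos%C3%A9
puts percent_encode(name_cp1252)  # Jos%E9
```

The same name produces two different byte sequences, so a postback built from UTF-8 strings no longer matches what PayPal sent in Windows-1252.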

The Fix:
The fix is a simple setting in your PayPal business account under "Language Encodings".  You will need to select "More Options" to find the screen that allows you to select UTF-8 from the dropdown.

Tuesday, June 15, 2010

Beware of YSlow and GZip


Much time can be wasted trying to configure mod_deflate when it is already properly configured.  Although I had previously enabled compression using this simple and powerful Apache module, YSlow was giving me an F for not compressing the page text or the JS and CSS files.  To make a short story even shorter, I will simply say this:

YSlow does not accurately detect gzip compression.  If you are uncertain, check the response headers for "Content-Encoding: gzip" and "Vary: Accept-Encoding", or use Port80's Compression Check.
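
If you would rather check the headers from a script, something like the following may work.  It is a sketch under assumptions: both method names are my own, and the URL in the usage comment is a placeholder.  The request sets Accept-Encoding explicitly so that Net::HTTP leaves the Content-Encoding header on the response instead of decoding the body transparently.

```ruby
require 'net/http'
require 'uri'

# True when a Content-Encoding header value names gzip or deflate.
def compressed_encoding?(content_encoding)
  return false if content_encoding.nil?
  encodings = content_encoding.downcase.split(',').map(&:strip)
  encodings.any? { |encoding| %w[gzip x-gzip deflate].include?(encoding) }
end

# Fetch a URL while advertising gzip support, then inspect the response header.
def compressed_response?(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['Accept-Encoding'] = 'gzip, deflate'
  response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
    http.request(request)
  end
  compressed_encoding?(response['Content-Encoding'])
end

# Usage (placeholder URL, performs a real HTTP request):
#   puts compressed_response?("http://www.example.com/")
```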

Tuesday, June 1, 2010

Ruby Part-of-Speech Tagger Shootout

 An accurate and efficient Part of Speech Tagger represents a valuable tool for various areas of natural language processing.  I use POS Tagging as a means of detecting invalid text, but there are many other possible uses as well.  Regardless of how you are using a POS Tagger, you may find this benchmark of two Ruby POS Tagging libraries helpful.



The Players
There are several taggers available, but I settled on testing the following two that are available as gems and seemingly robust:
  1. EngTagger: a corpus-trained, probabilistic tagger (port of Perl Lingua::EN::Tagger)
  2. RubyTagger (rb-brill-tagger): a rule based tagger
Some of the ones excluded that you may be interested in considering:
Both EngTagger and RubyTagger provide a simple API and could be easily installed as gems.  (NOTE: I believe the RubyTagger gem has a C dependency).  In order to get a fairly accurate idea of performance, the gems were tested under 4 scenarios:

1-A: 10 times create an instance of the tagger and tag a short piece of text
1-B: 10 times create an instance of the tagger and tag a long piece of text
2-A: create an instance of the tagger once and tag a short piece of text 10 times
2-B: create an instance of the tagger once and tag a long piece of text 10 times

Before looking at the results, let's examine the main portion of the benchmark.rb file:
# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
   b.report("1-A: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-A: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
   b.report("1-B: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-B: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(LONG_TEXT) } }
end


# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
   b.report("2-A: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-A: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 2-B: load the tagger once, then process LONG text

Benchmark.bmbm do |b|
   b.report("2-B: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-B: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(LONG_TEXT) } }
end

You may notice the extra processing of the EngTagger output:
    engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
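
As a quick illustration of that conversion, here is the same split applied to a literal string.  The sample is hypothetical and only assumes the space-separated "word/TAG" format that the split above implies; no tagger is loaded.

```ruby
# Hypothetical sample in the "word/TAG" format assumed by the split above.
readable = "I/PRP love/VBP Ruby/NNP"

# Split into tokens on spaces, then split each token into a [word, tag] pair.
pairs = readable.split(' ').collect { |x| x.split('/') }

p pairs  # [["I", "PRP"], ["love", "VBP"], ["Ruby", "NNP"]]
```

Note that a token which itself contains a slash would split into more than two items, so real input may need extra care.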

The Results
The output of the benchmark script is below (and slightly tidied up).


Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger   9.340000   0.040000   9.380000 (  9.652550)
1-A: rb tagger   22.310000   2.180000  24.490000 ( 25.109431)


Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  11.880000   0.710000  12.590000 ( 12.940737)
1-B: rb tagger   23.330000   2.350000  25.680000 ( 26.337501)

Scenario 2-A: load the tagger once, then process SHORT text 10 times
                      user     system      total        real
2-A: eng tagger   0.600000   0.000000   0.600000 (  0.652037)
2-A: rb tagger    1.950000   0.240000   2.190000 (  2.252128)


Scenario 2-B: load the tagger once, then process LONG text 10 times
                      user     system      total        real
2-B: eng tagger   2.500000   0.260000   2.760000 (  2.840162)
2-B: rb tagger    2.710000   0.250000   2.960000 (  3.048174)

It appears that EngTagger outperforms RubyTagger in all tests.  But 10 iterations is a rather small sample, so what happens if we change the number of iterations to 50?

Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger  45.370000   0.310000  45.680000 ( 46.750664)
1-A: rb tagger  117.960000  11.450000 129.410000 (132.715563)


Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  57.110000   3.610000  60.720000 ( 62.123540)
1-B: rb tagger  121.770000  11.050000 132.820000 (136.239764)


Scenario 2-A: load the tagger once, then process SHORT text 50 times
                      user     system      total        real
2-A: eng tagger   0.790000   0.060000   0.850000 (  0.896051)
2-A: rb tagger    2.040000   0.230000   2.270000 (  2.340134)


Scenario 2-B: load the tagger once, then process LONG text 50 times
                      user     system      total        real
2-B: eng tagger   9.340000   0.850000  10.190000 ( 10.528600)
2-B: rb tagger    6.000000   0.470000   6.470000 (  6.636378)

With 50 iterations, EngTagger outperforms RubyTagger on all but the last task, where RubyTagger finishes in about 6.6 seconds against roughly 10.5 for EngTagger.  Interestingly, the last scenario is the one that most closely matches the real-world application being built.

Lessons Learned
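  1. EngTagger and RubyTagger perform best under different conditions: EngTagger is faster to load and on short text, while RubyTagger pulls ahead when tagging long text many times with a single instance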
  2. Benchmarks should mimic your application's usage as closely as possible
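
One way to act on the second lesson is to make the text and the iteration count parameters of the benchmark, so both usage patterns can be timed with values taken from your own application.  The sketch below uses a hypothetical StubTagger with a deliberately costly constructor in place of a real tagger; the class and method names are my own and the numbers it prints say nothing about EngTagger or RubyTagger.

```ruby
require 'benchmark'

# Hypothetical stand-in for a tagger whose constructor is costly (e.g. loading a lexicon).
class StubTagger
  def initialize
    @lexicon = (1..20_000).map { |i| ["word#{i}", "NN"] }.to_h
  end

  # Returns an array of [word, tag] pairs, defaulting unknown words to "NN".
  def tag(text)
    text.split(' ').map { |word| [word, @lexicon.fetch(word, "NN")] }
  end
end

# Time both usage patterns for a given text and iteration count.
def compare_usage(text, iterations)
  reload_each_time = Benchmark.realtime do
    iterations.times { StubTagger.new.tag(text) }
  end
  load_once = Benchmark.realtime do
    tagger = StubTagger.new
    iterations.times { tagger.tag(text) }
  end
  { :reload_each_time => reload_each_time, :load_once => load_once }
end

p compare_usage("the quick brown fox", 10)
```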

Tuesday, May 25, 2010

Statistical Analysis using Ruby

Having migrated to Ruby from a Java background, I sometimes find myself longing for the vast and robust libraries that exist in the Java ecosystem.  While preparing for a natural language processing task, I considered revisiting the Weka software that I had used in the past for machine learning and statistical analysis.  But, it sure would be nice if something existed in Ruby for this sort of work.  Enter Statsample.


While not a machine learning package, this statistics library utilizes the GNU Scientific Library to provide me with the two features upon which this task depended: multiple regression and Pearson's r correlation coefficient.  Here are some examples of both at work:

    require 'statsample'

    # Calculate correlation coefficient
    a=(1..100).collect { rand(100)}.to_scale
    b=(1..100).collect { rand(100)}.to_scale
    Statsample::Bivariate.pearson(a,b)

    # Multiple Regression
    a=1000.times.collect {rand}.to_scale
    b=1000.times.collect {rand}.to_scale
    c=1000.times.collect {rand}.to_scale
    ds={'a'=>a,'b'=>b,'c'=>c}.to_dataset
    ds['y']=ds.collect{|row| row['a']*5+row['b']*3+row['c']*2+rand()}
    lr=Statsample::Regression.multiple(ds,'y')
    puts lr.summary
    Summary for regression of a,b,c over y
    *************************************************************
    Engine: Statsample::Regression::Multiple::AlglibEngine
    Cases(listwise)=1000(1000)
    r=0.986
    r2=0.973
    Equation=0.504+5.011a + 2.995b + 1.988c
  Source: http://ruby-statsample.rubyforge.org/
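
For readers who want to see what the correlation call is computing, here is a plain-Ruby sketch of Pearson's r.  It is not Statsample's implementation; pearson_r is a name I made up, and it works on ordinary arrays rather than Statsample vectors.

```ruby
# Pearson's r: covariance of xs and ys divided by the product of their spreads.
def pearson_r(xs, ys)
  raise ArgumentError, "arrays must have the same length" unless xs.size == ys.size

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  spread_x = xs.sum { |x| (x - mean_x)**2 }
  spread_y = ys.sum { |y| (y - mean_y)**2 }

  covariance / Math.sqrt(spread_x * spread_y)
end

puts pearson_r([1, 2, 3], [2, 4, 6])  # 1.0
```

If either array is constant, the denominator is zero and the result is not a meaningful number, so guard against that case in real code.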

Wednesday, March 17, 2010

Play Video Games and Save the World

I don't have as much time for video games as I once did, but it's nice to hear that video games will solve the world's problems.  This video from TED.com points to additional emerging research suggesting that video games do not rot the mind, as was once thought:

http://www.ted.com/talks/jane_mcgonigal_gaming_can_make_a_better_world.html

Monday, March 8, 2010

Controlling Web-Bots with CurbIt

Whether you have a content-heavy site or a very application-centric website, bots and harvesters can wreak havoc by eating up CPU cycles, memory, and system resources.  These ubiquitous pests will gladly retrieve all of your site's content with little regard for copyright laws or your terms of service.

One helpful tool for limiting many types of harmful bots and crawlers is a Ruby gem or plugin called CurbIt.  CurbIt adds application level rate limiting to your Rails app.  I recently had the pleasure of utilizing CurbIt on the Paper Rater website to limit the number of submissions.  This helps us to ensure that humans are submitting documents, but without bothering our users with a CAPTCHA.

Example usage from the CurbIt GitHub page:


class InvitesController < ApplicationController
  def invite
    # invite logic...
  end

  rate_limit :invite, :max_calls => 2, :time_limit => 30.seconds, :wait_time => 1.minute
end
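
To give a feel for what application-level rate limiting involves, here is a small in-memory sketch.  It is not CurbIt's implementation, and the way it interprets max_calls, time_limit, and wait_time is my assumption based on the option names: a client that exceeds max_calls within time_limit seconds is refused for the next wait_time seconds.  SimpleRateLimiter and allow? are hypothetical names.

```ruby
# In-memory rate limiter keyed by a client identifier (for example, an IP address).
class SimpleRateLimiter
  def initialize(max_calls:, time_limit:, wait_time:)
    @max_calls = max_calls
    @time_limit = time_limit
    @wait_time = wait_time
    @calls = Hash.new { |hash, key| hash[key] = [] }
    @blocked_until = {}
  end

  # Returns true when the call is allowed, false when the client is being curbed.
  # The current time is injectable so the behavior can be tested without sleeping.
  def allow?(client_id, now = Time.now.to_f)
    return false if @blocked_until.key?(client_id) && now < @blocked_until[client_id]

    # Keep only the calls that fall inside the current time window.
    recent = @calls[client_id].select { |time| now - time < @time_limit }
    if recent.size >= @max_calls
      @blocked_until[client_id] = now + @wait_time
      @calls[client_id] = []
      return false
    end

    @calls[client_id] = recent << now
    true
  end
end

limiter = SimpleRateLimiter.new(max_calls: 2, time_limit: 30, wait_time: 60)
puts limiter.allow?("203.0.113.5", 0)  # true
puts limiter.allow?("203.0.113.5", 1)  # true
puts limiter.allow?("203.0.113.5", 2)  # false
```

A real Rails deployment would need shared storage rather than a per-process hash, which is part of what a gem like CurbIt handles for you.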

Thursday, February 18, 2010

Should You Move to the Cloud?

At Oasic we host a lot of websites, and one of the most boring tasks involves setting up new servers, either for a new website or as part of a migration/expansion.  It's one of my least favorite parts of the job.  With many of the cloud offerings out there (Heroku, Google App Engine, Engine Yard, etc.), I've begun to dip my toes in the water.  But why not just dive in?

Most Apps are NOT Designed for the Cloud

The concept of cloud hosting, along with its advantages and restrictions, is relatively new, and most application frameworks, libraries, processes, and tools were not developed with the cloud in mind.  Heroku's restriction of a read-only filesystem for all but a couple of directories means that you have to work around it.  Restrictions on background processes mean that you need to work with their related add-on or come up with another solution.  It begins to feel like deployment is a series of workarounds, and I'm not certain that's any better than setting up a new server.

Middle Ground

VPS providers like Amazon EC2, Slicehost, and WebbyNode provide a middle ground between a full cloud solution and hosting on dedicated servers.  They allow instances to be set up more easily, scaled up and down, and backed up, with reasonable prices and good uptime.  They also allow full access to the server: you can write to the filesystem, access databases directly, run background tasks, poke around on the server via SSH, create shell scripts, etc.

Further Reading

Moving to a full cloud stack involves a loss of flexibility that many businesses can't afford.  Here is a good case study about GitHub's move out of the cloud:

http://github.com/blog/493-github-is-moving-to-rackspace

And, of course, there are many businesses that benefit from the advantages of moving to the cloud: inherent load balancing, fault tolerance, pay-per-usage pricing, ease of deployment, etc.

So, review your options, consider your needs, and make the best choice for your application's hosting.

Tuesday, February 16, 2010

Blogger Beats Typo

So I set out to create a blog for my website and spent considerable time looking at possible options. Most of my development is done in Ruby these days, so I wanted to stick with a Ruby package if possible. These were my finalists:
  • Radiant CMS:  not a pure blogging solution, but a full-featured, extensible CMS with blogging modules.  It would allow me to add new functionality to my programmer-heart's delight.
  • Typo:  the oldest, tried-and-true Ruby blogging system out there.  This heavyweight is a no-nonsense blogging machine.
  • Mephisto:  a leaner blogging app than Typo, but still with nice features.
And the winner is...

Typo -- or at least so it seemed.  I used the typo gem to install my blog, added database permissions, and began configuring things.  I edited the default Hello World post that was auto-created.  I was on my way until I created my first new post.  No visible errors appeared, but the post was not being saved.  What could be going on?

I often enjoy a good troubleshooting session, but not this time around.  I had 3 blogs that needed to be set up (for various sites) and I didn't want to deal with these issues.  I decided to try out Google's hosted Blogger and the rest is history!

Although there is no Markdown, Textile, or other markup support, it does include a WYSIWYG editor with full HTML tags.  Here are some other benefits:
  • use your own domain for free (or subdomain as I have done with this blog)
  • fully customizable layout
  • API for programmatic access
  • 3 minute setup of a new blog
  • Google handles everything for you (1 less component to manage)
  • did I mention it was free?