Wednesday, October 13, 2010
To Freemium or Not to Freemium?
This topic comes up a lot when dealing with Internet businesses. I would say that the topic is interesting, but I actually find it somewhat nauseating when I reflect on the number of online quarrels I've encountered while reading about the Freemium debate. This is why I found it refreshing to come across this nice article from the folks at MailChimp. It is a beautifully laid out case study on some of the pros and cons of Freemium, and an encouragement to startups that have been around a little too long to be considered true startups. Personally, the idea of Freemium appeals to me as it gives users a chance to taste your product for free, and gives you (the business) an opportunity to learn from your users and grow your user base. So, "To Freemium or Not to Freemium?" -- you decide.
Thursday, July 1, 2010
PayPal Charset IPN Issue
This post is for anyone experiencing problems with the PayPal verification step in the IPN process. After pulling my hair out trying to figure this out, I found the solution on another blog.
The Problem:
PayPal by default posts its IPNs using the Windows-1252 character encoding (why would they do that?). If you are like any sensible UTF-8-loving developer, then you will find that your postbacks to PayPal are not receiving the VERIFIED status that ensures they are a legitimate IPN. It seems non-ASCII characters in the IPN parameters are not being interpreted correctly by PayPal (because they are expecting Windows-1252 and you are sending UTF-8).
The Fix:
The fix is a simple setting in your PayPal business account under "Language Encodings". Select "More Options" to reach the screen where you can choose UTF-8 from the dropdown.
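To see why the mismatch breaks verification, here is a minimal sketch of what the same name looks like on the wire in each encoding. The helper names percent_bytes and ipn_postback_body are hypothetical, and the sketch assumes the usual IPN rule that the postback is the original message, unchanged, with cmd=_notify-validate added in front.

```ruby
# Percent-encode every byte so the difference between encodings is visible.
# (Hypothetical helper, for illustration only.)
def percent_bytes(str)
  str.bytes.map { |byte| format("%%%02X", byte) }.join
end

# Build the verification postback body: the original message, byte for byte,
# with cmd=_notify-validate prepended. (Hypothetical helper name.)
def ipn_postback_body(raw_body)
  "cmd=_notify-validate&" + raw_body
end

name_utf8   = "Jos\u00E9"                       # "José" as UTF-8
name_cp1252 = name_utf8.encode("Windows-1252")  # same text in PayPal's default charset

puts percent_bytes(name_utf8)    # %4A%6F%73%C3%A9
puts percent_bytes(name_cp1252)  # %4A%6F%73%E9

puts ipn_postback_body("first_name=Jos%E9")
```

If your framework decodes the incoming parameters as UTF-8 and re-encodes them for the postback, the bytes no longer match what PayPal sent, which is one way to end up without a VERIFIED response. Posting back the raw request body, or switching the account to UTF-8 as described above, avoids the mismatch.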
Tuesday, June 15, 2010
Beware of YSlow and GZip
Much time can be wasted trying to configure mod_deflate IF it is already properly configured. Although I had previously enabled compression using this simple and powerful Apache module, YSlow was giving me an F for not compressing the page text as well as js and css files. To make a short story even shorter I will simply say this:
YSlow does not accurately detect gzip compression. If you are uncertain, check the response headers for "Content-Encoding: gzip" (usually accompanied by "Vary: Accept-Encoding") or use Port80's Compression Check.
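Checking the headers yourself takes only a few lines. The following is a rough sketch rather than a full diagnostic tool; compressed_response? and fetch_headers are hypothetical helper names, and the URL in the usage comment is a placeholder.

```ruby
require "net/http"
require "uri"

# True when the headers say the body was sent gzip- or deflate-compressed.
# `headers` is a Hash of header name => value; names are matched case-insensitively.
def compressed_response?(headers)
  normalized = {}
  headers.each { |name, value| normalized[name.to_s.downcase] = value.to_s.downcase }
  encoding = normalized.fetch("content-encoding", "")
  encoding.include?("gzip") || encoding.include?("deflate")
end

# Fetch a URL while advertising gzip support and return its response headers.
# Setting Accept-Encoding explicitly stops Net::HTTP from transparently
# decoding the body, so the Content-Encoding header stays visible.
def fetch_headers(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.get(uri.request_uri, { "Accept-Encoding" => "gzip" })
  end
  headers = {}
  response.each_header { |name, value| headers[name] = value }
  headers
end

# Usage (placeholder URL):
#   puts compressed_response?(fetch_headers("http://www.example.com/"))
```

Note that a "Vary: Accept-Encoding" header on its own only tells caches that the response depends on the request's Accept-Encoding; it is the Content-Encoding header that shows the body was actually compressed.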
Tuesday, June 1, 2010
Ruby Part-of-Speech Tagger Shootout
An accurate and efficient Part of Speech Tagger represents a valuable tool for various areas of natural language processing. I use POS Tagging as a means of detecting invalid text, but there are many other possible uses as well. Regardless of how you are using a POS Tagger, you may find this benchmark of two Ruby POS Tagging libraries helpful.
The Players
There are several taggers available, but I settled on testing the following two, which are available as gems and seem robust:
- EngTagger: a corpus-trained, probabilistic tagger (port of Perl Lingua::EN::Tagger)
- RubyTagger (rb-brill-tagger): a rule-based tagger
Other available taggers, which I did not test:
- Mark Watson's Tagger
- YamCha - Yet Another Multipurpose CHunk Analyzer
Each tagger was run through four scenarios:
1-A: create an instance of the tagger and tag a short piece of text, 10 times over
1-B: create an instance of the tagger and tag a long piece of text, 10 times over
2-A: create an instance of the tagger once and tag a short piece of text 10 times
2-B: create an instance of the tagger once and tag a long piece of text 10 times
Before looking at the results, let's examine the main portion of the benchmark.rb file:
# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
  b.report("1-A: eng tagger") { 10.times { engtagger = EngTagger.new; engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} } }
  b.report("1-A: rb tagger") { 10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
  b.report("1-B: eng tagger") { 10.times { engtagger = EngTagger.new; engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} } }
  b.report("1-B: rb tagger") { 10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(LONG_TEXT) } }
end

# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
  b.report("2-A: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} } }
  b.report("2-A: rb tagger") { rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 2-B: load the tagger once, then process LONG text
Benchmark.bmbm do |b|
  b.report("2-B: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} } }
  b.report("2-B: rb tagger") { rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(LONG_TEXT) } }
end
You may notice the extra processing of the EngTagger output:
engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
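To make the conversion concrete, here is a small sketch that runs the same split/collect chain on a plain string, so it works without either gem installed. The sample words and tags are made up for illustration and are not real tagger output; to_pairs is a hypothetical helper name.

```ruby
# Turn a "word/TAG word/TAG ..." string into an array of [word, tag] pairs,
# using the same split/collect chain as the benchmark.
def to_pairs(readable)
  readable.split(' ').collect { |x| x.split('/') }
end

# Illustrative input in the word/TAG shape; not actual EngTagger output.
readable = "The/DET dog/NN barked/VBD"

p to_pairs(readable)
# => [["The", "DET"], ["dog", "NN"], ["barked", "VBD"]]
```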
The Results
The output of the benchmark script is below (and slightly tidied up).
Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger   9.340000   0.040000   9.380000 (  9.652550)
1-A: rb tagger   22.310000   2.180000  24.490000 ( 25.109431)
Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  11.880000   0.710000  12.590000 ( 12.940737)
1-B: rb tagger   23.330000   2.350000  25.680000 ( 26.337501)
Scenario 2-A: load the tagger once, then process SHORT text 10 times
                      user     system      total        real
2-A: eng tagger   0.600000   0.000000   0.600000 (  0.652037)
2-A: rb tagger    1.950000   0.240000   2.190000 (  2.252128)
Scenario 2-B: load the tagger once, then process LONG text 10 times
                      user     system      total        real
2-B: eng tagger   2.500000   0.260000   2.760000 (  2.840162)
2-B: rb tagger    2.710000   0.250000   2.960000 (  3.048174)
It appears that the EngTagger outperforms the RubyTagger in all tests, BUT what happens if we change the number of iterations to 50 (rather than 10)? After all, 10 times is rather small:
Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger  45.370000   0.310000  45.680000 ( 46.750664)
1-A: rb tagger  117.960000  11.450000 129.410000 (132.715563)
Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  57.110000   3.610000  60.720000 ( 62.123540)
1-B: rb tagger  121.770000  11.050000 132.820000 (136.239764)
Scenario 2-A: load the tagger once, then process SHORT text 50 times
                      user     system      total        real
2-A: eng tagger   0.790000   0.060000   0.850000 (  0.896051)
2-A: rb tagger    2.040000   0.230000   2.270000 (  2.340134)
Scenario 2-B: load the tagger once, then process LONG text 50 times
                      user     system      total        real
2-B: eng tagger   9.340000   0.850000  10.190000 ( 10.528600)
2-B: rb tagger    6.000000   0.470000   6.470000 (  6.636378)
With 50 iterations, the EngTagger outperforms RubyTagger on all but the last task. Interestingly, the last scenario is the one that most closely matches the real-world application being built.
Lessons Learned
- EngTagger and RubyTagger perform optimally under different conditions
- Benchmarks should mimic your application's usage as closely as possible
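One way to act on the second lesson is to make the iteration count and the load pattern parameters of the benchmark rather than constants. The sketch below is only an illustration: FakeTagger is a hypothetical stand-in with a trivial tag method, to be swapped for EngTagger.new or Brill::Tagger.new, and the ITERATIONS environment variable is an assumption of this example.

```ruby
require "benchmark"

# Hypothetical stand-in for a real tagger; it counts how often it is constructed.
class FakeTagger
  @instances = 0
  class << self
    attr_accessor :instances
  end

  def initialize
    self.class.instances += 1
  end

  # Tags every word as "NN"; a placeholder for real tagging work.
  def tag(text)
    text.split(" ").map { |word| [word, "NN"] }
  end
end

# Scenario 1 style: build a new tagger for every piece of text.
def bench_reload_each_time(iterations, text)
  Benchmark.realtime { iterations.times { FakeTagger.new.tag(text) } }
end

# Scenario 2 style: build the tagger once and reuse it.
def bench_load_once(iterations, text)
  Benchmark.realtime do
    tagger = FakeTagger.new
    iterations.times { tagger.tag(text) }
  end
end

iterations = Integer(ENV.fetch("ITERATIONS", "50"))
text = "a short piece of text"
puts format("reload each time: %.6fs", bench_reload_each_time(iterations, text))
puts format("load once:        %.6fs", bench_load_once(iterations, text))
```

Running it with iteration counts and text lengths close to what your application actually sees makes a crossover like the one in scenario 2-B easier to spot.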
Tuesday, May 25, 2010
Statistical Analysis using Ruby
Having migrated to Ruby from a Java background, I sometimes find myself longing for the vast and robust libraries that exist in the Java ecosystem. While preparing for a natural language processing task, I considered revisiting the Weka software that I had used in the past for machine learning and statistical analysis. But, it sure would be nice if something existed in Ruby for this sort of work. Enter Statsample.
While not a machine learning package, this statistics library utilizes the GNU Scientific Library to provide me with the two features upon which this task depended: multiple regression and the Pearson r correlation coefficient. Here are some examples of both at work:
require 'statsample'

# Calculate correlation coefficient
# (the definition of a was missing from the snippet; it mirrors b)
a=(1..100).collect { rand(100)}.to_scale
b=(1..100).collect { rand(100)}.to_scale
Statsample::Bivariate.pearson(a,b)

# Multiple Regression
a=1000.times.collect {rand}.to_scale
b=1000.times.collect {rand}.to_scale
c=1000.times.collect {rand}.to_scale
ds={'a'=>a,'b'=>b,'c'=>c}.to_dataset
ds['y']=ds.collect{|row| row['a']*5+row['b']*3+row['c']*2+rand()}
lr=Statsample::Regression.multiple(ds,'y')
puts lr.summary

Summary for regression of a,b,c over y
*************************************************************
Engine: Statsample::Regression::Multiple::AlglibEngine
Cases(listwise)=1000(1000)
r=0.986
r2=0.973
Equation=0.504+5.011a + 2.995b + 1.988c

Source: http://ruby-statsample.rubyforge.org/
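For readers who want to see what the Pearson r value actually measures, here is a plain-Ruby sketch of the textbook formula (covariance divided by the product of the standard deviations). It is not how Statsample computes it internally; pearson_r is a hypothetical helper name.

```ruby
# Pearson correlation coefficient of two equally sized numeric samples.
def pearson_r(xs, ys)
  raise ArgumentError, "samples must have the same, non-zero length" if xs.empty? || xs.size != ys.size

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # Sum of products of deviations, and sums of squared deviations.
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }

  cov / Math.sqrt(var_x * var_y)
end

puts pearson_r([1, 2, 3], [2, 4, 6])  # perfectly correlated: 1.0
puts pearson_r([1, 2, 3], [6, 4, 2])  # perfectly anti-correlated: -1.0
```

Two independent random samples, like a and b in the Statsample example, should give a value close to 0.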
Wednesday, March 17, 2010
Play Video Games and Save the World
I don't have as much time for video games as I once did, but it's nice to hear that video games will solve the world's problems. This video from TED.com points to additional emerging research that video games do not rot the mind as once thought:
http://www.ted.com/talks/jane_mcgonigal_gaming_can_make_a_better_world.html
Monday, March 8, 2010
Controlling Web-Bots with CurbIt
Whether you have a content-heavy site or a very application-centric website, bots and harvesters can wreak havoc by eating up CPU cycles, memory, and system resources. These ubiquitous pests will gladly retrieve all of your site's content with little regard for copyright laws or your terms of service.
One helpful tool for limiting many types of harmful bots and crawlers is a Ruby gem or plugin called CurbIt. CurbIt adds application level rate limiting to your Rails app. I recently had the pleasure of utilizing CurbIt on the Paper Rater website to limit the number of submissions. This helps us to ensure that humans are submitting documents, but without bothering our users with a CAPTCHA.
Example usage from the CurbIt github page:
class InvitesController < ApplicationController
  def invite
    # invite logic...
  end

  rate_limit :invite, :max_calls => 2, :time_limit => 30.seconds, :wait_time => 1.minute
end
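To illustrate what those three options mean, here is a rough plain-Ruby sketch of application-level rate limiting. It is not CurbIt's implementation: SimpleRateLimiter is a hypothetical class, it keeps its state in memory, and it assumes the reading that a client exceeding max_calls within time_limit seconds is turned away until wait_time seconds have passed.

```ruby
# Hypothetical in-memory rate limiter, for illustration only.
class SimpleRateLimiter
  def initialize(max_calls:, time_limit:, wait_time:)
    @max_calls = max_calls      # calls allowed per window
    @time_limit = time_limit    # window length in seconds
    @wait_time = wait_time      # lockout length in seconds
    @calls = Hash.new { |hash, key| hash[key] = [] }
    @blocked_until = {}
  end

  # Returns true if the call is allowed, false if the client is rate limited.
  def allow?(client_id, now = Time.now)
    blocked = @blocked_until[client_id]
    return false if blocked && now < blocked

    # Keep only the calls that still fall inside the current window.
    recent = @calls[client_id].select { |time| now - time < @time_limit }
    if recent.size >= @max_calls
      @blocked_until[client_id] = now + @wait_time
      @calls[client_id] = recent
      return false
    end

    recent << now
    @calls[client_id] = recent
    true
  end
end

limiter = SimpleRateLimiter.new(max_calls: 2, time_limit: 30, wait_time: 60)
start = Time.at(0)
puts limiter.allow?("203.0.113.5", start)      # true
puts limiter.allow?("203.0.113.5", start + 1)  # true
puts limiter.allow?("203.0.113.5", start + 2)  # false, third call within 30 seconds
```

In a Rails app the client identifier would typically be the remote IP address, and a shared store would be needed if more than one process serves requests.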
Thursday, February 18, 2010
Should You Move to the Cloud?
At Oasic we host a lot of websites and one of the most boring tasks involves setting up new servers, either for a new website or as part of a migration/expansion. It's one of my least favorite parts of the job. With many of the cloud offerings out there (Heroku, Google AppEngine, EngineYard, etc.), I've begun to dip my toes in the water. But why not just dive in?
Most Apps are NOT Designed for the Cloud
The concept of cloud hosting, along with its advantages and restrictions, is relatively new, and most application frameworks, libraries, processes, and tools are not developed with the cloud in mind. Heroku's filesystem is read-only for all but a couple of directories, so you have to work around that. Restrictions on background processes mean that you need to use their related add-on or come up with another solution. It begins to feel like deployment is a series of workarounds, and I'm not certain that's any better than setting up a new server.
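As an example of the kind of workaround involved, here is a minimal sketch of code that falls back to a temporary directory when its preferred location is not writable. The helper names writable_dir and save_upload and the public/uploads path are hypothetical, and the sketch assumes the platform leaves the system temp directory writable; files written there should be treated as short-lived.

```ruby
require "tmpdir"

# Return the preferred directory if it exists and is writable,
# otherwise fall back to the system temp directory.
def writable_dir(preferred)
  return preferred if File.directory?(preferred) && File.writable?(preferred)

  Dir.tmpdir
end

# Save contents under the first writable location and return the path used.
def save_upload(filename, contents, preferred_dir: "public/uploads")
  path = File.join(writable_dir(preferred_dir), File.basename(filename))
  File.write(path, contents)
  path
end

# Usage:
#   path = save_upload("report.txt", "hello")
#   puts "stored at #{path}"
```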
Middle Ground
VPS providers like Amazon EC2, Slicehost, and WebbyNode provide a middle ground between a full cloud solution and hosting on dedicated servers. Instances can be set up more easily, scaled up and down, and backed up, at nice prices and with good uptime. They also allow full access to the server: you can write to the filesystem, access databases directly, run background tasks, poke around on the server via SSH, create shell scripts, etc.
Further Reading
Moving to a full cloud stack requires a loss of flexibility that many businesses can't allow. Here is a good case study about GitHub's move out of the cloud:
http://github.com/blog/493-github-is-moving-to-rackspace
And, of course, there are many businesses that benefit from the advantages of moving to the cloud: inherent load balancing, fault tolerance, pay-per-usage pricing, ease of deployment, etc.
So, review your options, consider your needs, and make the best choice for your application's hosting.
Tuesday, February 16, 2010
Blogger Beats Typo
So I set out to create a blog for my website and spent some considerable time looking at possible options. Most of my development is done in Ruby these days, so I wanted to stick with a Ruby package if possible. These were my finalists:
- Radiant CMS: not a pure blogging solution, but a full-featured, extensible CMS with blogging modules. It would allow me to add new functionality to my programmer-heart's delight.
- Typo: the oldest, tried-and-true Ruby blogging system out there. This heavyweight is a no-nonsense blogging machine.
- Mephisto: a leaner blogging app than Typo, but still with nice features.
The winner was Typo -- or at least so it seemed. I used the typo gem to install my blog, added database permissions, and began configuring things. I edited the default Hello World post that was auto-created. I was on my way until I created my first new post. No visible errors appeared, but the posts were not being saved. What could be going on?
I often enjoy a good troubleshooting session, but not this time around. I had 3 blogs to set up (for various sites) and I didn't want to deal with these issues. I decided to try out Google's hosted Blogger and the rest is history!
Although there is no Markdown, Textile, or other markup support, it does include a WYSIWYG editor with full HTML tags. Here are some other benefits:
- use your own domain for free (or subdomain as I have done with this blog)
- fully customizable layout
- API for programmatic access
- 3 minute setup of a new blog
- Google handles everything for you (1 less component to manage)
- did I mention it was free?