Tuesday, June 15, 2010
Beware of YSlow and GZip
You can waste a lot of time trying to configure mod_deflate when it is already properly configured. Although I had previously enabled compression using this simple and powerful Apache module, YSlow was giving me an F for not compressing the page text or the JS and CSS files. To make a short story even shorter, I will simply say this:
YSlow does not accurately detect gzip compression. If you are uncertain, check the response headers for "Content-Encoding: gzip" and "Vary: Accept-Encoding", or use Port80's Compression Check.
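For anyone who prefers to check from a script rather than a browser plugin, here is a minimal Ruby sketch of the header check described above. The method names `compressed_response?` and `response_headers` are my own, not part of any library, and the sketch assumes the server signals compression with a standard Content-Encoding header.

```ruby
require 'net/http'
require 'uri'

# Returns true when the headers describe a gzip- or deflate-encoded body.
# `headers` is a Hash of header names to values; names are matched
# case-insensitively. (Hypothetical helper, not a library method.)
def compressed_response?(headers)
  normalized = headers.each_with_object({}) do |(name, value), acc|
    acc[name.to_s.downcase] = value.to_s.downcase
  end
  encoding = normalized.fetch('content-encoding', '')
  encoding.include?('gzip') || encoding.include?('deflate')
end

# Fetches a URL and returns its response headers as a Hash.
# Accept-Encoding is set by hand so that Net::HTTP does not transparently
# decode the body, which leaves the Content-Encoding header in place.
def response_headers(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri, 'Accept-Encoding' => 'gzip, deflate')
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  headers = {}
  response.each_header { |name, value| headers[name] = value }
  headers
end
```

Calling `compressed_response?(response_headers(url))` with the address of one of your own pages should return true if mod_deflate is doing its job, whatever YSlow reports.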
Tuesday, June 1, 2010
Ruby Part-of-Speech Tagger Shootout
An accurate and efficient part-of-speech (POS) tagger is a valuable tool in many areas of natural language processing. I use POS tagging as a means of detecting invalid text, but there are many other possible uses as well. However you are using a POS tagger, you may find this benchmark of two Ruby POS tagging libraries helpful.
The Players
There are several taggers available, but I settled on testing the following two that are available as gems and seemingly robust:
- EngTagger: a corpus-trained, probabilistic tagger (a port of Perl's Lingua::EN::Tagger)
- RubyTagger (rb-brill-tagger): a rule-based tagger
Other available taggers, not tested here:
- Mark Watson's Tagger
- YamCha (Yet Another Multipurpose CHunk Analyzer)
Each tagger was put through four scenarios:
- 1-A: 10 times, create an instance of the tagger and tag a short piece of text
- 1-B: 10 times, create an instance of the tagger and tag a long piece of text
- 2-A: create an instance of the tagger once and tag a short piece of text 10 times
- 2-B: create an instance of the tagger once and tag a long piece of text 10 times
Before looking at the results, let's examine the main portion of the benchmark.rb file:
# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
  b.report("1-A: eng tagger") do
    10.times do
      engtagger = EngTagger.new
      engtagger.get_readable(SHORT_TEXT).split(' ').collect { |x| x.split('/') }
    end
  end
  b.report("1-A: rb tagger") do
    10.times do
      rbtagger = Brill::Tagger.new
      rbtagger.tag(SHORT_TEXT)
    end
  end
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
  b.report("1-B: eng tagger") do
    10.times do
      engtagger = EngTagger.new
      engtagger.get_readable(LONG_TEXT).split(' ').collect { |x| x.split('/') }
    end
  end
  b.report("1-B: rb tagger") do
    10.times do
      rbtagger = Brill::Tagger.new
      rbtagger.tag(LONG_TEXT)
    end
  end
end

# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
  b.report("2-A: eng tagger") do
    engtagger = EngTagger.new
    10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect { |x| x.split('/') } }
  end
  b.report("2-A: rb tagger") do
    rbtagger = Brill::Tagger.new
    10.times { rbtagger.tag(SHORT_TEXT) }
  end
end

# Scenario 2-B: load the tagger once, then process LONG text
Benchmark.bmbm do |b|
  b.report("2-B: eng tagger") do
    engtagger = EngTagger.new
    10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect { |x| x.split('/') } }
  end
  b.report("2-B: rb tagger") do
    rbtagger = Brill::Tagger.new
    10.times { rbtagger.tag(LONG_TEXT) }
  end
end
You may notice the extra processing of the EngTagger output:
engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
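To see what that conversion does in isolation, here is a small sketch that runs without either gem. The sample string is made up to have the "word/TAG" shape that the split calls imply, it is not captured EngTagger output, and both method names are hypothetical.

```ruby
# Illustrative input in "word/TAG" form (an assumption, not real tagger output).
SAMPLE_READABLE = 'The/DET cat/NN sat/VBD'

# Same transformation as in the benchmark: split on spaces, then on slashes.
def readable_to_pairs(readable)
  readable.split(' ').collect { |token| token.split('/') }
end

# Variant that splits only on the LAST slash, so a token whose word itself
# contains a slash still comes out as a 2-item array.
def readable_to_pairs_safe(readable)
  readable.split(' ').map do |token|
    word, _slash, tag = token.rpartition('/')
    [word, tag]
  end
end

p readable_to_pairs(SAMPLE_READABLE)
# => [["The", "DET"], ["cat", "NN"], ["sat", "VBD"]]
```

The second variant is not what the benchmark uses. It is only worth considering if your text can contain slashes inside words, where the plain split would return more than two items for a token.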
The Results
The output of the benchmark script is below (and slightly tidied up).
Scenario 1-A: load the tagger each time before processing SHORT text
                       user     system      total         real
1-A: eng tagger    9.340000   0.040000   9.380000 (  9.652550)
1-A: rb tagger    22.310000   2.180000  24.490000 ( 25.109431)

Scenario 1-B: load the tagger each time before processing LONG text
                       user     system      total         real
1-B: eng tagger   11.880000   0.710000  12.590000 ( 12.940737)
1-B: rb tagger    23.330000   2.350000  25.680000 ( 26.337501)

Scenario 2-A: load the tagger once, then process SHORT text 10 times
                       user     system      total         real
2-A: eng tagger    0.600000   0.000000   0.600000 (  0.652037)
2-A: rb tagger     1.950000   0.240000   2.190000 (  2.252128)

Scenario 2-B: load the tagger once, then process LONG text 10 times
                       user     system      total         real
2-B: eng tagger    2.500000   0.260000   2.760000 (  2.840162)
2-B: rb tagger     2.710000   0.250000   2.960000 (  3.048174)

It appears that EngTagger outperforms RubyTagger in all tests, but what happens if we change the number of iterations from 10 to 50? After all, 10 iterations is a rather small sample:

Scenario 1-A: load the tagger each time before processing SHORT text
                       user     system      total         real
1-A: eng tagger   45.370000   0.310000  45.680000 ( 46.750664)
1-A: rb tagger   117.960000  11.450000 129.410000 (132.715563)

Scenario 1-B: load the tagger each time before processing LONG text
                       user     system      total         real
1-B: eng tagger   57.110000   3.610000  60.720000 ( 62.123540)
1-B: rb tagger   121.770000  11.050000 132.820000 (136.239764)

Scenario 2-A: load the tagger once, then process SHORT text 50 times
                       user     system      total         real
2-A: eng tagger    0.790000   0.060000   0.850000 (  0.896051)
2-A: rb tagger     2.040000   0.230000   2.270000 (  2.340134)

Scenario 2-B: load the tagger once, then process LONG text 50 times
                       user     system      total         real
2-B: eng tagger    9.340000   0.850000  10.190000 ( 10.528600)
2-B: rb tagger     6.000000   0.470000   6.470000 (  6.636378)

With 50 iterations, EngTagger outperforms RubyTagger on all but the last task. Interestingly, the last scenario is the one that most closely matches the real-world application being built.
Lessons Learned
- EngTagger and RubyTagger perform optimally under different conditions
- Benchmarks should mimic your application's usage as closely as possible
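One way to make the first lesson concrete is to split each tagger's time into a one-off load cost and a per-text cost. The sketch below does that for scenario 2-B using the "total" column from the results above. It assumes the total grows linearly with the number of iterations, which is my simplification, and two data points per tagger is very little to go on, so treat the numbers as rough. The names `fit_linear`, `RUNS`, `MODELS` and `BREAK_EVEN` are made up for this example.

```ruby
# "total" CPU seconds for scenario 2-B (LONG text, tagger loaded once),
# copied from the 10- and 50-iteration results in this post.
RUNS = {
  'eng tagger' => { 10 => 2.76, 50 => 10.19 },
  'rb tagger'  => { 10 => 2.96, 50 => 6.47 }
}.freeze

# Fit total = fixed + per_iteration * n through two measurements.
# Assumes a linear cost model, which the data can suggest but not prove.
def fit_linear(n1, t1, n2, t2)
  per_iteration = (t2 - t1) / (n2 - n1)
  { fixed: t1 - per_iteration * n1, per_iteration: per_iteration }
end

MODELS = RUNS.transform_values { |runs| fit_linear(10, runs[10], 50, runs[50]) }

# Number of iterations at which the two fitted lines cross.
BREAK_EVEN =
  (MODELS['rb tagger'][:fixed] - MODELS['eng tagger'][:fixed]) /
  (MODELS['eng tagger'][:per_iteration] - MODELS['rb tagger'][:per_iteration])

MODELS.each do |name, model|
  puts format('%-10s fixed %.2fs, per iteration %.3fs', name, model[:fixed], model[:per_iteration])
end
puts format('break-even at about %.0f iterations', BREAK_EVEN)
```

Under this model EngTagger has the smaller fixed cost (about 0.9 seconds against about 2.1) but is roughly twice as slow per long text, and the two lines cross at about 12 iterations. That fits the results: EngTagger is ahead at 10 iterations of long text and behind at 50.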