The Players
There are several taggers available, but I settled on testing two that are available as gems and seem robust:
- EngTagger: a corpus-trained, probabilistic tagger (port of Perl Lingua::EN::Tagger)
- RubyTagger (rb-brill-tagger): a rule-based tagger
Other available taggers, not benchmarked here:
- Mark Watson's Tagger
- YamCha - Yet Another Multipurpose CHunk Analyzer
Each tagger was run through four scenarios:
1-A: create a new instance of the tagger and tag a short piece of text, 10 times over
1-B: create a new instance of the tagger and tag a long piece of text, 10 times over
2-A: create an instance of the tagger once and tag a short piece of text 10 times
2-B: create an instance of the tagger once and tag a long piece of text 10 times
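The difference between the "load every time" and "load once" scenarios can be tried out without installing either gem. The snippet below is only a minimal sketch, not the benchmark used in this post: FakeTagger is a hypothetical stand-in whose constructor is made artificially expensive, and the text and iteration count are placeholders. It uses Benchmark.bm rather than bmbm so that each block runs exactly once.

```ruby
require 'benchmark'

# Hypothetical stand-in for a real tagger: building the lexicon in the
# constructor imitates the cost of loading a model or rule set.
class FakeTagger
  class << self
    attr_accessor :loads # counts how many instances were created
  end
  self.loads = 0

  def initialize
    self.class.loads += 1
    @lexicon = (1..20_000).map { |i| ["word#{i}", "NN"] }.to_h
  end

  # Returns an array of [word, tag] pairs; unknown words default to "NN".
  def tag(text)
    text.split(' ').map { |word| [word, @lexicon.fetch(word, "NN")] }
  end
end

SHORT_TEXT = "the quick brown fox jumps over the lazy dog" # placeholder text
ITERATIONS = 10

results = Benchmark.bm(20) do |b|
  # Scenario 1 style: pay the load cost on every iteration.
  b.report("load every time") do
    ITERATIONS.times { FakeTagger.new.tag(SHORT_TEXT) }
  end

  # Scenario 2 style: pay the load cost once, then only tag.
  b.report("load once") do
    tagger = FakeTagger.new
    ITERATIONS.times { tagger.tag(SHORT_TEXT) }
  end
end

puts "tagger instances created: #{FakeTagger.loads}"
```

With a stand-in like this, the first report creates one instance per iteration and the second creates a single instance, which is the only difference the scenario 1 versus scenario 2 split is meant to isolate.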
Before looking at the results, let's examine the main portion of the benchmark.rb file:
# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
  b.report("1-A: eng tagger") do
    10.times do
      engtagger = EngTagger.new
      engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
    end
  end
  b.report("1-A: rb tagger") do
    10.times do
      rbtagger = Brill::Tagger.new
      rbtagger.tag(SHORT_TEXT)
    end
  end
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
  b.report("1-B: eng tagger") do
    10.times do
      engtagger = EngTagger.new
      engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')}
    end
  end
  b.report("1-B: rb tagger") do
    10.times do
      rbtagger = Brill::Tagger.new
      rbtagger.tag(LONG_TEXT)
    end
  end
end

# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
  b.report("2-A: eng tagger") do
    engtagger = EngTagger.new
    10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }
  end
  b.report("2-A: rb tagger") do
    rbtagger = Brill::Tagger.new
    10.times { rbtagger.tag(SHORT_TEXT) }
  end
end

# Scenario 2-B: load the tagger once, then process LONG text
Benchmark.bmbm do |b|
  b.report("2-B: eng tagger") do
    engtagger = EngTagger.new
    10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }
  end
  b.report("2-B: rb tagger") do
    rbtagger = Brill::Tagger.new
    10.times { rbtagger.tag(LONG_TEXT) }
  end
end
You may notice the extra processing of the EngTagger output:
engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
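To make the conversion concrete, here is a small sketch of what that chain does to a string in word/TAG form. The sample string and its tags are made up for illustration and are not real EngTagger output. The to_pairs helper is a hypothetical variant that splits on the last slash only; whether a tagger ever emits a word containing a slash is an assumption here, not something tested in this post.

```ruby
# Made-up sample in word/TAG form (not actual EngTagger output).
readable = "The/DET quick/JJ fox/NN jumps/VBZ"

# The same chain as in the benchmark: split into tokens, then split each
# token into a [word, tag] pair.
pairs = readable.split(' ').collect { |x| x.split('/') }
p pairs

# Hypothetical variant: split on the LAST slash only, so a word that itself
# contains a slash still yields a 2-item array.
def to_pairs(readable)
  readable.split(' ').map { |token| token.rpartition('/').values_at(0, 2) }
end

p to_pairs("either/or/CC")
```

For the sample string, both approaches produce the same array of 2-item arrays; they differ only for tokens with more than one slash, where the plain split returns three or more items.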
The Results
The output of the benchmark script is below (slightly tidied up).
With 10 iterations, EngTagger outperforms RubyTagger in every scenario. But 10 iterations is a rather small sample, so what happens if we change the number of iterations to 50? The 10-iteration results come first:
Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total         real
1-A: eng tagger   9.340000   0.040000   9.380000 (  9.652550)
1-A: rb tagger   22.310000   2.180000  24.490000 ( 25.109431)
Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total         real
1-B: eng tagger  11.880000   0.710000  12.590000 ( 12.940737)
1-B: rb tagger   23.330000   2.350000  25.680000 ( 26.337501)
Scenario 2-A: load the tagger once, then process SHORT text 10 times
                      user     system      total         real
2-A: eng tagger   0.600000   0.000000   0.600000 (  0.652037)
2-A: rb tagger    1.950000   0.240000   2.190000 (  2.252128)
Scenario 2-B: load the tagger once, then process LONG text 10 times
                      user     system      total         real
2-B: eng tagger   2.500000   0.260000   2.760000 (  2.840162)
2-B: rb tagger    2.710000   0.250000   2.960000 (  3.048174)
With 50 iterations, the EngTagger outperforms RubyTagger on all but the last task. Interestingly, the last scenario is the one that most closely matches the real-world application being built.
Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total         real
1-A: eng tagger  45.370000   0.310000  45.680000 ( 46.750664)
1-A: rb tagger  117.960000  11.450000 129.410000 (132.715563)
Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total         real
1-B: eng tagger  57.110000   3.610000  60.720000 ( 62.123540)
1-B: rb tagger  121.770000  11.050000 132.820000 (136.239764)
Scenario 2-A: load the tagger once, then process SHORT text 50 times
                      user     system      total         real
2-A: eng tagger   0.790000   0.060000   0.850000 (  0.896051)
2-A: rb tagger    2.040000   0.230000   2.270000 (  2.340134)
Scenario 2-B: load the tagger once, then process LONG text 50 times
                      user     system      total         real
2-B: eng tagger   9.340000   0.850000  10.190000 ( 10.528600)
2-B: rb tagger    6.000000   0.470000   6.470000 (  6.636378)
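One way to read the two 2-B results together is to assume a simple linear cost model, total = load + passes * per_pass, and solve it from the 10-pass and 50-pass totals. The model is an assumption made for this sketch, not something the benchmark measures directly, and each tagger contributes only two data points from separate runs, so the numbers are rough estimates at best.

```ruby
# "total" column (CPU seconds) of the 2-B rows in the two result sets above,
# keyed by the number of tagging passes.
TOTALS = {
  "eng tagger" => { 10 => 2.76, 50 => 10.19 },
  "rb tagger"  => { 10 => 2.96, 50 => 6.47 }
}

# Assumed model (not from the benchmark itself): total = load + n * per_pass.
# Two data points are enough to solve for both unknowns.
def fit(totals)
  per_pass = (totals[50] - totals[10]) / (50 - 10)
  load     = totals[10] - 10 * per_pass
  { load: load, per_pass: per_pass }
end

eng = fit(TOTALS["eng tagger"])
rb  = fit(TOTALS["rb tagger"])

# Number of passes at which both models predict the same total time.
break_even = (rb[:load] - eng[:load]) / (eng[:per_pass] - rb[:per_pass])

puts format("eng tagger: load %.2fs, per pass %.3fs", eng[:load], eng[:per_pass])
puts format("rb tagger:  load %.2fs, per pass %.3fs", rb[:load], rb[:per_pass])
puts format("estimated break-even: %.1f passes", break_even)
```

Under this model, EngTagger has the cheaper load but the more expensive pass over long text, and the two lines cross at roughly 12 passes. That is consistent with EngTagger winning 2-B at 10 iterations and losing it at 50.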
Lessons Learned
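- EngTagger and RubyTagger perform best under different conditions: EngTagger was faster in every scenario except 2-B at 50 iterations (load once, long text), where RubyTagger pulled ahead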
- Benchmarks should mimic your application's usage as closely as possible
How long is "long" (and conversely, "short")?
Sorry, didn't see the comment. The short text was about 50 words, while the long text was roughly 300-500 words in length.