Oasic: Ruby Part-of-Speech Tagger Shootout

An accurate and efficient part-of-speech (POS) tagger is a valuable tool in many areas of natural language processing. I use POS tagging to detect invalid text, but there are many other uses. However you use a POS tagger, you may find this benchmark of two Ruby POS tagging libraries helpful.

The Players
There are several taggers available, but I settled on testing the following two, which are available as gems and seem robust:

EngTagger: a corpus-trained, probabilistic tagger (port of Perl Lingua::EN::Tagger)
RubyTagger (rb-brill-tagger): a rule-based tagger


Some of the taggers I excluded that you may want to consider:

Mark Watson's Tagger
YamCha - Yet Another Multipurpose CHunk Analyzer

Both EngTagger and RubyTagger provide a simple API and were easy to install as gems. (Note: I believe the RubyTagger gem has a C dependency.) To get a fairly accurate picture of performance, I tested the gems under four scenarios:

1-A: 10 times create an instance of the tagger and tag a short piece of text
1-B: 10 times create an instance of the tagger and tag a long piece of text
2-A: create an instance of the tagger once and tag a short piece of text 10 times
2-B: create an instance of the tagger once and tag a long piece of text 10 times

Before looking at the results, let's examine the main portion of the benchmark.rb file:

# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
   b.report("1-A: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-A: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
   b.report("1-B: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-B: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(LONG_TEXT) } }
end


# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
   b.report("2-A: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-A: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 2-B: load the tagger once, then process LONG text
Benchmark.bmbm do |b|
   b.report("2-B: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-B: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(LONG_TEXT) } }
end

You may notice the extra processing of the EngTagger output:
engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
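To make that conversion concrete, here is a minimal sketch that runs without either gem. The sample string is a hypothetical stand-in for the space-separated "word/TAG" format the chain expects, not real tagger output, and the tag names are only illustrative.

```ruby
# Hypothetical sample in "word/TAG" form; an assumption for illustration,
# not output captured from EngTagger.
readable = "The/DET cat/NN sat/VBD"

# Split on spaces to get one token per word, then split each token
# on "/" to get a [word, tag] pair.
pairs = readable.split(' ').collect { |token| token.split('/') }

p pairs
# => [["The", "DET"], ["cat", "NN"], ["sat", "VBD"]]
```

One caveat with this approach: a token whose word itself contains a slash would split into more than two items, so the pairs are only guaranteed to be 2-item arrays for slash-free words.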

The Results
The output of the benchmark script is below (slightly tidied up).

Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger   9.340000   0.040000   9.380000 (  9.652550)
1-A: rb tagger   22.310000   2.180000  24.490000 ( 25.109431)

Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  11.880000   0.710000  12.590000 ( 12.940737)
1-B: rb tagger   23.330000   2.350000  25.680000 ( 26.337501)

Scenario 2-A: load the tagger once, then process SHORT text 10 times
                      user     system      total        real
2-A: eng tagger   0.600000   0.000000   0.600000 (  0.652037)
2-A: rb tagger    1.950000   0.240000   2.190000 (  2.252128)

Scenario 2-B: load the tagger once, then process LONG text 10 times
                      user     system      total        real
2-B: eng tagger   2.500000   0.260000   2.760000 (  2.840162)
2-B: rb tagger    2.710000   0.250000   2.960000 (  3.048174)

EngTagger appears to outperform RubyTagger in every test. But what happens if we raise the number of iterations from 10 to 50? After all, 10 is a rather small sample:

Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger  45.370000   0.310000  45.680000 ( 46.750664)
1-A: rb tagger  117.960000  11.450000 129.410000 (132.715563)

Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  57.110000   3.610000  60.720000 ( 62.123540)
1-B: rb tagger  121.770000  11.050000 132.820000 (136.239764)

Scenario 2-A: load the tagger once, then process SHORT text 50 times
                      user     system      total        real
2-A: eng tagger   0.790000   0.060000   0.850000 (  0.896051)
2-A: rb tagger    2.040000   0.230000   2.270000 (  2.340134)

Scenario 2-B: load the tagger once, then process LONG text 50 times
                      user     system      total        real
2-B: eng tagger   9.340000   0.850000  10.190000 ( 10.528600)
2-B: rb tagger    6.000000   0.470000   6.470000 (  6.636378)

With 50 iterations, the EngTagger outperforms RubyTagger on all but the last task. Interestingly, the last scenario is the one that most closely matches the real-world application being built.
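One way to see why the ranking flips in scenario 2-B is to fit a simple model to the two runs. The sketch below assumes that the "total" time is a one-off load cost plus a constant cost per tagging call. That linear model is my assumption, not something the benchmark measures directly, and with only two data points per tagger the fit is rough.

```ruby
# "total" column for scenario 2-B, copied from the 10- and 50-iteration runs above.
totals = {
  "eng tagger" => { 10 => 2.76, 50 => 10.19 },
  "rb tagger"  => { 10 => 2.96, 50 => 6.47 }
}

# Assumed model: total = load_cost + iterations * per_call_cost.
# Two runs give two equations, which is enough to solve for both unknowns.
def fit(runs)
  per_call = (runs[50] - runs[10]) / (50 - 10)
  load = runs[10] - 10 * per_call
  { load: load, per_call: per_call }
end

eng = fit(totals["eng tagger"])
rb  = fit(totals["rb tagger"])

# Iteration count at which the two modelled totals are equal.
break_even = (rb[:load] - eng[:load]) / (eng[:per_call] - rb[:per_call])

puts format("eng tagger: load %.2fs, per call %.3fs", eng[:load], eng[:per_call])
puts format("rb tagger:  load %.2fs, per call %.3fs", rb[:load], rb[:per_call])
puts format("break-even at about %.0f iterations", break_even)
```

Under this model, EngTagger loads faster but spends about twice as long per long-text call, so RubyTagger pulls ahead after roughly 12 calls on a single instance. Treat these figures as estimates only: they come from one machine and two runs.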

Lessons Learned

EngTagger and RubyTagger perform optimally under different conditions
Benchmarks should mimic your application's usage as closely as possible
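As a sketch of the second lesson, the helper below makes the two knobs that mattered here explicit: how many times you tag, and whether the tagger instance is reused. StubTagger and time_tagging are hypothetical names introduced for illustration. The stub only exists so the example runs without either gem; in practice you would pass a factory that builds the real tagger and call whichever tagging method your application uses.

```ruby
require 'benchmark'

# Hypothetical stand-in tagger so the sketch runs without any gems installed.
class StubTagger
  def tag(text)
    text.split(' ').collect { |word| [word, 'NN'] }
  end
end

# Returns elapsed wall-clock seconds for `iterations` tagging calls.
# With reuse: true the tagger is built once; otherwise it is rebuilt per call.
def time_tagging(factory, text, iterations:, reuse:)
  Benchmark.realtime do
    shared = factory.call if reuse
    iterations.times do
      tagger = reuse ? shared : factory.call
      tagger.tag(text)
    end
  end
end

builds = 0
factory = -> { builds += 1; StubTagger.new }
text = "a short piece of text"

reused = time_tagging(factory, text, iterations: 50, reuse: true)
fresh  = time_tagging(factory, text, iterations: 50, reuse: false)

puts format("reused instance: %.6fs", reused)
puts format("fresh instances: %.6fs", fresh)
puts "taggers built: #{builds}"
```

Setting iterations and reuse to match how your application actually calls the tagger, and using representative text, should give numbers that are closer to what you will see in production than a fixed 10-iteration run.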

Tuesday, June 1, 2010
