Tuesday, June 1, 2010

Ruby Part-of-Speech Tagger Shootout

An accurate and efficient part-of-speech (POS) tagger is a valuable tool in many areas of natural language processing.  I use POS tagging to detect invalid text, but there are many other possible uses.  However you use a POS tagger, you may find this benchmark of two Ruby POS tagging libraries helpful.



The Players
There are several taggers available, but I settled on testing the following two that are available as gems and seemingly robust:
  1. EngTagger: a corpus-trained, probabilistic tagger (port of Perl Lingua::EN::Tagger)
  2. RubyTagger (rb-brill-tagger): a rule-based tagger
There are other taggers I excluded that you may be interested in considering.
Both EngTagger and RubyTagger provide a simple API and can be installed easily as gems.  (NOTE: I believe the RubyTagger gem has a C dependency.)  To get a fairly accurate idea of performance, I tested the gems under four scenarios:

1-A: create a new instance of the tagger and tag a short piece of text, repeated 10 times
1-B: create a new instance of the tagger and tag a long piece of text, repeated 10 times
2-A: create an instance of the tagger once, then tag a short piece of text 10 times
2-B: create an instance of the tagger once, then tag a long piece of text 10 times

Before looking at the results, let's examine the main portion of the benchmark.rb file:
# Scenario 1-A: load the tagger each time before processing text; use SHORT text for tagging
Benchmark.bmbm do |b|
   b.report("1-A: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-A: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 1-B: load the tagger each time before processing text; use LONG text for tagging
Benchmark.bmbm do |b|
   b.report("1-B: eng tagger") {10.times { engtagger = EngTagger.new; engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("1-B: rb tagger") {10.times { rbtagger = Brill::Tagger.new; rbtagger.tag(LONG_TEXT) } }
end


# Scenario 2-A: load the tagger once, then process SHORT text
Benchmark.bmbm do |b|
   b.report("2-A: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-A: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(SHORT_TEXT) } }
end

# Scenario 2-B: load the tagger once, then process LONG text
Benchmark.bmbm do |b|
   b.report("2-B: eng tagger") { engtagger = EngTagger.new; 10.times { engtagger.get_readable(LONG_TEXT).split(' ').collect{|x| x.split('/')} }}
   b.report("2-B: rb tagger") {rbtagger = Brill::Tagger.new; 10.times { rbtagger.tag(LONG_TEXT) } }
end

You may notice the extra processing of the EngTagger output:
    engtagger.get_readable(SHORT_TEXT).split(' ').collect{|x| x.split('/')}
This is done so that the output matches that of the RubyTagger, which is an array of 2-item arrays.
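
To make that conversion concrete, here is a small sketch in plain Ruby that needs neither gem. The sample string is invented for illustration (it is not real EngTagger output), and `to_pairs` is a hypothetical helper name that simply wraps the same expression used in the benchmark.

```ruby
# Hypothetical helper: turns a "word/TAG word/TAG ..." string into an
# array of [word, tag] pairs, using the same expression as the benchmark.
def to_pairs(readable)
  readable.split(' ').collect { |x| x.split('/') }
end

# Made-up sample in the word/TAG form; the tags are illustrative only.
sample = "The/DET cat/NN sat/VBD"

p to_pairs(sample)
# => [["The", "DET"], ["cat", "NN"], ["sat", "VBD"]]
```

One thing to watch for: a token that itself contains a slash would split into more than two items, so it is worth checking the result on your own text.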

The Results
The output of the benchmark script is below (slightly tidied up).


Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger   9.340000   0.040000   9.380000 (  9.652550)
1-A: rb tagger   22.310000   2.180000  24.490000 ( 25.109431)


Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  11.880000   0.710000  12.590000 ( 12.940737)
1-B: rb tagger   23.330000   2.350000  25.680000 ( 26.337501)

Scenario 2-A: load the tagger once, then process SHORT text 10 times
                      user     system      total        real
2-A: eng tagger   0.600000   0.000000   0.600000 (  0.652037)
2-A: rb tagger    1.950000   0.240000   2.190000 (  2.252128)


Scenario 2-B: load the tagger once, then process LONG text 10 times
                      user     system      total        real
2-B: eng tagger   2.500000   0.260000   2.760000 (  2.840162)
2-B: rb tagger    2.710000   0.250000   2.960000 (  3.048174)

It appears that EngTagger outperforms RubyTagger in every test.  But what happens if we raise the number of iterations from 10 to 50?  After all, 10 iterations is a rather small sample:

Scenario 1-A: load the tagger each time before processing SHORT text
                      user     system      total        real
1-A: eng tagger  45.370000   0.310000  45.680000 ( 46.750664)
1-A: rb tagger  117.960000  11.450000 129.410000 (132.715563)


Scenario 1-B: load the tagger each time before processing LONG text
                      user     system      total        real
1-B: eng tagger  57.110000   3.610000  60.720000 ( 62.123540)
1-B: rb tagger  121.770000  11.050000 132.820000 (136.239764)


Scenario 2-A: load the tagger once, then process SHORT text 50 times
                      user     system      total        real
2-A: eng tagger   0.790000   0.060000   0.850000 (  0.896051)
2-A: rb tagger    2.040000   0.230000   2.270000 (  2.340134)


Scenario 2-B: load the tagger once, then process LONG text 50 times
                      user     system      total        real
2-B: eng tagger   9.340000   0.850000  10.190000 ( 10.528600)
2-B: rb tagger    6.000000   0.470000   6.470000 (  6.636378)

With 50 iterations, EngTagger outperforms RubyTagger in all but the last scenario.  Interestingly, that last scenario is the one that most closely matches the real-world application being built.
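
One rough way to see why scenario 2-B flips is to split each tagger's time into a one-off setup cost and a per-call cost. The sketch below fits a straight line through the two "total" figures reported for 2-B (10 and 50 iterations). The linear model is my assumption, two data points cannot account for measurement noise, and the names `TOTALS_2B`, `linear_fit`, `FITS`, and `break_even` are hypothetical, so treat the output as a ballpark estimate only.

```ruby
# "total" column for scenario 2-B, taken from the tables above:
# { iterations => total seconds }
TOTALS_2B = {
  "eng tagger" => { 10 => 2.76, 50 => 10.19 },
  "rb tagger"  => { 10 => 2.96, 50 => 6.47 }
}

# Assumed model: total = setup + per_call * iterations.
# Fits that line exactly through the two measured points.
def linear_fit(points)
  (n1, t1), (n2, t2) = points.sort
  per_call = (t2 - t1) / (n2 - n1)
  { per_call: per_call, setup: t1 - per_call * n1 }
end

FITS = TOTALS_2B.transform_values { |points| linear_fit(points) }

# Iteration count at which the two fitted lines cross.
def break_even(a, b)
  (b[:setup] - a[:setup]) / (a[:per_call] - b[:per_call])
end

FITS.each do |name, fit|
  puts format("%-10s setup ~ %.2fs, per call ~ %.3fs", name, fit[:setup], fit[:per_call])
end
puts format("break-even at roughly %.0f iterations",
            break_even(FITS["eng tagger"], FITS["rb tagger"]))
```

Under these assumptions RubyTagger pays more up front but less per long-text call, which would explain why it falls behind at 10 iterations and pulls ahead at 50.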

Lessons Learned
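  1. EngTagger and RubyTagger each perform best under different conditions: EngTagger was faster whenever the tagger was reloaded for each run or the text was short, while RubyTagger pulled ahead on long text once it was loaded once and reused 50 times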
  2. Benchmarks should mimic your application's usage as closely as possible

2 comments:

  1. How long is "long" (and conversely, "short")?

  2. Sorry, didn't see the comment. The short text was about 50 words, while the long text was roughly 300-500 words in length.
