Tuesday, May 25, 2010

Statistical Analysis using Ruby

Having migrated to Ruby from a Java background, I sometimes find myself longing for the vast and robust libraries of the Java ecosystem. While preparing for a natural language processing task, I considered going back to Weka, which I had used in the past for machine learning and statistical analysis. But it sure would be nice if something existed in Ruby for this sort of work. Enter Statsample.


While not a machine learning package, this statistics library uses the GNU Scientific Library to provide the two features this task depended on: multiple regression and Pearson's r correlation coefficient. Here are some examples of both at work:

    require 'statsample'

    # Calculate correlation coefficient
    a=(1..100).collect { rand(100)}.to_scale
    b=(1..100).collect { rand(100)}.to_scale
    Statsample::Bivariate.pearson(a,b)

    # Multiple Regression
    a=1000.times.collect {rand}.to_scale
    b=1000.times.collect {rand}.to_scale
    c=1000.times.collect {rand}.to_scale
    ds={'a'=>a,'b'=>b,'c'=>c}.to_dataset
    ds['y']=ds.collect{|row| row['a']*5+row['b']*3+row['c']*2+rand()}
    lr=Statsample::Regression.multiple(ds,'y')
    puts lr.summary

The summary printed for one run (the exact numbers vary, since the data is random):

    Summary for regression of a,b,c over y
    *************************************************************
    Engine: Statsample::Regression::Multiple::AlglibEngine
    Cases(listwise)=1000(1000)
    r=0.986
    r2=0.973
    Equation=0.504+5.011a + 2.995b + 1.988c

Source: http://ruby-statsample.rubyforge.org/
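
For some intuition about the number that Statsample::Bivariate.pearson reports, here is a minimal sketch of Pearson's r in plain Ruby, with no gems. This is not Statsample's implementation. The method name pearson_r is made up for illustration, and the code assumes Ruby 2.4 or later for Array#sum.

```ruby
# Pearson's r: the covariance of x and y divided by the product of
# their standard deviations. The 1/n factors cancel, so sums of
# deviations from the means are enough.
# NOTE: pearson_r is a hypothetical helper, not part of Statsample.
def pearson_r(xs, ys)
  raise ArgumentError, 'inputs must have equal length' unless xs.size == ys.size
  raise ArgumentError, 'need at least two points' if xs.size < 2

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # Sum of cross-products and sums of squared deviations.
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  syy = ys.sum { |y| (y - mean_y)**2 }

  # Yields NaN when either input is constant (zero variance).
  sxy / Math.sqrt(sxx * syy)
end

xs = [1, 2, 3, 4, 5]
puts pearson_r(xs, xs.map { |x| 2 * x + 1 })  # perfectly linear, increasing
puts pearson_r(xs, xs.map { |x| -3 * x })     # perfectly linear, decreasing
```

A perfectly linear increasing relationship gives r = 1 and a perfectly linear decreasing one gives r = -1. Two independent random vectors, like a and b in the correlation example above, should land somewhere near 0.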