## Tuesday, May 25, 2010

### Statistical Analysis using Ruby

Having migrated to Ruby from a Java background, I sometimes find myself longing for the vast and robust libraries that exist in the Java ecosystem. While preparing for a natural language processing task, I considered revisiting Weka, the software I had used in the past for machine learning and statistical analysis. But it sure would be nice if something existed in Ruby for this sort of work. Enter Statsample.

While not a machine learning package, this statistics library can use the GNU Scientific Library, and it provides the two features this task depended on: multiple regression and Pearson's r correlation coefficient. Here are some examples of both at work:

```
require 'statsample'

# Calculate correlation coefficient
a=(1..100).collect { rand(100)}.to_scale
b=(1..100).collect { rand(100)}.to_scale
Statsample::Bivariate.pearson(a,b)  # returns r, a value between -1 and 1

# Multiple Regression
a=1000.times.collect {rand}.to_scale
b=1000.times.collect {rand}.to_scale
c=1000.times.collect {rand}.to_scale
ds={'a'=>a,'b'=>b,'c'=>c}.to_dataset
# y is a known linear combination of a, b and c plus uniform noise
ds['y']=ds.collect{|row| row['a']*5+row['b']*3+row['c']*2+rand()}
lr=Statsample::Regression.multiple(ds,'y')
puts lr.summary

# Output (exact numbers vary from run to run because the data is random):
#
# Summary for regression of a,b,c over y
# *************************************************************
# Engine: Statsample::Regression::Multiple::AlglibEngine
# Cases(listwise)=1000(1000)
# r=0.986
# r2=0.973
# Equation=0.504+5.011a + 2.995b + 1.988c
```
Source: http://ruby-statsample.rubyforge.org/
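
For anyone curious what the Pearson r number actually measures, here is a minimal plain-Ruby sketch of the textbook formula (the sum of cross-products of deviations divided by the square root of the product of the summed squared deviations). The method name `pearson_r` and the sample arrays are my own illustration, not part of Statsample, and this is not meant to reflect how the library computes it internally. It assumes Ruby 2.4 or later for `Array#sum`.

```ruby
# Hypothetical helper for illustration only; not part of Statsample.
# Computes Pearson's r for two equally sized numeric arrays.
def pearson_r(xs, ys)
  raise ArgumentError, "samples must have the same size" unless xs.size == ys.size

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  # Sum of cross-products of deviations from the means
  cross = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  # Sums of squared deviations for each sample
  ss_x = xs.sum { |x| (x - mean_x)**2 }
  ss_y = ys.sum { |y| (y - mean_y)**2 }

  cross / Math.sqrt(ss_x * ss_y)
end

# Made-up data: ys is exactly 2 * xs, so the correlation should be perfect.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
puts pearson_r(xs, ys)
```

A value near 1 means the two samples rise together, a value near -1 means one falls as the other rises, and a value near 0 (what you would expect from the two random vectors in the Statsample example) means no linear relationship.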