Monday, December 23, 2013

How to Mislead the Public on MOOCs

An amazing thing happens when you free information from the fetters of expensive institutions where only the world's elite can access it.  People begin to learn things that they would never have the opportunity to learn otherwise.  'Student' begins to refer to more than just an 18-22 year old working towards a degree.  Knowledge and skills are acquired in one's free time, and positive ripples flow outward.  Or, at least, that's how many view it.  But recently I've seen a lot of negative press surrounding the efforts of Massive Open Online Course (MOOC) providers like Udacity, edX, and Coursera.  I'm referring to articles like this that point to the high dropout rate in MOOC courses, or to their demographics, which skew heavily toward educated citizens of industrialized nations.  So who is right?  Are MOOCs heaven sent, or devils in disguise?

Missing the Point
While it is true that high completion rates are better than low ones, the completion rate is far from the whole story.  If we look instead at the absolute number of students who complete a course, the picture changes.  I enjoyed participating in the first AI MOOC offered by Thrun and Norvig, which attracted over 100K participants.  Of those, only a little over 20K (if I recall correctly) completed the course.  If we looked only at percentages, it would be easy to call that a failure.  But look again: over 20,000 people completed a full-length introductory course on AI, including challenging homework and tests!  Those 20K+ people were able to take that knowledge with them and begin applying it to problems around them.

And what of those who didn't?  An exit poll would best answer this question, but we can speculate.  Many likely realized that they didn't have the time.  Others realized that the coursework was beyond their current level and that more prerequisites were needed.  Still others may have joined with no intention of completing the course.  They may have joined simply to audit the lectures, or perhaps to download the lectures to their hard drive for later, when they do have time.  There are many reasons not to complete a course, but the beauty of a well-designed MOOC is that the number of dropouts does not change the experience for those who do complete it.  In other words, what's the big deal?

Finding a Lost World
One of the missions of the founders of MOOCs is to educate many of the marginalized across the world who would never have access to this information.  This goal may not be completely realized yet, but this does not mean failure either.  Columbus did not fail when his hope of finding a new route to India did not pan out.  Instead, he encountered a lost world and changed history forever.

I speak from experience when I say that MOOCs are changing the lives of people by opening doors and expanding their minds.  Perhaps already-educated suburbanites are not the target market of Coursera founders, but reaching the untapped potential of housewives, retirees, students who got the wrong degree, and people with demanding jobs (or kids) may be equally exciting.  And, I believe in time, the goal of reaching the poor around the globe will also see its day. 

We can all agree that improvements should be made to increase accessibility to the opportunities provided by MOOCs and that more work needs to be done to create greater engagement in these courses.  But, let's not exaggerate these issues to the point of labeling these courses as failures.  Thousands of people around the world have and continue to benefit tremendously from these courses.  The light that a single course switches on in someone's mind may illuminate the world around them for years to come.

Monday, November 18, 2013

Search API Notes

Scenario: You've got a great idea that requires indexing and/or search capabilities well beyond your budget.  Where do you go from here?

Thankfully, you have a few options to choose from when deciding how to power your new app.  Sadly, you have ONLY a few options to choose from.  Indexing and searching the Internet is a monstrous task, which is why this industry is a natural fit for the oligopoly we see today.  There are three players in this market that all offer Search APIs, but as of this writing, their products differ considerably.

Yahoo BOSS -
If you are looking for something inexpensive, then this is it.  They offer a 'limitedweb' search that is slightly smaller and not as fresh as their main index, but it's only $0.40/1000 queries, which is half the price of their 'web' offering.  Other than the cost savings, this service stinks.  Do not use it unless your application allows for a large margin of error and cost is your most important requirement.  I've found three common types of problems:
- False positives: returning results that do not contain the query.  It doesn't matter whether you are using an exact phrase search, boolean operators, etc.  Regardless, you will get false positives from time to time.
- False negatives: matching results that are in Yahoo's index sometimes fail to be returned
- Sporadic errors: the errors mentioned above, as well as other outages, occur frequently and randomly.  Developing against this API was very frustrating because it does not return consistent results -- the same query may return no results one minute, then many results a minute later.

Google -

On the other end of the spectrum is the dominant search giant.  Their API is high-quality and VERY expensive ($5/1000 queries).  Note that this is more than 10X the cost of Yahoo's limitedweb queries.  Nevertheless, Google's results are consistent and of the quality you would expect.
Disadvantages: Besides price, Google's API results often do not match their public search results.  If you have a high-volume app, the rate limits may be a deal-breaker (they were for us).

Microsoft Bing -
I'm rarely a fan of anything Microsoft produces, but they are the winner in my evaluation of web search APIs.  They offer just the right mix of consistency, price, and performance, without the restrictions of Google.  They offer unlimited searches at roughly $1.25/1000 queries -- a quarter of Google's price, but still about 3X Yahoo's limitedweb rate.  For mission-critical apps that can't afford the problems of BOSS, Bing is probably the best choice.  Be sure to use the "Web Only" API if you only need web search, as it is cheaper than their composite search offering.
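To make the price gap concrete, here is a quick back-of-the-envelope sketch in Ruby using the per-query prices above.  The 2M queries/month volume is a hypothetical figure of my own, not from any provider:

```ruby
# Monthly cost comparison for the per-1,000-query prices mentioned above.
PRICES = {
  'Yahoo BOSS (limitedweb)' => 0.40,
  'Bing (Web Only)'         => 1.25,
  'Google'                  => 5.00
}

# Estimate the monthly bill for a given query volume
def monthly_cost(price_per_1000, queries_per_month)
  (queries_per_month / 1000.0) * price_per_1000
end

queries = 2_000_000  # hypothetical app doing 2M queries/month
PRICES.each do |provider, price|
  printf("%-25s $%.2f/month\n", provider, monthly_cost(price, queries))
end
```

At that volume, the difference between BOSS and Google is $800 vs. $10,000 per month, which is why the quality/price trade-off matters so much here.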

Wednesday, October 23, 2013

The Hidden Abundance of IT Professionals

I'm not quite sure why the headlines I've been reading irk me so much.  Is it the fact that the conclusions drawn defy basic economics, or the realization that ignorance is being spread quickly?  I've seen more than a few articles over the past few months alleging a shortage of computer security specialists, data analysts, and software developers in general.  It's the same idea that circulates from time to time in many tech fields as a means of justifying H-1B visas, but is it real?

To say there is a shortage of computer specialists, cyber warriors, or machine learning experts is like saying there is a shortage of Lamborghinis.  If everyone is willing to pay a higher price, then the shortage disappears.  It is true that there is a shortage of new Lamborghinis that sell for the price of a Toyota Camry, and this is the same strikingly foolish line of thinking that leads to news articles about the dearth of workers for technology ____ (fill in the blank).

Are you saying companies should just pay more and the problem will be solved?
Yes, this is precisely correct.  In the Western world we live in a mostly free-market economy where supply and demand are the primary economic forces.  Saying that you want a machine learning expert for $90K may be a lot like saying you want a Lamborghini for $30K.  On the salary website, the national median for radiologists is right at $250K.  According to the NY Times, demand for imaging has slowed since 2006 and recent radiology graduates are having difficulty finding jobs.  Nevertheless, hospitals have paid and continue to pay incredibly high salaries for this profession.  If we turn our attention to salaries for machine learning experts, or even for directors of IT security, it should be clear to companies why there seems to be a shortage.  Not a single maximum salary listed across all companies exceeds $200K.

Whether you are the Director of Machine Learning at Amazon or the Director of IT Security at UBS, whether you have a PhD in Computer Science or a dozen years mastering your enterprise's infrastructure, you will still earn less than the average radiologist.  This contrast remains even while there is an alleged shortage of these IT professions and a surplus of radiologists.

What then is holding companies back from paying more?  I believe the answer to this varies based on the particular job, but the root cause is that people stray from rational choices.  Let's consider some plausible scenarios. 

Russian Roulette Scenario
Upper management of Company XYZ has a hunch that security is lacking around its services, but so far they have avoided any major catastrophes.  They lost one computer security specialist to a better opportunity last month, but they are reluctant to increase the pay of the remaining specialists and post a new job for a replacement.  In short, they've chosen to increase their risk to save on human capital costs.  When a serious and costly security breach does occur, they will likely claim to be a victim of a small pool of talented workers, and the media will run with that headline.

Irrationally Too Expensive Scenario
Let's face it, we are emotional creatures.  A quick read of Dan Ariely's work will convince you that humans frequently make harmful, irrational choices, especially when it comes to pricing.  This is certainly the case with salaries.  Rarely does an executive ask, "What is this position (or person) worth to our company?" and pay accordingly.  Instead, they conjure up a figure based on their own beliefs (e.g., doctors deserve more than PhDs), or they look at what everyone else seems to be paying.  The latter approach will often work, but not if most employers are also paying less than they should.  The bottom line is that you need to increase the pay (and perhaps the exposure of the job listing) if you are not able to hire for the position.  If a worker can save (or gain) $500K per year for a company, then it would be foolish to throw your hands up in the air and say that there is a shortage when you can't find a senior developer for $120K.  Simply increase the salary to $150K and see what happens.  The difference between the salary and the expected gain from the employee is still much greater than zero.  In short, employers need to be rational.  They should use basic math and economics rather than emotions and preconceived notions.
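The arithmetic above can be sketched in a few lines of Ruby.  The $500K expected gain and the two salary offers are the article's hypothetical figures; everything else here is my own illustration:

```ruby
# Sketch of the hiring arithmetic above, using the hypothetical
# numbers from the text: a hire expected to save/gain $500K per year.
EXPECTED_ANNUAL_GAIN = 500_000

# Net value to the company of hiring at a given salary
def net_value(salary, expected_gain = EXPECTED_ANNUAL_GAIN)
  expected_gain - salary
end

# Raising the offer from $120K to $150K barely dents the surplus,
# so "we can't find anyone" is not a shortage -- it's an
# unwillingness to pay the market-clearing price.
[120_000, 150_000].each do |salary|
  puts "Offer $#{salary}: net value $#{net_value(salary)}"
end
```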

I'm sure there are other causes as well -- feel free to post some in the comments.  Thanks for reading!

Thursday, August 29, 2013

Resources for the Internet of Things

Whether you refer to it as The Internet of Things, the Sensor Revolution, or the Programmable World, here are some potentially useful resources:
Microcontrollers & Systems on a Chip
  • TinkerForge - Stackable, programmable boards.  Appears to be simpler than Arduino in many ways.
  • Arduino - Everyone's favorite open-source microcontroller
  • BeagleBone Black
  • Raspberry Pi
  • Galago
  • Intel Galileo - open-source board from Intel

I will try to expand this list over time.  If you have more, please add them in the comments.  Thanks!

Thursday, January 3, 2013

Will a Robot Take My Job?

With many jobs lost and the economy teetering on recovery, this question couldn't be more relevant. The man vs. machine debate has a long history and conjures up images of John Henry racing to his death against a steam hammer. The reality is that machines already do the work of people and will continue to usurp greater and greater responsibilities. If your job is not one of the ones lost, then it is quite easy to see that this expansion should be welcomed, as it allows for greater societal productivity. After all, do you miss doing dishes by hand? Wish you could wash your laundry in a tub and hang it out to dry? Want to pay higher prices for handmade products? The key is to understand which jobs are likely to be filled by machines. The following questions should help.


Is your work highly repetitive?  
Specialized, repetitive tasks are easier to automate.  It is the human ability to generalize that makes our "wetware" different from robotic hardware. Maybe you execute the same movements as part of a manual labor position, or maybe you spend the day downloading the same data and plugging it into the same spreadsheet.  Whatever the case may be, if you find yourself doing the same tasks over and over, then your job is more likely to be taken by a machine.

Are there many people in your company in the same role as you?
If so, you are a great target for automation.  Developing algorithms and/or hardware to replace one person is often not cost effective, but it may be a no-brainer for a company to invest in machines that replace 100 people.

What collar do you wear?  White or blue?
Previous generations of machines have found great success replacing manual labor in factories, but the next generation of machines will be more likely to replace office workers -- white collar folks.  The reason for this is that massive data and computing power have reached a critical mass allowing difficult "thinking" tasks to be conquered by machines.  We now have machines that can dominate Jeopardy, diagnose cancer from a breast scan, and grade essays.  This trend will only increase as more data and computing power become available in the years ahead. 

On the other hand, most jobs that involve service work or manual labor will be difficult for robots to replace.  The reality is that complex analytical thinking will be easier to duplicate than our nerves and muscles.  We take for granted how easily we handle sophisticated movements, but it will be many years before robots are equipped with the sensors and actuators comparable to a human's.  Until then, don't expect the moving company to send in robots to pack up your house.

Do you interact with people?
Many jobs require a uniquely human touch, something that machines do not offer -- something that machines may never offer.  If meaningful interaction with people is an important part of your job, then you are unlikely to be replaced by a machine any time soon.  Don't expect to be greeted by a robotic salesperson or a robotic shrink in the next century.

The bottom line is that the use of machines will continue to allow society to be more productive, and productivity often means one person doing the work of many.  We should embrace this new potential, but also be cognizant of the skill sets most needed in the 21st century.

Wednesday, September 12, 2012

Quick Solutions: Rails 3, send_data, and garbled PDF output

This is just a quick blog entry to help anyone experiencing garbled PDF output (in Safari) when using Rails 3's send_data to output dynamically-generated PDF files. 

If you are following the documentation, then you are probably outputting something like this:

    send_data data, :filename => "myfile.pdf",
                    :type => 'application/pdf'

I don't know if this is specific to Rails 3, but the issue is that the Content-Type header is not being set to 'application/pdf', so setting it explicitly in the response should fix this issue:

    response.headers['Content-Type'] = 'application/pdf'
    send_data data, :filename => "myfile.pdf",
                    :type => 'application/pdf'

Tuesday, June 26, 2012

Machine Learning: Naive Bayes Classification with Ruby

Maybe you've wondered, "Where are all the Ruby libraries for Machine Learning and NLP?".  Despite Ruby's growing user base and ability to quickly manipulate data and text, there seems to be a dearth of tools for NLP and Machine Learning.  One statistical tool that finds itself in the intersection of NLP and Machine Learning is Naive Bayes.  [For those already familiar with Naive Bayes, you may wish to skip ahead to the Quickstart section below (although I promise not to be long-winded in my intro).]    

This peculiarly-named approach to classification tasks is based on the well-known Bayes' Rule and is used to calculate the probability that an instance belongs to a particular class (or category) based on the components of the instance.   It is called "naive" Bayes because of the manner in which it calculates the probabilities: it treats each component as if it were independent of the others, even though this is usually not the case.  Surprisingly, Naive Bayes does quite well in practice, and it is optimal in the case where the components of an instance actually are independent.
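The calculation described above can be sketched directly: P(class | tokens) is proportional to P(class) times the product of P(token | class) over each token.  Here is a minimal illustration of that math (this is my own toy code with made-up counts, not the nbayes gem's internals):

```ruby
# Toy training counts: how often each token appeared in each class,
# and how many documents were seen per class.
COUNTS = {
  'SPAM' => { 'buy' => 3, 'viagra' => 2, 'hello' => 0 },
  'HAM'  => { 'buy' => 0, 'viagra' => 0, 'hello' => 4 }
}
DOCS = { 'SPAM' => 4, 'HAM' => 4 }

# Naive score for a class: the prior times the (smoothed) likelihood
# of each token, treating tokens as independent of one another.
def naive_bayes_score(klass, tokens)
  prior = DOCS[klass].to_f / DOCS.values.sum
  vocab = COUNTS[klass].keys.size
  token_total = COUNTS[klass].values.sum
  tokens.reduce(prior) do |score, tok|
    count = COUNTS[klass].fetch(tok, 0)
    # Laplacian smoothing so unseen tokens don't zero out the product
    score * (count + 1.0) / (token_total + vocab)
  end
end

tokens = %w[buy viagra]
scores = { 'SPAM' => naive_bayes_score('SPAM', tokens),
           'HAM'  => naive_bayes_score('HAM', tokens) }
p scores.max_by { |_, v| v }.first  # => "SPAM"
```

The nbayes gem wraps this same idea (plus the extras listed below) behind a much friendlier interface.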

Enough generalities... Naive Bayes is used widely for text classification problems, and the "components" that I referenced above are actually tokens -- typically words.  Even if you have no desire to understand the probabilistic engine beneath the hood, Naive Bayes is easy to use, performant, and accurate relative to other classifiers.  It requires a two-step process:

1) Train the classifier by providing it with sets of tokens (e.g., words), each accompanied by a class (e.g., 'SPAM')
2) Run the trained classifier on unclassified (i.e., unlabeled) tokens and it will predict a class


First, install the gem:

gem install nbayes

After that, it's time to begin training the classifier:
# create new classifier instance
nbayes = NBayes::Base.new
# train it - notice split method used to tokenize text (more on that below)
nbayes.train( "You need to buy some Viagra".split(/\s+/), 'SPAM' )
nbayes.train( "This is not spam, just a letter to Bob.".split(/\s+/), 'HAM' )
nbayes.train( "Hey Oasic, Do you offer consulting?".split(/\s+/), 'HAM' )
nbayes.train( "You should buy this stock".split(/\s+/), 'SPAM' )

Finally, let's use it to classify a document:

# tokenize message
tokens = "Now is the time to buy Viagra cheaply and discreetly".split(/\s+/)
result = nbayes.classify(tokens)
# print likely class (SPAM or HAM)
p result.max_class
# print probability of message being SPAM
p result['SPAM']
# print probability of message being HAM
p result['HAM']

But that's not all!  I'm claiming that this is a full-featured Naive Bayes implementation, so I better back that up with information about all the goodies.  Here we go:

  • Works with all types of tokens, not just text.  Of course, because of this, we leave tokenization up to you.
  • Disk based persistence
  • Allows prior distribution on classes to be assumed uniform (optional)
  • Outputs probabilities, instead of just class w/max probability
  • Customizable constant value for Laplacian smoothing
  • Optional and customizable purging of low-frequency tokens (for performance)
  • Optional binarized mode to reduce the impact of repeated words
  • Uses log probabilities to avoid underflow
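That last bullet deserves a quick illustration.  Multiplying many small per-token probabilities underflows 64-bit floats to zero, while summing their logs stays perfectly representable.  This sketch is my own demonstration of the principle, not the gem's internal code:

```ruby
# Why log probabilities matter: the naive product of many small
# probabilities underflows to 0.0, but the sum of logs does not.
probs = [1e-5] * 100  # e.g., 100 tokens, each with probability 1e-5

naive_product = probs.reduce(:*)          # 1e-500 underflows to 0.0
log_sum = probs.sum { |p| Math.log(p) }   # about -1151.3, no underflow

puts naive_product  # => 0.0
puts log_sum
```

Since log is monotonic, comparing log-sums between classes picks the same winner as comparing the raw products would, without ever hitting zero.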

I hope to post examples of these features in action in a future post.  Until then, view nbayes_spec.rb for usage.