Tuesday, February 22, 2011

An Introduction to Predictive Analytics Contests and How a Non-Data Scientist Ranked in the Top 20% in a Global Predictive Analytics Contest

One of the many benefits of the social web is the ability to crowd-source expertise to help with business problems.

A great example of this is predictive analytics - or data mining - contests. Mainstream business probably first took notice when Netflix announced a prize of one million dollars to anybody on the planet who could improve its ability to predict whether someone was likely to enjoy a movie - The Netflix Prize. Genius concept. For a mere million dollars (not even at risk unless results were delivered) plus some logistics costs, they obtained access to some of the smartest data scientists in the world. Consider for a minute what an internal department tasked with solving this problem would have cost: a million dollars annually would not take them very far - a handful of staff, and of course they would be limited to whoever was looking for a job at the time. Instead, via this competition, 51,051 contestants entered from 186 different countries! You can be sure that the contestants (now their outsourced predictive analytics department) ran the gamut from NASA scientists to university students.

Since then, Kaggle has emerged, which is really a great idea. What they provide is a platform to host competitions like the Netflix Prize, but for any company. This idea is right at its time in my opinion, and I believe it has a tremendous amount to offer the world. Kaggle recently announced they will be hosting the Heritage Health Prize, which I have referred to in an earlier post. This is a $3 million USD prize which I believe should make a meaningful difference to health care, and I suspect will cause a tipping point in increased focus on predictive analytics in this space.

For a number of years, the seductively named Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining has held an annual contest - the KDD Cup: http://www.sigkdd.org/kddcup/

An organization will typically donate some interesting data, and data scientists from around the world will descend on it to see how well they can do at building predictive models. In 2009, Orange (the seventh largest telecom operator in the world) put up some of their data about customer behavior. The objective was to estimate the churn, appetency and up-selling probability of their customers around the world.

I was recently asked how 11Ants Model Builder would perform in such a competition. I thought this was a fair and valid question - only I did not know the answer. The software has been designed to give non-experts access to predictive analytics. So my thesis was that it would probably perform quite well, but not as well as an expert data scientist - let's say, in a field like this, 'average'.

So I thought about what a fair test would be. I am not a data scientist. So I decided I would have a go at entering this competition myself using 11Ants Model Builder.

I loaded the data into Excel and ensured it was formatted correctly, which took me about 15 minutes. (For anyone who is an expert: I did not apply any transformations to the data, I simply used the raw data.) Then I hit the button 'ANALYZE DATA & GENERATE MODELS' and left it to run over the weekend.
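For the curious, a tool like this typically automates what a data scientist would otherwise do by hand: fit several different model families to the same data and keep whichever scores best under cross-validation. The 11Ants internals are not public, so the sketch below is purely illustrative - scikit-learn models on synthetic stand-in data, not the actual competition data or algorithms.

```python
# Illustrative sketch of automated model generation: try several model
# families and keep the one with the best cross-validated score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the competition data: 1,000 customers, 20 numeric features,
# and a binary label (e.g. churned / did not churn).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each candidate with 5-fold cross-validation and pick the winner.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The point is not the specific models, but that the search itself can be automated - which is what lets a non-expert press one button and walk away for the weekend.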

The competition has a separate data set to test predictions on (without telling you the real values). You upload your predicted results for this test set, they immediately score your results (how close your predictions were to the true values), and you are placed on the leaderboard.
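The KDD Cup 2009 tasks were scored by the area under the ROC curve (AUC), which rewards ranking the customers correctly rather than guessing exact values. Here is a tiny made-up example of how such a score is computed; the labels and prediction values are invented for illustration.

```python
# Scoring uploaded predictions with AUC: 1.0 means every true positive
# was ranked above every negative; random guessing scores about 0.5.
from sklearn.metrics import roc_auc_score

true_labels = [0, 0, 1, 0, 1, 1, 0, 1]                   # hidden from contestants
predictions = [0.1, 0.3, 0.8, 0.2, 0.9, 0.7, 0.4, 0.6]   # the uploaded scores

score = roc_auc_score(true_labels, predictions)
print(score)  # every positive outranks every negative, so this prints 1.0
```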

The result: this non-data scientist placed 743rd of 3,668 entries - putting me in the top 20% of the contestants in a global data mining competition.


This part will make most readers' eyes glaze over, but in case anyone is interested in what the lift curve looked like on the up-sell model, this is how it looked. A lift chart shows how many times more likely a given customer is to be amenable to an up-sell than one selected at random. So at a level of ten, we are ten times more likely to get a customer who will respond favorably to our up-sell offer. You can see that as we work our way down the 'list' the lift drops - customers further down the ranked list are progressively less likely to respond.
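For anyone who wants to see the mechanics, a lift curve is simple to compute: sort customers by predicted score, then compare the response rate in each top slice of the ranked list with the overall response rate. The sketch below uses made-up scores and responses, not the Orange data.

```python
# Minimal lift-curve computation on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)       # hypothetical model scores, one per customer
responded = scores > 0.8        # pretend only high-scoring customers respond

order = np.argsort(-scores)     # rank customers, best prospects first
base_rate = responded.mean()    # response rate of a random selection

lifts = {}
for pct in (0.05, 0.10, 0.25, 1.00):
    top = order[: int(len(order) * pct)]          # top slice of the list
    lifts[pct] = responded[top].mean() / base_rate
    print(f"top {pct:.0%}: lift = {lifts[pct]:.1f}x")
```

By construction the lift is highest in the top slice and falls to exactly 1.0 over the whole list - the same shape as the chart above: targeting from the top of the list beats random selection, and the advantage shrinks as you work downward.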


The major point I would want anyone to take away from this is that advanced predictive analytics are accessible today, to just about anyone. Consider that Orange is a phone company with €52 billion in revenue in 2009, and you can be sure that they have a talented and expensive predictive analytics department. We can also deduce that some of the data scientists in this competition may have managed to do better than Orange's own team (speculation, but quite likely to be the fact).

We should all be quite encouraged by the fact that over the course of a weekend (only 15 minutes of which was actual work; the balance was my laptop processing away), a single individual with some inexpensive software can build a predictive model that ranks him in the top 20% of data scientists in the world - even though he is not a data scientist!
