Tuesday, February 22, 2011

An Introduction to Predictive Analytics Contests and How a Non-Data Scientist Ranked in the Top 20% in a Global Predictive Analytics Contest

One of the many benefits of the social web is the ability to crowd-source expertise to help solve business problems.

A great example of this is predictive analytics - or data mining - contests. Mainstream business probably first took notice when Netflix announced a prize of one million dollars to anybody on the planet who could improve its ability to predict whether someone was likely to enjoy a movie - The Netflix Prize. Genius concept. For a mere million dollars (not even at risk unless results were delivered) plus some logistics costs, Netflix obtained access to some of the smartest data scientists in the world. Consider for a minute what would have happened if they had set up an internal department tasked with solving this problem: a million dollars annually would not take them very far - a handful of staff, limited to whoever happened to be looking for a job at the time. Instead, via this competition, 51,051 contestants entered from 186 different countries! You can be sure that the contestants (now their outsourced predictive analytics department) ran the gamut from NASA scientists to university students.

Since then, Kaggle has emerged, which is a really great idea. Kaggle provides a platform to host competitions like the Netflix Prize, but for any company. It is an idea whose time has come, in my opinion, and one that I believe has a tremendous amount to offer the world. Kaggle recently announced that it will be hosting the Heritage Health Prize, which I referred to in an earlier post. This is a $3 million USD prize which I believe should make a meaningful difference to health care, and I suspect it will cause a tipping point of increased focus on predictive analytics in this space.

For a number of years, the seductively named Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining has held an annual contest - the KDD Cup: http://www.sigkdd.org/kddcup/

An organization will typically donate some interesting data, and data scientists from around the world will descend on it and see how well they can do at building predictive models. In 2009, Orange (the seventh largest telecom operator in the world) put up some of its data about customer behavior. The objective was to estimate the churn, appetency, and up-selling probability of its customers around the world.
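To give a flavor of what such a task involves, here is a minimal sketch in Python of the kind of churn model contestants would build, assuming scikit-learn and a hypothetical CSV of anonymized customer features (the file and column names are illustrative, not the actual contest data):

```python
# A much-simplified sketch of a KDD Cup 2009-style task: given anonymized
# customer features, estimate the probability of an event such as churn.
# The file name and column names below are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("orange_customers.csv")   # hypothetical training data
X = data.drop(columns=["churn"])             # anonymized predictors
y = data["churn"]                            # 1 = churned, 0 = stayed

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
churn_probability = model.predict_proba(X_valid)[:, 1]  # one score per customer
```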

I was recently asked how 11Ants Model Builder would perform in such a competition - a fair and valid question, only I did not know the answer. The software has been designed to give non-experts access to predictive analytics. So my thesis was that it would probably perform quite well, but not as well as an expert data scientist - let's say, in a field like this, 'average'.

So I thought about what a fair test would be. I am not a data scientist. So I decided I would have a go at entering this competition myself using 11Ants Model Builder.

I loaded the data into Excel and ensured it was formatted correctly, which took me about 15 minutes. (For anyone who is an expert: I did not apply any transformations to the data; I simply used the raw data.) Then I hit the button 'ANALYZE DATA & GENERATE MODELS' and left it to run over the weekend.

The competition has a separate data set to test predictions on (without telling you the real values). You upload your predicted results for this test set, and the organizers immediately score your results (how close your predictions were to the true values) and put you onto the leaderboard.
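For the curious, that scoring step amounts to something like the following sketch. The KDD Cup 2009 used AUC (area under the ROC curve) as its metric; the values below are toy stand-ins, since the real labels are withheld by the organizers:

```python
# The organizers compare uploaded predictions against the withheld true
# values. Toy numbers below; in the contest the labels are never published.
from sklearn.metrics import roc_auc_score

true_labels = [0, 1, 1, 0, 1, 0]                        # held by the organizers
uploaded_predictions = [0.2, 0.9, 0.6, 0.3, 0.8, 0.1]   # a contestant's scores

leaderboard_score = roc_auc_score(true_labels, uploaded_predictions)
print(leaderboard_score)  # closer to 1.0 means positives are ranked higher
```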

The result: this non-data scientist placed 743rd out of 3,668 entries - putting me in the top 20% of the contestants in a global data mining competition.


This part will make most of your eyes glaze over, but in case anyone is interested, this is what the lift curve on the up-sell model looked like. A lift chart shows how many times more likely a given customer is to be amenable to an up-sell than a randomly selected one. So at a level of ten, we are ten times more likely to get a customer who will respond favorably to our up-sell offer. You can see that as we work our way through the 'list', the lift drops - the statistical certainty reduces as we move further down the list.
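For anyone who wants the arithmetic behind a lift chart, here is a minimal sketch (toy data, hypothetical names): rank customers by predicted score, then compare the response rate in a top slice of the ranking with the overall response rate:

```python
import numpy as np

def lift_at(scores, outcomes, fraction):
    """Lift in the top `fraction` of customers, ranked by predicted score."""
    order = np.argsort(scores)[::-1]                  # best prospects first
    top = order[: max(1, int(len(order) * fraction))]
    top_rate = np.mean(np.asarray(outcomes)[top])     # response rate in top slice
    base_rate = np.mean(outcomes)                     # response rate overall
    return top_rate / base_rate

# Toy data: predicted up-sell scores and actual responses.
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1])
outcomes = np.array([1, 1, 0, 1, 0, 0, 0, 0])
print(lift_at(scores, outcomes, 0.25))  # ~2.67: top quartile vs a random pick
```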


The major point I would want anyone to take away from this is that advanced predictive analytics is accessible today, to just about anyone. Consider that Orange is a phone company with 52 billion euros of revenue in 2009, so you can be pretty sure it has a talented and expensive predictive analytics department. We can also deduce that some of the data scientists in this competition may have managed to do better than Orange itself (speculation, but quite likely to be the fact).

We should all be quite encouraged by the fact that over the course of a weekend (actually, only 15 minutes of that was work; the balance was my laptop processing away), a single individual with some inexpensive software can build a predictive model that ranks him in the top 20% of data scientists in the world - even though he is not a data scientist!

Thursday, February 3, 2011

Predictive Analytics at a Global Computer Manufacturer

Imagine you are a global computer manufacturer. You may or may not own the factories that produce your hardware, but one thing that you do own for sure - which is a significantly more complex beast - is effectively one massive data factory.

Consider for a moment all the data being produced by your organization, for your organization, and being dumped into your organization's prolific data warehouse system(s):

  • Marketing Data
  • Quality Data
  • Efficiency Data
  • Sales Data
  • Returns Data
  • Web Generated Data
  • Call Center Generated Data
  • Analyst Data
  • Investor Relations Data
  • Social Media Data
  • Maintenance Plan Data

And so it goes on... this barely scratches the surface of the data generated, but let's use it as a starting point for making some sort of a point, which has been a long time coming...

As this is a blog on predictive analytics, the question returns, as always, to how predictive analytics could be useful to such an organization, and why.

The overarching answer to the question 'why?' is likely something approximating: because our organization has developed a keen interest in looking forward, to what is about to happen, rather than just reporting what has happened in the past. This idea probably has general appeal to just about any sober executive; conceptually, it is a great thought, after all.

But grandiose as it is, it doesn't really solve any specific business problem; rather, it is just the theme that unites the solving of hundreds, thousands, or millions of business problems. Therefore, it is worthwhile to consider 'why?' on a more micro level. So we will address this relative to the data types listed above (very superficially) to gain a flavor for what predictive analytics can achieve:

  • Marketing Data - Which prospects are most likely to respond to an offer for product x? Which prospects are likely to respond favorably to an up-sell? How can I predict which segment a customer falls into with only partial information about them? (See the sketch after this list.)
  • Quality Data - What are the big predictors to quality problems? Given scenario a,b,c,d - what is the quality level I can expect at the end of the production line?
  • Efficiency Data - What are the big predictors of more efficient manufacturing lines? Given scenario a,b,c,d - what is the efficiency level I can expect at the end of the production line?
  • Sales Data - How much are we likely to sell this day, week, month, quarter? What are the big predictors of sales volume?
  • Returns Data - Exactly which order to which customer is most likely to be returned? (A scored list for each daily shipment, with the statistical probability of a return issue attributed to each shipment, so that interventions can be placed on those classified as high risk.) What are the big predictors of goods being returned to us?
  • Web Generated Data - Which behaviors on the website are most likely to lead to a purchase? Therefore which customers are more likely to purchase? Do we treat them differently?
  • Call Center Generated Data - Given variables a,b,c,d and e - what are my staffing requirements going to be?
  • Analyst Data - Which analysts are responsible for giving us the most coverage? How do I score analysts in terms of influence/importance?
  • Investor Relations Data - How do I score investor relations inquiries in terms of importance, so that the appropriate level of support can be provided to each inquiry?
  • Social Media Data - Are there any correlations between the fire hose of social media data being generated and business objectives? Can I predict future events from specific occurrences in the blogosphere?
  • Maintenance Plan Data - Which customers are most at risk of not renewing their maintenance plans? Which prospects are most likely to purchase maintenance plans? What are the biggest drivers of people not renewing? Which customers should I not aggressively offer maintenance plans to, as they are statistically unprofitable?
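To make the first of these concrete, here is a minimal sketch of the marketing question in Python: train on a past campaign's responses, then rank current prospects by their predicted likelihood to respond. The file and column names are hypothetical, and the features are assumed to be numeric:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical campaign data: numeric customer features plus
# a 'responded' flag recording who took up the offer for product x.
history = pd.read_csv("past_campaign.csv")
X, y = history.drop(columns=["responded"]), history["responded"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score current prospects and contact the most likely responders first.
prospects = pd.read_csv("prospects.csv")  # same feature columns, no flag
prospects["response_score"] = model.predict_proba(prospects)[:, 1]
print(prospects.sort_values("response_score", ascending=False).head(10))
```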

Most large organizations - and global computer manufacturers are no exception - have a dedicated predictive analytics department. Most frequently, it is tucked away in a deep, dark corner of the organization. They are usually tooled up with software from companies like IBM/SPSS or SAS - both great companies. The people in this department are clever and well trained in the software (which is what it requires - this software cannot be learned in a quick tutorial; try days to weeks). These people are also constantly troubled by the morons in the business who want to try some new predictive analytics project. As with any problem of resource allocation, if your issue is not considered an organizational priority, it will likely not get done - even if it would help your specific department immeasurably. That is, if you even know that this predictive analytics capability exists within your organization - trust me, most people don't.

Generally, it's fair to say that performing predictive analytics takes on the aura of alchemy - something you most certainly don't want to try at home, and most probably are too intimidated to even try at work. A regular business analyst dare not even consider it. So you either convince the predictive analytics department to do it, or put it in the 'too hard' basket.

At 11Ants Analytics we've been busy painting a new future, it looks something like this:

'Put the analytics capabilities into the hands of the people most motivated to solve the problem'

Imagine a world where anyone who can use Excel can perform advanced predictive analytics. The capability becomes widely deployed to subject matter experts (e.g. marketing, quality, etc.) throughout an organization, even if they know nothing about data mining and statistics.

That is what 11Ants Model Builder delivers - it is an incredibly easy-to-use Excel add-in that allows anyone to practice advanced predictive analytics and be up and running within minutes. Science fiction no more - go check it out.

Wednesday, February 2, 2011

Predictive Analytics World - March 14-15, 2011

A great conference for anyone interested in predictive analytics is Predictive Analytics World. The next one is coming up in San Francisco, March 14-15, 2011. The organizers have been good enough to offer a 15% discount on the two-day conference pass to readers of my blog: http://www.predictiveanalyticsworld.com/sanfrancisco/2011/ To take up this offer, use the following code when registering: TF11ABP.

Tuesday, February 1, 2011

Predictive Analytics for Border Protection

Imagine that you’ve been given responsibility for protecting a nation’s borders. Your job may be any or all of the following: stop bad people from entering, stop illegal substances from entering, stop weapons from entering, or ensure that importers pay their fair share of customs duties.

Whatever the objective, you will face a similar problem: how to block the most bad guys with minimum disruption to the good guys. Minimum disruption to the good guys matters both from a goodwill and from a cost standpoint – no government in the world has unlimited resources. Blocking the bad guys is obviously also very important – the more time you waste on good guys, the less likely you are to catch an extra bad guy.

Predictive analytics holds massive potential to assist customs services, border protection agencies, and homeland security agencies to more efficiently sort through the ever-increasing amount of noise thrown at them and isolate those individuals and shipments that should be given special attention.

Let’s look at an example of determining which incoming shipments are at risk of having under-declared customs duties. You can equally apply this to any other type of inspection task:

How is this done now? Depending upon where in the world you are, the answer is largely a combination of ‘gut feel’, rules-based instructions, and/or random checking. With experience, staff develop an implicit understanding of which shipments are perceived to be of greater risk. However, this experience is not evenly distributed – an inspector with 20 years of experience is, almost by definition, likely to have a better ‘gut feel’ than one in her second week on the job.

Fortunately, what many customs services around the world do have is a massive database recording every shipment that has ever entered their jurisdiction, along with every known case of under-declared customs duties. Accordingly, they possess a very valuable (yet under-utilized) asset: 1) lots of examples of ‘events’ (in this case, under-declared customs duties) and 2) lots of potential inputs (or predictors) to help predict those events. The potential inputs are all the other information we have about the shipment – the bill of lading alone provides a plethora of such information (e.g. country of origin, ship name, port of loading, etc.).

Though not many people automatically think of it like this, if correctly analyzed, this database holds the cumulative knowledge of every single customs officer who has ever found an under-declared shipment in the history of data gathering within that organization. The problem is that this is practically impossible for a human to synthesize and therefore exploit – which is why not many people think of it in this way; the capability to do so is relatively new.

So imagine if we could take all this historical data, throw some sophisticated machine learning algorithms at it, and analyze it for patterns which can help predict which shipments are likely to be under-declared. The emergent patterns are captured in a 'predictive model'.
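As a much-simplified sketch of that training step (hypothetical file and field names, with a couple of bill-of-lading fields standing in for the full record):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical historical data: one row per past shipment, with fields
# from the bill of lading and a flag for known under-declaration cases.
shipments = pd.read_csv("historical_shipments.csv")
features = pd.get_dummies(
    shipments[["country_of_origin", "ship_name", "port_of_loading"]])
labels = shipments["under_declared"]  # 1 = duties found to be under-declared

risk_model = RandomForestClassifier(n_estimators=200).fit(features, labels)
```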

The next step is to run our database of today’s incoming shipments against this model. The model applies a score to every shipment; the score is a measure of risk – the higher the score, the more statistically likely the shipment is to have under-declared its customs duties. The output is a list sorted by score, ranking the incoming shipments from most at risk to least at risk. Our inspection officers start at the top of the list (not the bottom, or the middle) and work their way down. We can even statistically generate the optimal point in the list at which to stop.
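Continuing the sketch above, scoring and ranking today’s arrivals might look like this. The break-even rule at the end is just one simple way to pick a stopping point, assuming we know the cost of an inspection and the average duty recovered per confirmed case (the numbers are invented):

```python
# Score today's arrivals with the model trained above and rank them.
todays = pd.read_csv("todays_shipments.csv")  # hypothetical file
todays_features = pd.get_dummies(
    todays[["country_of_origin", "ship_name", "port_of_loading"]])
todays_features = todays_features.reindex(columns=features.columns, fill_value=0)

todays["risk_score"] = risk_model.predict_proba(todays_features)[:, 1]
ranked = todays.sort_values("risk_score", ascending=False)

# One simple stopping rule (illustrative numbers): keep inspecting while
# the expected duty recovered still exceeds the cost of an inspection.
INSPECTION_COST = 150.0   # hypothetical cost per inspection
AVG_RECOVERY = 2000.0     # hypothetical duty recovered per confirmed case
worth_inspecting = ranked[ranked["risk_score"] * AVG_RECOVERY > INSPECTION_COST]
```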

This is not science fiction; it is quite achievable. In some work I was involved with along these lines at 11Ants Analytics, a computer-picked candidate for inspection was three times more likely to require intervention than a randomly selected one. The ultimate implementation of such a system would continue learning all the time, with inspectors’ knowledge continually fed back into it.

We can equally apply this principle to the profiling of individuals, dangerous shipments, or many other things – the opportunities are massive. All we require is: 1) an understanding of what would be useful for us to predict, and 2) ancillary data relating to the examples, which we can interrogate for some form of correlation.

If you are in a customs agency anywhere in the world and are interested in learning more about this type of application, please don’t hesitate to drop me an email.