Predictive Analytics for Everyone: October 2011

Saturday, October 29, 2011

Microsoft Data Explorer - Predictive Analytics Next Inflection Point?

A very interesting development from Microsoft may well bring a major inflection point in the adoption of predictive analytics.

On October 13, 2011 a post appeared on Tim Mallalieu's Blog. Tim is a Group Program Manager at Microsoft, and his post revealed that for the last 14 months Microsoft have been very quietly developing an ETL tool. The product has been assigned the name Microsoft Codename Data Explorer (it was previously referred to as Montego). ETL stands for Extract, Transform and Load. At risk of greatly oversimplifying, an ETL tool lets you access data from multiple storage silos, take only the parts you need, and bring them into one place where you can work on them. So, for example, we can have some data in our CRM system, some data in our transactional database, and (say) weather data from the Weather Channel website. With an ETL tool we can bring all of that data into the same place, ready for us to work with, and we can constantly refresh it so that it is always the most recent data (with the ETL already performed on it). It is not just that the data is in one place, but that only the relevant data we actually require is there, ready for us to do something with. This is actually pretty neat, though not particularly new. For more, see ETL.
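To make the CRM, transactions and weather example concrete, here is a minimal sketch of the three ETL steps in plain Python. It is not how Data Explorer works (I have not used the product); the records, field names and the `sales_with_weather` table are all made up for illustration, and an in-memory SQLite database stands in for "the same place".

```python
import sqlite3

# Hypothetical source data standing in for three separate silos.
crm_rows = [
    {"customer_id": 1, "name": "Ann", "region": "North", "fax": "n/a"},
    {"customer_id": 2, "name": "Bob", "region": "South", "fax": "n/a"},
]
transaction_rows = [
    {"customer_id": 1, "date": "2011-10-01", "amount": 40.0},
    {"customer_id": 1, "date": "2011-10-02", "amount": 15.5},
    {"customer_id": 2, "date": "2011-10-01", "amount": 22.0},
]
weather_rows = [
    {"region": "North", "date": "2011-10-01", "rainfall_mm": 0.0},
    {"region": "North", "date": "2011-10-02", "rainfall_mm": 12.0},
    {"region": "South", "date": "2011-10-01", "rainfall_mm": 3.5},
]


def extract_transform(crm, transactions, weather):
    """Join the three sources and keep only the fields we actually need."""
    region_of = {c["customer_id"]: c["region"] for c in crm}
    rain_on = {(w["region"], w["date"]): w["rainfall_mm"] for w in weather}
    combined = []
    for t in transactions:
        region = region_of[t["customer_id"]]
        rain = rain_on.get((region, t["date"]))  # None if no weather record
        combined.append((t["customer_id"], region, t["date"], t["amount"], rain))
    return combined


def load(rows):
    """Load the combined rows into a single in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_with_weather ("
        "customer_id INTEGER, region TEXT, date TEXT, amount REAL, rainfall_mm REAL)"
    )
    conn.executemany("INSERT INTO sales_with_weather VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load(extract_transform(crm_rows, transaction_rows, weather_rows))
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales_with_weather GROUP BY region ORDER BY region"
).fetchall()
print(sales_by_region)
```

Note that the irrelevant CRM fields (name, fax) never reach the combined table: only the data we need ends up in the one place. Refreshing the data would simply mean running the same steps again against the current sources.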

What is brand new though, and what really holds the potential to be game changing, is that now nearly anyone will be able to do this. It was previously a highly complicated affair, laden with integration and a lot of complex work: people didn’t want to try this at home, and if they weren’t experts they didn’t really want to try it at work either. But it appears that Microsoft have completely changed that (I say that without having used the product, but the vision here is very clear). This is good in its own right (ETL has many uses), but when it comes to predictive analytics it has some quite serious implications (in a positive way).

At risk of stating the obvious, predictive analytics at its heart relies on building predictive models, and predictive models rely on data – often lots of data, and often disparate data.

Our experience when we launched our desktop modelling tools (11Ants Model Builder, 11Ants Customer Churn Analyzer, and 11Ants CustomerResponse Analyzer) was that overnight we were able to trivialize the most technically complex part of model building (I say overnight...the technology took us over three years to develop, but one night it was finished!), and suddenly people who hadn’t contemplated building predictive models found they could build them with very little effort. A business analyst with a basic understanding of Excel could now build models without needing to understand machine learning algorithms. Experienced model builders could also lift the quality of their models while reducing development time (for a paper on how to beat 85% of the submissions in an international data mining contest with less than 50 minutes' work, refer to 11Ants Customer Churn Analyzer outperforms 85% of Submissions in International Predictive Analytics Contest).

However, as all good students of the Theory of Constraints know, as soon as we remove one constraint, we clear the way for the next constraint to become the rate limiter (there is a good book about this incidentally: The Goal). Well, it turns out that once algorithm selection and evaluation have been trivialized, the new rate-limiting step is the extraction and preparation of the data.

The reality is that predictive analytics can be considered like a science experiment, or more correctly, lots of science experiments. With a science experiment we are testing a hypothesis: we may suspect we know the outcome, but we can’t really be sure until we have completed the actual experiment. In our case, we decide what we would like to predict, then we ask ourselves ‘what data can plausibly be correlated to what we are trying to predict?’, then we put that data into a predictive analytics tool (e.g. 11Ants Model Builder, SAS Enterprise Miner, IBM’s SPSS Modeler, etc.), build a model, and back-test it to see how well it is actually able to predict. Sometimes we get a satisfactory outcome, and sometimes we don’t.
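The experiment loop described above can be sketched in a few lines of Python. This is only a toy, under stated assumptions: the churn history is invented, and the "model" is a simple threshold rule on one input at a time, which is not what 11Ants Model Builder or the other tools mentioned actually do. What it does show is the shape of the experiment: pick a candidate input, build a model on part of the history, and back-test it on the part the model has not seen.

```python
# Hypothetical history: candidate inputs plus the outcome we want to predict.
train = [
    {"support_calls": 5, "shoe_size": 9, "churned": 1},
    {"support_calls": 0, "shoe_size": 10, "churned": 0},
    {"support_calls": 4, "shoe_size": 8, "churned": 1},
    {"support_calls": 1, "shoe_size": 9, "churned": 0},
]
holdout = [
    {"support_calls": 6, "shoe_size": 8, "churned": 1},
    {"support_calls": 1, "shoe_size": 8, "churned": 0},
    {"support_calls": 5, "shoe_size": 10, "churned": 1},
    {"support_calls": 0, "shoe_size": 10, "churned": 0},
]


def fit_threshold(rows, feature):
    """Toy model: a threshold halfway between the two class means."""
    pos = [r[feature] for r in rows if r["churned"] == 1]
    neg = [r[feature] for r in rows if r["churned"] == 0]
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    threshold = (pos_mean + neg_mean) / 2
    high_means_churn = pos_mean >= neg_mean  # which side of the threshold predicts churn
    return threshold, high_means_churn


def back_test(rows, feature, model):
    """Fraction of held-out records the model predicts correctly."""
    threshold, high_means_churn = model
    hits = 0
    for r in rows:
        above = r[feature] > threshold
        predicted = int(above == high_means_churn)
        hits += predicted == r["churned"]
    return hits / len(rows)


# One "experiment" per hypothesis about which data is correlated with churn.
results = {
    feature: back_test(holdout, feature, fit_threshold(train, feature))
    for feature in ["support_calls", "shoe_size"]
}
print(results)
```

On this made-up data the support-call hypothesis back-tests well and the shoe-size hypothesis does no better than a coin toss, which is exactly the "sometimes we get a satisfactory outcome, and sometimes we don't" of the paragraph above.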

So a big challenge to running multiple experiments involving different and disparate data (we’ve already solved the problem of doing different experiments on the same data by automating the algorithm selection with tools like 11Ants Model Builder) is bringing in the data. It lives all over the place, and when you have to herd cats to bring the data into one place to be able to begin working on it, then you have a legitimate constraint.

Effectively, Microsoft appears to have made the herding of the cats a lot easier, for a lot more people. When you make something accessible to a lot more people, interesting things start happening: a lot more science projects get performed, and a lot more useful applications begin to develop.

If a relatively small company had developed this, I am not sure that I would claim it was going to herald an inflection point. But the fact that it is Microsoft means that there is going to be plenty of air time, plenty of credibility, plenty of sales effort, and generally plenty of attention, and I think we will find that the combination of all the above will indeed cause an inflection point.

Thursday, October 6, 2011

Where to Begin with Predictive Analytics and Black Swan Events

James Taylor has written an excellent and informative article for Information Management, Where to Begin with Predictive Analytics. It is recommended reading for executives who are considering how to most prudently begin deploying predictive analytics, but are struggling to determine a clear starting point.

I couldn't agree more with all the points he makes. It also makes me think of a conversation I had yesterday about the flip side of this: Black Swan events - events that are extremely rare, but when they happen completely destroy the performance of any predictive model you have made.

The point James makes about focussing on transactional rather than strategic decisions, in my view, minimizes the Black Swan concern, which is a natural concern for many people when looking at predictive analytics critically (or indeed at any endeavour where one makes future decisions based upon past behaviour).

Micro-decisions made on the basis of thousands to millions of historical examples are likely to contain quite a few black swan events within them (often known as outliers), but they are diluted by the massive number of 'normal' transactions. Further, the magnitude of investment in each micro-decision (recommended by the model) is relatively trivial, so you are not relying on each prediction being perfect, but rather on a statistically significant number being good enough to give you better performance than random.

Accordingly, the Black Swan effect does not become a major element of risk, whereas it certainly can be when making a single high magnitude decision (e.g. how high to place the back-up generator at a nuclear reactor in Fukushima based upon historical high water marks, or whether it would have been safe to bet two years ago that there would be no earthquakes in Christchurch, New Zealand).
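The dilution argument can be put into rough numbers. The sketch below is back-of-envelope arithmetic, not a risk model, and every figure in it is an assumption chosen purely for illustration: 100,000 micro-decisions with a stake of 1 unit each, a model that is right 55% of the time, and 20 black swan events that each cost 50 times the normal stake. It is compared with a single strategic bet of 100,000 units.

```python
def micro_decision_profit(n_decisions, stake, hit_rate, n_black_swans, black_swan_loss):
    """Net result of many small bets: win `stake` on a hit, lose it on a miss,
    minus an extra loss for each rare event the model could not foresee."""
    hits = round(n_decisions * hit_rate)
    misses = n_decisions - hits
    return (hits - misses) * stake - n_black_swans * black_swan_loss


def single_decision_profit(stake, black_swan_occurs):
    """Net result of one large bet: the whole stake is won or lost."""
    return -stake if black_swan_occurs else stake


# Illustrative, made-up figures (see the assumptions in the text above).
micro_normal = micro_decision_profit(100_000, 1, 0.55, 0, 50)
micro_with_swans = micro_decision_profit(100_000, 1, 0.55, 20, 50)
single_normal = single_decision_profit(100_000, black_swan_occurs=False)
single_with_swan = single_decision_profit(100_000, black_swan_occurs=True)

print(micro_normal, micro_with_swans)    # many small decisions
print(single_normal, single_with_swan)   # one big decision
```

Under these assumptions the black swans take a tenth off the profit of the micro-decision portfolio but leave it comfortably positive, while a single black swan turns the one large bet from a full win into a full loss.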

No doubt someone will be able to point out examples where a black swan event can affect micro-decisions too (and please feel free to). However, the point remains that generally you are on significantly more solid ground with transactional micro-decisions than with major strategic ones.