On October 13, 2011 a post appeared on Tim Mallalieu's blog. Tim is a Group Program Manager at Microsoft, and his post revealed that for the last 14 months Microsoft have been very quietly developing an ETL tool. The product has been assigned the name Microsoft Codename Data Explorer (and was previously referred to as Montego). ETL stands for Extract, Transform and Load. At the risk of greatly oversimplifying, an ETL tool lets you access data from multiple storage silos, take only the parts you need, and bring them into one place where you can work on them. For example, we might have some data in our CRM system, some in our transactional database, and (say) weather data from the Weather Channel website. With an ETL tool we can bring all of it into the same place, ready for us to work with, and we can constantly refresh it so that it is always the most recent data (with the ETL already performed on it). The point is not just that the data is in one place, but that only the relevant data we actually require is there, ready for us to do something with. This is pretty neat, though not particularly new. For more, see ETL.
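To make the three steps concrete, here is a minimal sketch in Python. It is not how Data Explorer works (I have not used the product); the three in-memory lists, the field names and the `purchases` table are all made up for illustration, standing in for a CRM system, a transactional database and a weather feed.

```python
import sqlite3

# Hypothetical stand-ins for three separate data silos.
CRM_ROWS = [
    {"customer_id": 1, "name": "Ann", "region": "north", "notes": "vip"},
    {"customer_id": 2, "name": "Bob", "region": "south", "notes": ""},
]
TRANSACTION_ROWS = [
    {"customer_id": 1, "date": "2011-10-01", "amount": 40.0},
    {"customer_id": 1, "date": "2011-10-02", "amount": 15.5},
    {"customer_id": 2, "date": "2011-10-01", "amount": 22.0},
]
WEATHER_ROWS = [
    {"region": "north", "date": "2011-10-01", "temp_c": 12.0},
    {"region": "north", "date": "2011-10-02", "temp_c": 9.5},
    {"region": "south", "date": "2011-10-01", "temp_c": 21.0},
]


def extract():
    """Extract: read each silo (here, just in-memory lists)."""
    return CRM_ROWS, TRANSACTION_ROWS, WEATHER_ROWS


def transform(crm, transactions, weather):
    """Transform: keep only the fields we need and join the three sources."""
    region_by_customer = {row["customer_id"]: row["region"] for row in crm}
    temp_by_region_date = {(w["region"], w["date"]): w["temp_c"] for w in weather}
    combined = []
    for t in transactions:
        region = region_by_customer[t["customer_id"]]
        temp = temp_by_region_date.get((region, t["date"]))
        combined.append((t["customer_id"], region, t["date"], t["amount"], temp))
    return combined


def load(rows, conn):
    """Load: write the combined rows into one table, replacing old contents."""
    conn.execute("DROP TABLE IF EXISTS purchases")
    conn.execute(
        "CREATE TABLE purchases "
        "(customer_id INTEGER, region TEXT, date TEXT, amount REAL, temp_c REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run all three steps; calling this again refreshes the table."""
    load(transform(*extract()), conn)


conn = sqlite3.connect(":memory:")
run_etl(conn)
```

Note that the customer names and notes never reach the `purchases` table: only the fields relevant to the analysis are carried across, and re-running `run_etl` is the "constant refresh".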
What is brand new though, and what really holds the potential to be game changing, is that now nearly anyone will be able to do this. It was previously a highly complicated affair, laden with integration and a lot of complex work. People didn’t want to try this at home, and if they weren’t experts they didn’t really want to try it at work either. But it appears that Microsoft have completely changed that (I say that without having used the product, but the vision here is very clear). This is good in its own right (ETL has many uses), but when it comes to predictive analytics it has some quite serious implications (in a positive way).
At risk of stating the obvious, predictive analytics at its heart relies on building predictive models, and predictive models rely on data – often lots of data, and often disparate data.
Our experience when we launched our desktop modelling tools (11Ants Model Builder, 11Ants Customer Churn Analyzer, and 11Ants CustomerResponse Analyzer) was that, overnight, we were able to trivialize the most technically complex part of model building (I say overnight...the technology took us over three years to develop, but one night it was finished!). Suddenly people who hadn’t contemplated building predictive models found they could build them with very little effort. A business analyst with a basic understanding of Excel could now build models without needing to understand machine learning algorithms, etc. Experienced model builders could also lift the quality of their models while reducing development time (for a paper on how to beat 85% of the submissions in an international data mining contest with less than 50 minutes' work, refer to 11Ants Customer Churn Analyzer outperforms 85% of Submissions in International Predictive Analytics Contest).
However, as all good students of the Theory of Constraints know, as soon as we remove one constraint we clear the way for the next constraint to become the rate limiter (there is a good book about this, incidentally: The Goal). It turns out that once algorithm selection and evaluation have been trivialized, the new rate-limiting step is the extraction and preparation of the data.
The reality is that predictive analytics can be considered like a science experiment, or more correctly, lots of science experiments. As with any science experiment, we are testing a hypothesis: we may suspect we know the outcome, but we can’t really be sure until we have completed the actual experiment. In our case, we decide what we would like to predict, then we ask ourselves ‘what data can plausibly be correlated to what we are trying to predict?’, then we put that data into a predictive analytics tool (e.g. 11Ants Model Builder, SAS Enterprise Miner, IBM’s SPSS Modeler, etc.), build a model, and back-test it to see how well it is actually able to predict. Sometimes we get a satisfactory outcome, and sometimes we don’t.
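One such experiment can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method used by any of the tools named above: the `history` data is invented, and a single cut-off on one input stands in for a real model. The hypothesis is "days since last purchase is correlated with churn"; we fit on the earlier records and back-test on the ones held back.

```python
def fit_threshold(rows):
    """Pick the cut-off on one numeric input that best separates the labels."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in rows}):
        # Predict "churn" when the input is at or above the cut-off.
        correct = sum((x >= threshold) == label for x, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, rows):
    """Fraction of rows where the cut-off rule matches the known outcome."""
    hits = sum((x >= threshold) == label for x, label in rows)
    return hits / len(rows)


# Hypothetical history: (days since last purchase, did the customer churn?)
history = [
    (5, False), (40, True), (12, False), (55, True), (8, False), (33, True),
    (20, False), (60, True), (15, False), (45, True),
]

# Build the model on the first six records, back-test on the last four.
train, holdout = history[:6], history[6:]
threshold = fit_threshold(train)
score = accuracy(threshold, holdout)
print(f"cut-off: {threshold} days, back-test accuracy: {score:.0%}")
```

On real data the score will rarely be this clean, and a poor score simply means that hypothesis did not hold, so we go and fetch different data and run the experiment again.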
So a big challenge to running multiple experiments involving different and disparate data is bringing in the data (we’ve already solved the problem of doing different experiments on the same data by automating the algorithm selection with tools like 11Ants Model Builder). The data lives all over the place, and when you have to herd cats to bring it into one place before you can begin working on it, you have a legitimate constraint.
Effectively, Microsoft appears to have made the herding of the cats a lot easier, for a lot more people. When you make something accessible to a lot more people, interesting things start happening: a lot more science projects get performed, and a lot more useful applications begin to develop.
If a relatively small company had developed this, I am not sure that I would claim it was going to herald an inflection point. But the fact that it is Microsoft means that there is going to be plenty of air time, plenty of credibility, plenty of sales effort, and generally plenty of attention, and I think we will find that the combination of all the above will indeed cause an inflection point.