The reality is that predictive analytics is best thought of as a science
experiment, or more correctly, as lots of science experiments. In a science
experiment we are testing a hypothesis: we may suspect we know the outcome,
but we can’t really be sure until we have completed the actual experiment.
In our case, we decide what we would like to predict, then we ask ourselves ‘what
data can plausibly be correlated with what we are trying to predict?’. We then
put that data into a predictive analytics tool (e.g. 11Ants Model Builder, SAS
Enterprise Miner, IBM SPSS Modeler, etc.), build a model, and back-test it to
see how well it actually predicts. Sometimes we get a satisfactory outcome,
and sometimes we don’t.
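The experiment loop described above (pick a target, pick plausibly correlated data, build a model, back-test it) can be sketched in a few lines of plain Python. This is only an illustration and not how any of the tools named above work internally: the synthetic customer data, the single-feature threshold "model" and the helper names `make_customers`, `fit_threshold` and `backtest` are all assumptions made up for the example.

```python
import random

def make_customers(n, seed=0):
    """Synthetic (made-up) data: months_inactive is loosely correlated with churn."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        months_inactive = rng.randint(0, 12)
        # Assumption for illustration: churn probability rises with inactivity.
        churned = rng.random() < months_inactive / 12
        rows.append((months_inactive, churned))
    return rows

def accuracy(rows, threshold):
    """Fraction of rows where 'inactive >= threshold' matches the real outcome."""
    hits = sum((x >= threshold) == y for x, y in rows)
    return hits / len(rows)

def fit_threshold(train):
    """'Build the model': pick the cut-off that best separates the training rows."""
    return max(range(0, 14), key=lambda t: accuracy(train, t))

def backtest(rows, train_fraction=0.7):
    """Fit on the first part of the data, score on the part the model never saw."""
    split = int(len(rows) * train_fraction)
    train, holdout = rows[:split], rows[split:]
    threshold = fit_threshold(train)
    return threshold, accuracy(holdout, threshold)

if __name__ == "__main__":
    threshold, score = backtest(make_customers(1000))
    print(f"threshold={threshold}, holdout accuracy={score:.2f}")
```

The important part is the split: the model is scored only on rows it did not see while being built, which is what tells us whether the hypothesis ("inactivity predicts churn") held up or not.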
This blog is about applying predictive analytics to real world problems in business, science and government.
Saturday, October 29, 2011
Microsoft Data Explorer - Predictive Analytics Next Inflection Point?
A very interesting development from Microsoft which may well
bring a major inflection point in the adoption of
predictive analytics.
On October 13, 2011 a post appeared on Tim Mallalieu's Blog.
Tim is a Group Program Manager at Microsoft, and his blog post revealed that
for the last 14 months Microsoft have been very quietly working on developing
an ETL tool. The product has been assigned the name Microsoft Codename Data Explorer (and was previously referred to as Montego). ETL stands for Extract, Transform and Load. At risk of greatly
oversimplifying, an ETL tool enables you to access data from
multiple storage silos, take only the parts of the data that you need, and bring them into a place where you
can work on them. So for example we can have some data in our CRM system,
some data in our transactional database, and (say) weather data from the Weather
Channel website. With an ETL tool, we
can bring all the data into the same place, so it is ready for us to work
with, and we can constantly refresh the data so that it is always the most recent
data (but with the ETL performed on it). It is not just that the data is there in one place,
but that only the relevant data that we actually require is there, in one
place, ready for us to do something with. This is actually pretty neat, though
it is not particularly new. For more see ETL.
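To make the three steps concrete, here is a minimal sketch of the CRM, transactions and weather example using only the Python standard library. It has nothing to do with how Data Explorer itself works (I have not used it); the sample rows, the column names and the `extract`, `transform` and `load` helpers are all made up for illustration.

```python
import csv
import io
import sqlite3

# Three hypothetical "silos"; in practice these would be a CRM export,
# a transactional database and a weather feed. All values are made up.
CRM_CSV = """customer_id,name,segment,phone
1,Ann,retail,555-0101
2,Bob,wholesale,555-0102
"""
TRANSACTIONS = [
    {"customer_id": 1, "date": "2011-10-01", "amount": 40.0, "till": "A"},
    {"customer_id": 2, "date": "2011-10-01", "amount": 310.0, "till": "B"},
    {"customer_id": 1, "date": "2011-10-02", "amount": 15.5, "till": "A"},
]
WEATHER = {"2011-10-01": "rain", "2011-10-02": "sun"}

def extract():
    """Extract: read each source in its native shape."""
    crm_rows = list(csv.DictReader(io.StringIO(CRM_CSV)))
    return crm_rows, TRANSACTIONS, WEATHER

def transform(crm_rows, transactions, weather):
    """Transform: keep only the needed fields and join them into one row per sale."""
    segment_by_id = {int(r["customer_id"]): r["segment"] for r in crm_rows}
    return [
        (t["customer_id"], segment_by_id[t["customer_id"]], t["date"],
         t["amount"], weather.get(t["date"], "unknown"))
        for t in transactions
    ]

def load(rows):
    """Load: write the combined rows to one place (an in-memory SQLite table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer_id INTEGER, segment TEXT, "
        "date TEXT, amount REAL, weather TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn

if __name__ == "__main__":
    conn = load(transform(*extract()))
    for row in conn.execute("SELECT * FROM sales ORDER BY date, customer_id"):
        print(row)
```

Note that the irrelevant fields (names, phone numbers, till identifiers) never reach the final table: only the data we actually require ends up in one place, and re-running the three functions is what "refreshing" amounts to in this toy version.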
What is brand new though, and what really holds the
potential to be game changing, is that now nearly anyone will be able to do this. It was previously a highly complicated
affair, laden with integration and a lot of complex work – people didn’t want
to try this at home, and if they weren’t experts they didn’t really want to try
it at work either. But it appears that
Microsoft have completely changed that (I say that without having used the product, but the vision here is very clear). This is good in its own right (ETL has
many uses), but when it comes to predictive analytics it has some quite
serious implications (in a positive way).
At risk of stating the obvious, predictive analytics at its
heart relies on building predictive models, and predictive models rely on data –
often lots of data, and often disparate data.
Our experience when we launched our desktop modelling tools
(11Ants Model Builder, 11Ants Customer Churn Analyzer, and 11Ants Customer Response Analyzer) was that overnight we were able to trivialize the
technically most complex part of model building (I say overnight...the
technology took us over three years to develop, but one night it was finished!) and suddenly people who hadn’t
contemplated building predictive models found they could build them with very
little effort. Now a business analyst with a basic understanding of
Excel could build models with no requirement for understanding machine learning
algorithms, etc. Experienced model builders could also lift the quality of
their models with reduced development time (for a paper on how to beat 85% of
the submissions in an international data mining contest with less than 50
minutes' work, refer to 11Ants Customer Churn Analyzer outperforms 85% of Submissions in International Predictive Analytics Contest).
However as all good students of the Theory of Constraints
know, as soon as we remove one constraint, we clear the way for the next
constraint to become the rate limiter (there is a good book about this incidentally:
The Goal). Well it turns out that the
new rate limiting step after the trivializing of the algorithm selection and evaluation is the
extraction and preparation of the data.
So a big challenge to running multiple experiments involving
different and disparate data (we’ve already solved the problem of doing different experiments
on the same data by automating the algorithm selection with tools like 11Ants
Model Builder) is bringing in the data. It lives all over the place, and when
you have to herd cats to bring the data into one place to be able to begin
working on it, then you have a legitimate constraint.
Effectively Microsoft appears to have made the herding of
the cats a lot easier, for a lot more
people. When you make something accessible to a lot more people, interesting
things start happening: a lot more science projects get performed, and a lot more useful applications begin to develop.
If a relatively small company had developed this, I am not sure that I would make the claim that it was
going to herald an inflection point, but the fact that it is Microsoft means
that there is going to be plenty of air time, plenty of credibility, plenty
of sales effort, and generally plenty of attention, and I think we will find
that the combination of all the above will indeed cause an inflection point.
Thursday, October 6, 2011
Where to Begin with Predictive Analytics and Black Swan Events
James Taylor has written an excellent and informative article for Information Management, Where to Begin with Predictive Analytics. It is recommended reading for executives who are considering how to begin deploying predictive analytics most prudently, but are struggling to determine a clear starting point.
I couldn't agree more with all the points he makes. It also makes me think of a conversation I had yesterday about the flip side of this: Black Swan events - events that are extremely rare, but when they happen completely destroy the performance of any predictive model you have made.
The point James makes about focussing on transactional versus strategic decisions in my view minimizes the Black Swan concern, which is a natural concern for many people when looking at predictive analytics critically (or indeed any endeavour where one makes future decisions based upon past behaviour).
Micro-decisions are made on the basis of thousands to millions of historical examples. These examples are likely to contain quite a few black swan events (often known as outliers), but they are diluted by the massive number of 'normal' transactions. Further, the magnitude of investment in each micro-decision (recommended by the model) is relatively trivial, so you are not relying on each prediction being perfect, but rather on a statistically significant number being good enough to give you better performance than random.
Accordingly, the Black Swan effect does not become a major element of risk - whereas it certainly can be when making a single high magnitude decision (e.g. how high to place the back-up generator at a nuclear reactor in Fukushima, based upon historical high water marks or whether it would have been safe to make a bet two years ago that there would be no earthquakes in Christchurch, New Zealand).
No doubt someone will be able to point out examples where a black swan event can affect micro-decisions too (and please feel free to). However, the point remains that you are generally on significantly more solid ground with transactional micro-decisions than with major strategic ones.
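The dilution argument can be put into rough arithmetic. The sketch below is a deliberately simplified illustration: the win rate, stake sizes and number of black swan cases are invented numbers, not figures from James's article or from any real model, and the function names are hypothetical.

```python
def micro_decision_total(n, stake, win_rate, n_black_swans):
    """Expected net result of n small bets of equal size.

    Assumptions (made up for illustration): each normal bet wins `stake` with
    probability win_rate and loses it otherwise; each black swan bet simply
    loses its stake, whatever the model predicted.
    """
    normal = n - n_black_swans
    edge = win_rate - (1 - win_rate)  # expected gain per unit staked
    return normal * stake * edge - n_black_swans * stake

def single_decision_total(stake, win_rate, black_swan_occurs):
    """One large bet: the expected gain normally, the whole stake lost to a black swan."""
    if black_swan_occurs:
        return -stake
    return stake * (win_rate - (1 - win_rate))

if __name__ == "__main__":
    many = micro_decision_total(n=100_000, stake=1.0, win_rate=0.55, n_black_swans=100)
    one = single_decision_total(stake=100_000.0, win_rate=0.55, black_swan_occurs=True)
    print(f"100,000 micro-decisions with 100 black swans: {many:,.0f}")
    print(f"one large decision hit by a black swan:       {one:,.0f}")
```

With these made-up numbers, the same total amount is at risk in both cases. The 100 black swans cost the micro-decision portfolio only a little over a hundred of its roughly 10,000 expected gain, whereas a single black swan turns the one large decision into a loss of the entire stake.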