Saturday, December 24, 2011

Reducing Customer Churn at a Mobile Telecommunications Operator

We recently released a case study about how mobile telephone operator 2degrees was able to utilize predictive analytics to identify those customers at risk of churning (leaving). This customer analytics solution utilizes propensity modelling to identify which customers are statistically most at risk of churning. 2degrees used 11Ants Customer Churn Analyzer to achieve an increase in identification rate of 1275%. Read the full case study here.

Monday, December 12, 2011

Predictive Analytics to Increase Marketing Response Rate by 166%

We recently released a case study on how e-commerce site Trade Me increased response rate in an electronic direct mail (EDM) campaign using predictive analytics. This customer analytics solution utilizes propensity modelling to identify which customers are statistically most likely to respond to a campaign offer. Trade Me used 11Ants Customer Response Analyzer to achieve an increase in response rate of 166%. Read the full case study here.

Monday, November 7, 2011

McKinsey Quarterly - Package on 'Big Data'

Very interesting reading in McKinsey Quarterly on 'Big Data' this month. It requires (free) registration, but the articles are very good:

Are you ready for the era of 'big data'?
Radical customization, constant experimentation, and novel business models will be new hallmarks of competition as companies capture and analyze huge volumes of data. Here’s what you should know.

MIT professor Erik Brynjolfsson, Cloudera cofounder Jeff Hammerbacher, and Butler University men’s basketball coach Brad Stevens reflect on the power of data.

Companies are learning to use large-scale data gathering and analytics to shape strategy. Their experiences highlight the principles—and potential—of big data.

CEOs should shake up the technology debate to ensure that they capture the upside of technology-driven threats. Here’s how.

Saturday, October 29, 2011

Microsoft Data Explorer - Predictive Analytics' Next Inflection Point?

A very interesting development from Microsoft which may well bring a major inflection point in the adoption of predictive analytics.


On October 13, 2011 a post appeared on Tim Mallalieu's Blog. Tim is a Group Program Manager at Microsoft, and his blog post revealed that for the last 14 months Microsoft have been very quietly working on developing an ETL tool. The product has been assigned Microsoft Codename Data Explorer (and was previously referred to as Montego). ETL stands for Extract, Transform and Load. At risk of greatly oversimplifying it, what an ETL tool enables you to do is access data from multiple storage silos, take only the parts of the data that you need, and bring them into one place where you can work on them. So for example we can have some data in our CRM system, some data in our transactional database, and (say) weather data from the Weather Channel website; with an ETL tool, we can bring all the data into the same place, ready for us to work with, and we can constantly refresh it so that it is always the most recent data (but with the ETL performed on it). Not just that it is all in one place, but that only the relevant data we actually require is there, ready for us to do something with. This is actually pretty neat, though it is not particularly new. For more see ETL.
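To make the idea concrete, here is a minimal hand-rolled sketch of the kind of extract-transform-load work such a tool automates, written in Python with pandas. The file names, database, and weather feed are all hypothetical, and this is not how Data Explorer itself works - it just illustrates the pattern.

```python
# A hand-rolled sketch of the ETL pattern: extract from several silos,
# transform to just the columns we need, load into one analysis-ready table.
# File names, table names, and columns are hypothetical.
import sqlite3
import pandas as pd

# Extract: three separate silos
conn = sqlite3.connect("transactions.db")
crm = pd.read_csv("crm_customers.csv")                      # customer_id, region, segment, ...
orders = pd.read_sql("SELECT customer_id, order_date, amount FROM orders", conn)
weather = pd.read_csv("daily_weather.csv")                  # order_date, region, rainfall_mm, ...

# Transform: keep only what we need, and make the join keys consistent
orders["order_date"] = pd.to_datetime(orders["order_date"])
weather["order_date"] = pd.to_datetime(weather["order_date"])

# Load: one combined table, refreshed whenever we re-run this script
combined = (orders
            .merge(crm[["customer_id", "region"]], on="customer_id", how="left")
            .merge(weather, on=["order_date", "region"], how="left"))
combined.to_csv("analysis_ready.csv", index=False)
```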


What is brand new though, and what really holds the potential to be game changing, is that now nearly anyone will be able to do this. It was previously a highly complicated affair, laden with integration and a lot of complex work – people didn't want to try this at home, and if they weren't experts they didn't really want to try it at work either. But it appears that Microsoft have completely changed that (I say that without having used the product, but the vision here is very clear). This is good in its own right (ETL has many uses), but when it comes to predictive analytics it has some quite serious implications (in a positive way).


 At risk of stating the obvious, predictive analytics at its heart relies on building predictive models, and predictive models rely on data – often lots of data, and often disparate data.  


Our experience when we launched our desktop modelling tools (11Ants Model Builder, 11Ants Customer Churn Analyzer, and 11Ants Customer Response Analyzer) was that overnight we were able to trivialize the technically most complex part of model building (I say overnight... the technology took us over three years to develop, but one night it was finished!) and suddenly people who hadn't contemplated building predictive models found they could build them with very little effort. A business analyst with a basic understanding of Excel could now build models with no requirement for understanding machine learning algorithms, etc. Experienced model builders could also lift the quality of their models with reduced development time (for a paper on how to beat 85% of the submissions in an international data mining contest with less than 50 minutes' work refer to 11Ants Customer Churn Analyzer outperforms 85% of Submissions in International Predictive Analytics Contest).
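For readers who like to see the moving parts, the core idea being automated is 'try several algorithm families, evaluate each honestly, and keep the best'. The sketch below shows that loop in scikit-learn on synthetic data; it is illustrative only and is not the 11Ants implementation.

```python
# A minimal sketch of automated algorithm selection: try several model
# families and keep whichever cross-validates best. Not the 11Ants
# implementation, just the general idea, shown on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```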


However, as all good students of the Theory of Constraints know, as soon as we remove one constraint, we clear the way for the next constraint to become the rate limiter (there is a good book about this incidentally: The Goal). Well, it turns out that once algorithm selection and evaluation have been trivialized, the new rate-limiting step is the extraction and preparation of the data.


The reality is that predictive analytics can be considered like a science experiment, or more correctly, lots of science experiments. As with any science experiment, we are testing a hypothesis: we may suspect we know the outcome, but we can't really be sure until we have completed the actual experiment. In our case, we decide what we would like to predict, then we ask ourselves 'what data can plausibly be correlated to what we are trying to predict?', then we put that data into a predictive analytics tool (e.g. 11Ants Model Builder, SAS Enterprise Miner, IBM SPSS Modeler, etc.), build a model, back-test it, and see how well it actually predicts. Sometimes we get a satisfactory outcome, and sometimes we don't.
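One such 'experiment', written out by hand rather than through a tool, might look like the sketch below: fit on history, back-test on a held-out slice, and compare against a naive baseline before deciding whether the hypothesis has earned its keep. The file and column names are hypothetical.

```python
# One experiment: train on history, back-test on a held-out slice, and compare
# the model against a naive baseline. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("history.csv")                 # candidate input columns plus a 'target' column
train, test = df.iloc[:-90], df.iloc[-90:]      # back-test on the most recent 90 rows

model = GradientBoostingRegressor().fit(train.drop(columns="target"), train["target"])
predictions = model.predict(test.drop(columns="target"))

model_mae = mean_absolute_error(test["target"], predictions)
naive_mae = mean_absolute_error(test["target"], [train["target"].mean()] * len(test))
print(f"model MAE: {model_mae:.1f}  vs  naive baseline MAE: {naive_mae:.1f}")
```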


 So a big challenge to running multiple experiments involving different and disparate data (we’ve already solved the problem of doing different experiments on the same data by automating the algorithm selection with tools like 11Ants Model Builder) is bringing in the data. It lives all over the place, and when you have to herd cats to bring the data into one place to be able to begin working on it, then you have a legitimate constraint.


 Effectively Microsoft appears to have made the herding of the cats a lot  easier, for a lot more people. When you make something accessible to a lot more people, interesting things start happening, a lot more science projects get performed,  and a lot more useful applications begin to develop.


If a relatively small company had developed this, I am not sure that I would make the claim that it was going to herald an inflection point, but the fact that it is Microsoft means that there is going to be plenty of air time, plenty of credibility, plenty of sales effort, and generally plenty of attention, and I think we will find that the combination of all the above will indeed cause an inflection point.

Thursday, October 6, 2011

Where to Begin with Predictive Analytics and Black Swan Events

James Taylor has written an excellent and informative article for Information Management, Where to Begin with Predictive Analytics. It is recommended reading for executives considering how to most prudently begin deploying predictive analytics but struggling to determine a clear starting point.

I couldn't agree more with all the points he makes. It also makes me think of a conversation I had yesterday about the flip side of this:  Black Swan events - events that are extremely rare, but when they happen completely destroy the performance of any predictive model you have made.

The point James makes about focussing on transactional versus strategic decisions in my view minimizes the Black Swan concern, which is a natural concern for many people when looking at predictive analytics critically (or indeed any endeavour where one makes future decisions based upon past behaviour).

Micro-decisions made on the basis of thousands to millions of historical examples are likely to contain quite a few black swan events within them (often known as outliers), but they are diluted by the massive number of 'normal' transactions. Further, the magnitude of investment in each micro-decision (recommended by the model) is relatively trivial, so you are not relying on each prediction being perfect, rather on a statistically significant number being good enough to give you better performance than random.

Accordingly, the Black Swan effect does not become a major element of risk - whereas it certainly can be when making a single high-magnitude decision (e.g. how high to place the back-up generator at a nuclear reactor in Fukushima based upon historical high-water marks, or whether it would have been safe to bet two years ago that there would be no earthquakes in Christchurch, New Zealand).

No doubt someone will be able to point out examples where a black swan event can affect micro-decisions too (and please feel free to), however the point remains that generally you are on significantly more solid ground with transactional micro-decisions than major strategic ones.


Thursday, September 29, 2011

Sports Analytics

Tom Davenport writes a good review of Moneyball. A different take on it than Roger Ebert would present - but you will probably learn more from Davenport's commentary. You can read it here: http://blogs.hbr.org/davenport/2011/09/six_things_your_company_has_in.html .

Sports analytics is becoming serious business. MIT annually holds a sports analytics conference, the aptly named MIT Sloan Sports Analytics Conference. There is also a great company based in Chicago, but operating globally, which houses massive volumes of sports data, named (also aptly!) STATS. STATS have been collecting sports data for over 30 years and supply most of the player stat data shown on screen in televised sports events. As you can imagine, the demand for this type of data is increasing, as is the sophistication of the data which can be presented. STATS recently purchased the Israeli company SportVU, which among other things automates the gathering of player and ball movement data. This space is set to become very interesting for data scientists.

Feel free to add to the comments - what innovative things would you be aiming to do with the base sports statistical data if you were a company like STATS?

Wednesday, September 28, 2011

Predicting Hospital Admissions - 11Ants Analytics Excels in First Milestone of Heritage Health Prize

Some time ago I wrote about the Heritage Provider Network Health Prize . The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to a hospital within the next year, using historical claims data. This competition is hosted by the revolutionary and cool site Kaggle.

If you can do this better than anyone else you get $3 million USD for your trouble. Nice. Though, as you would expect when someone is nonchalantly handing out $3 million, you are far from the only one that benefits... according to Heritage, in 2006 well over $30 billion USD was spent on unnecessary hospital admissions. This is a prime example of deriving value out of data, and refreshingly not just financial value - anybody who has spent any time in a hospital knows that it is not much fun. You will also know that the ripples from hospital admissions travel far and wide: from the patients themselves, to medical staff, to employers, to friends and family, to insurance companies, and so it goes on.

One of the few universal truths would certainly have to be the more people we can keep out of hospital the better.

The premise is that if we can identify patterns in claims data that help to predict which patients are at risk of being admitted to hospital (a very expensive event on every metric), we can intervene, and can even afford to spend a reasonable amount of resource on such interventions to try to keep the patient out of hospital.

We at 11Ants Analytics entered this competition and to date are pretty pleased with the results. At the end of the first milestone we are ranked 13th out of 971 players, which puts us in the top 1.3% of contestants.

So if you are an executive at an insurance company, a health maintenance organization, a provider network, or a government health scheme, and would be interested in applying advanced predictive analytics to pulling tens of millions, or even hundreds of millions, of dollars out of your costs, feel free to get in touch. We would be very interested to discuss.

Saturday, May 21, 2011

Putting Predictive Analytics to Work in Operations

A great white paper by James Taylor at Decision Management Solutions. Definitely recommended reading:

Putting Predictive Analytics to Work in Operations

"Using Decision Management to maximize the value of predictive analytics.

Predictive analytics applied to operational decision making is the next major source of competitive advantage. The most successful companies are using Decision Management to put predictive analytics to work powering the day-to-day decisions that impact performance most..."

Friday, March 18, 2011

Predictive Analytics - The New Avenue for Cost Savings at Transport and Logistics Companies

It is hard to squeeze blood out of a stone, as the saying goes - and the CFOs of most logistics companies around the world probably find this saying to be particularly appropriate right about now. However, CFOs who think they've run out of opportunities to cut costs may just find themselves pleasantly surprised.

This discussion will center around using predictive analytics to better forecast resource requirements at transport and logistics companies. Predictive analytics is the technique of exploiting patterns in transactional data to make predictions ahead of time. In this case we will look at it in the context of predicting package volumes, on the assumption that if you know your volumes, you can more efficiently schedule your resources.

The case for looking at this aggressively is quite strong. If we look at a company with $4 million per month of variable cost and we can manage to reduce this by 1% through more accurate forecasting, we are talking about a $40,000 per month ($480,000 p.a.) saving; if we can get this up to 5%, the figure moves to $200,000 per month ($2.4 million p.a.). It is not unreasonable to expect improvement in forecasting accuracy within this range.
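For anyone who wants to plug in their own numbers, the arithmetic is simply the variable cost base multiplied by the improvement rate:

```python
# The savings arithmetic from the paragraph above, ready to re-run with your own figures.
monthly_variable_cost = 4_000_000              # dollars per month
for improvement in (0.01, 0.05):               # 1% and 5% reductions
    monthly_saving = monthly_variable_cost * improvement
    print(f"{improvement:.0%}: ${monthly_saving:,.0f} per month, ${monthly_saving * 12:,.0f} per year")
```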

Generally, business executives will prefer to operate a business that is well scheduled as opposed to reactive. There are a number of good reasons for this: it is less expensive to deploy scheduled resource than emergency resource; all things being equal, you have a better chance of maintaining high service levels in a scheduled operation than in a reactive one; and, not to be under-estimated, a scheduled operation seems to increase the chances of management and staff alike maintaining their sanity.

We can safely say that a business running closer to the scheduled end of the scheduled/reactive continuum is better for all stakeholders (including the fork lift driver's wife who knows if her husband will be home for dinner or not..).

So, beginning with looking at high level tasks and resources:

If you are a transportation company, your job is to deliver goods from point A to point B. The fundamental steps are:

1. Pick goods up from your customer.
2. Deliver to hub for sorting.
3. Deliver goods to your customer's customer.

Clearly a gross over-simplification, but sometimes it pays to keep things simple.

So what are our resources?
  • Trucks (our own)
  • Trucks (third parties, regularly scheduled)
  • Trucks (third parties, scheduled at short notice)
  • Labor
In keeping with our spirit of over-simplification, let's also assume that our resource requirement scales in a somewhat linear fashion, though it has capacity steps in it. That is to say, to deliver 10,000 packages we will require approximately double the resource that we would require to deliver 5,000 packages. Again this over-simplifies the case, but it does not affect the fundamentals of our discussion.

This benefits us, in that it makes our objective very clear - if we can accurately predict package volume, we can accurately predict resource requirement. So our objective is to predict package volume as accurately as we possibly can.

Now we will add one qualification to this objective. It is infinitely better to be able to forecast resource requirements 24 hours out rather than 1 hour out; in fact, one week out would be better still. So we can state our objective (for example) as:

 To increase our ability to predict package volumes 24 hours ahead of time, so that we can more efficiently schedule resources.

The CFO will like this. She is sick of paying for labor that was not required. Not nearly as sick, mind you, as she is of paying the invoices to trucking companies that had to be called in at the 11th hour in order to provide extra capacity and get orders out on time. With this objective we are hoping to get closer to the theoretical nirvana of 'no tasks without resources, and no resources without tasks'.

So now we have established what we are trying to do, and why, let's look at how.

The 'how' is by bringing in techniques that have been in use in another industry - the insurance industry - since 1762. Just as insurance companies take an actuarial approach to the expected payout for any given customer, we are going to take a similar approach to forecasting the projected volumes on any given day.

To clarify how it works in insurance, let's use the example of life insurance... life insurance companies don't typically have one premium rate for all customers. Rather, they ask you specific questions: your age, your gender, whether you are a smoker or non-smoker, etc. Then they go back and compare what you have told them with historical data on how long people of your age, gender, and smoker status tend to live. Obviously if you are statistically likely to live longer, your premium should be lower than that of someone who is statistically likely to live for a shorter period.

But returning to our case, rather than taking into account age, smoking, gender, etc we are going to analyze our data at a granular level and find the drivers that affect package volumes. We are then going to build a predictive model, which helps us determine what our volumes are statistically likely to be tomorrow, and this will become an important reference piece of information for our resource planning for the next day.

So far all this sounds good. In fact it makes a lot of sense. Rather than making general, unscientific assumptions about our volumes, we will take a scientific approach to it. However, it is about this point that most executives at transportation companies instinctively start coming up with reasons that while this may work in other industries, it wouldn't work in their industry.

We don't really have much data. An insurance company can ask all those questions. All we have really is 'date and number of packages shipped'.

If I had a dollar for every industry that thought their data was different, this blog would be written from a super yacht in the Mediterranean. Yes, it is a fact that every industry is different, but this does not mean there is not more data there than you would initially imagine, nor that there are not patterns in that data which can be exploited. For example, most people look at a date field like '03/26/2011' and see one piece of information. Actually, you would be surprised how much information we can extract out of a seemingly innocuous field like that (a quick code sketch follows the list below):

  • Year (2011)
  • Quarter (First)
  • Month (March)
  • Day of Month (26)
  • Day of Week (Saturday)
  • Week of Year (12)
  • Week of Quarter (12)
  • Week of Month (3)
  • Public Holiday (No)
  • Weekend (Yes)
  • Days Since Last Public Holiday (36)
  • Days Until Next Public Holiday (31)
  • Days Until End of Month (5)
  • Season (Spring) (or if in Southern Hemisphere, Autumn)
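Here is a rough sketch of that expansion in Python with pandas. The public holiday calendar is a placeholder, so the holiday-related counts will not match the example figures above, but the mechanics are the same.

```python
# Expanding a single date into more granular features. The holiday list is a
# placeholder calendar, so the holiday counts are illustrative only.
import pandas as pd

holidays = pd.to_datetime(["2011-01-01", "2011-02-06", "2011-04-22", "2011-04-25"])
d = pd.Timestamp("2011-03-26")

features = {
    "year": d.year,
    "quarter": d.quarter,
    "month": d.month_name(),
    "day_of_month": d.day,
    "day_of_week": d.day_name(),
    "week_of_year": int(d.isocalendar().week),
    "is_weekend": d.dayofweek >= 5,
    "is_public_holiday": d in holidays,
    "days_since_last_public_holiday": (d - holidays[holidays <= d].max()).days,
    "days_until_next_public_holiday": (holidays[holidays > d].min() - d).days,
    "days_until_end_of_month": (d + pd.offsets.MonthEnd(0) - d).days,
}
print(features)
```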
So there you go, we have miraculously transformed our date field into 14 pieces of more granular information. More importantly, the granularity allows us to consider which of these things correlate with package volumes, something '03/26/2011' alone could never tell us. An executive may intuitively observe:

  • We do tend to transport more at the end of the month, when our clients are doing end of month promotions.
  • There is a great deal of seasonality in our volumes.
  • Quarter ends are especially busy times.
  • We ship nothing on public holidays.
  • Fridays are always busy.
So that was just the date. Now let's look at the next thing: historical package volumes.

Let's say our info says: 03/26/2011  5,920 Packages Shipped to 3,450 Locations

Let's see what we can do with that:

  • Volume today.
  • Volume yesterday.
  • Volume today - volume yesterday.
  • Volume one week ago.
  • Volume today - volume one week ago.
  • Volume one month ago.
  • Volume today - volume one month ago.
  • Volume one year ago
  • Volume today - volume one year ago.
  • Etc
As you can see these things are also likely to be pertinent, as they will capture and reflect growth trends, etc.
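A sketch of how those lag features can be derived from a simple daily history, again in pandas; the file and column names are hypothetical:

```python
# Deriving lag features from a daily volume history. Assumes a CSV with
# 'date' and 'packages' columns (hypothetical).
import pandas as pd

daily = (pd.read_csv("daily_volumes.csv", parse_dates=["date"])
           .set_index("date")["packages"]
           .asfreq("D"))

lag_features = pd.DataFrame({
    "volume_today": daily,
    "volume_yesterday": daily.shift(1),
    "change_vs_yesterday": daily - daily.shift(1),
    "volume_week_ago": daily.shift(7),
    "change_vs_week_ago": daily - daily.shift(7),
    "volume_month_ago": daily.shift(28),
    "change_vs_month_ago": daily - daily.shift(28),
    "volume_year_ago": daily.shift(365),
    "change_vs_year_ago": daily - daily.shift(365),
})
```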

We can also look at making our data more granular by extracting data at a customer level.

Finally, we also have data in our arsenal that we do not even own - third-party data. Examples of this are:

  • Economic confidence data
  • GDP data
  • Unemployment data
  • Stock market data
  • Building permit data (if, for example, we ship primarily construction products)
The point of all of the above is not to prescribe exactly what data we should be throwing into the mix, but rather to encourage executives to think about which data may have a correlation to their volumes, and to illustrate that they have a lot more to work with than they may have imagined at first glance.
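To show how the pieces fit together, here is a sketch that continues from the lag-feature table above: add a 'tomorrow' target column, train a regression model, and back-test it on the most recent days. It is illustrative only; in practice you would also join in the date features and any third-party data.

```python
# Continuing from the lag_features table above: predict tomorrow's volume.
# Illustrative only; a real feature table would include the date features too.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

data = lag_features.assign(volume_tomorrow=lag_features["volume_today"].shift(-1)).dropna()
train, test = data.iloc[:-60], data.iloc[-60:]          # back-test on the last 60 days
feature_cols = [c for c in data.columns if c != "volume_tomorrow"]

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(train[feature_cols], train["volume_tomorrow"])

forecast = model.predict(test[feature_cols])
print("back-test MAPE:", mean_absolute_percentage_error(test["volume_tomorrow"], forecast))
```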

Okay - I concede we may have the data. But how could we possibly analyze that data and exploit it without a team of PhDs in statistics or mathematics, or whatever it is?!

Relax. There are companies like 11Ants Analytics that do this sort of thing - and things infinitely more complicated - every day. They will walk you through the whole process, analyze the data, and even customize their software solution for you so that you can integrate it into your business operations.

But we could never do our scheduling based upon what a black box program told us to do - what if it was wrong?!

The same could be said for the speedometer on your car - sometimes you just want to trust that the system is doing its job. That being said, there may be events that crop up on the day that render the prediction made 24 hours prior obsolete. This is not a big deal: you create your best scientific guess 24 hours out, and then you use it as an important frame of reference, which can always be over-ridden if the evidence requires it. You are still likely to have more resource scheduled correctly than you would have if forecasting less scientifically.

Another consideration is that there may be trends that occur during the day that can also be modeled in a similar fashion, serving as an 'early warning system'. Effectively, when we see volumes during the day spiking beyond what we forecast, we can even model what that may mean the rest of the day has in store for us. Having these inputs even a few hours earlier is more useful than getting them at the very last minute.

It is absolutely impossible that we can forecast volumes 100% accurately using predictive analytics.


Correct. You certainly will not - these are statistical predictions. However, keep in mind that all we are looking for is improvement. The question to ask yourself is: by using analytics techniques like this, could we possibly get 1% better at forecasting? 5% better? 10% better? The answer will be different for everyone, but as you've seen above, you don't need a huge improvement in forecasting to start saving some serious money - and you certainly don't need to be predicting at 100% accuracy, simply better than you are now.

It just sounds like a lot of work on our end, extracting the data, preparing it, etc.!

Probably not as much as you would imagine. Regardless, this decision needs to be evaluated on a straightforward ROI basis. As one would expect, savings of this magnitude are going to require some investment, but the justification for the effort should not be too hard to demonstrate. We are probably going to have to look pretty hard elsewhere to find savings that equate to pulling 1% - 5% out of our variable cost base.

If you have any questions about any of this, feel free to drop me an email.

Wednesday, March 2, 2011

Webinar - An Introduction to Predictive Analytics with the 11Ants Predictive Suite

If you are interested in learning a little bit more about predictive analytics, we will be doing a webinar next week - I've posted all the invite details directly below:

Webinar - An Introduction to Predictive Analytics with the 11Ants Predictive Suite


Thursday, March 9, 2011 - 3:00pm US Eastern Time


A webinar for those that are new to Predictive Analytics, including an introduction to the 11Ants Predictive Suite.

The webinar is an introduction to Predictive Analytics. It will cover techniques on how to approach Predictive Analytics as a beginner - with a particular focus on the importance of preparing your data and understanding the outcomes you hope to achieve.

The good news is that even though Predictive Analytics can be a complicated subject, using the techniques presented in the webinar, coupled with 11Ants Analytics software, just about anyone who has Microsoft Excel and some data to model can get going within a few hours.

Who should attend? Anyone who has an interest in Predictive Analytics and learning how to derive value from data.

To register please go to our webinar registration page.

At the same time, you may like to visit the 11Ants Analytics website to see what is new.

Thank you!

Tuesday, February 22, 2011

An Introduction to Predictive Analytics Contests and How a Non-Data Scientist Ranked in the Top 20% in a Global Predictive Analytics Contest

Just another of the many benefits of the social web is the ability to crowd-source expertise to help with business problems.

A great example of this is predictive analytics - or data mining - contests. The first time this was probably ever noticed by mainstream business was when Netflix announced a prize of one million dollars to anybody on the planet who could increase their ability to predict whether someone was likely to enjoy a movie or not - The Netflix Prize. Genius concept. For a mere million dollars (not even at risk unless results were delivered) plus some logistics costs, they obtained access to literally the smartest data scientists in the world. Consider for a minute: if they had set up an internal department tasked with solving this problem, a million dollars annually would not take them very far - a handful of staff, and of course they would be limited to whoever was looking for a job at that time. Instead, via this competition, 51,051 contestants entered from 186 different countries! You can be sure that contestants (now their outsourced predictive analytics department) ran the gamut from NASA scientists to university students.

Since then Kaggle has emerged, which is really a great idea. What they provide is a platform to host competitions like the Netflix Prize, but for any company. This idea has arrived right at its time, in my opinion, and I believe it has a tremendous amount to offer the world. Kaggle recently announced they will be hosting the Heritage Health Prize, which I have referred to in an earlier post. This is a $3 million USD prize which I believe should make a meaningful difference to health care, and I suspect will cause a tipping point in increased focus on predictive analytics in this space.

For a number of years, the seductively named Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining has held an annual contest - the KDD Cup: http://www.sigkdd.org/kddcup/

An organization will typically donate some interesting data, and then data scientists from around the world will descend on it, and see how well they can do at building predictive models. In 2009 ORANGE (the seventh largest telecom operator in the world) put up some of their data about customer behavior. The objective was to estimate the churn, appetency and up-selling probability of their customers around the world.

I was recently asked how 11Ants Model Builder would perform in such a competition. Which I thought was a fair and valid question - only I did not know the answer. The software has been designed to allow non-experts to have access to predictive analytics. So my thesis was that it would probably perform quite well, but not as well as an expert data scientist - let's say, in a field like this, 'average'.

So I thought about what a fair test would be. I am not a data scientist. So I decided I would have a go at entering this competition myself using 11Ants Model Builder.

I loaded the data into Excel and ensured it was formatted correctly; this took me about 15 minutes. (For anyone who is an expert: I did not apply any transformations to the data, I simply used the raw data.) Then I hit the button 'ANALYZE DATA & GENERATE MODELS'. Then I left it to run over the weekend.

The data mining competition has a separate data set to test predictions on (without telling you the real values). You then upload your predicted results from this test set, and they immediately score your results (how close your predictions were to the true values) and put you onto the leaderboard.
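For the curious, the mechanics of that step, stripped of any particular tool, amount to scoring the unlabeled test file with a trained model and writing the predictions out for upload. A hypothetical sketch, assuming a fitted scikit-learn-style classifier called model:

```python
# Scoring the contest's unlabeled test set and writing a file for upload.
# 'model' is assumed to be a fitted classifier; file names are hypothetical.
import pandas as pd

test_features = pd.read_csv("kdd_test_features.csv")
submission = pd.DataFrame({"prediction": model.predict_proba(test_features)[:, 1]})
submission.to_csv("upsell_predictions.csv", index=False)
```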

The result: this non-data scientist placed 743rd out of 3,668 results - putting me in the top 20% of the contestants in a global data mining competition.


This part will make most of your eyes glaze over, but in case anyone is interested in what the lift curve looked like on the up-sell model, this is how it looked. A lift chart shows how many times more likely a given customer is to be amenable to an up-sell than if we randomly selected one. So at a level of ten, it means we are ten times more likely to get a customer who will respond favorably to our up-sell offer. You can see that as we work our way through the 'list' the level of lift drops - this is because the statistical certainty reduces as we move further down the list.
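For anyone who wants to reproduce such a chart themselves, lift by decile can be computed in a few lines of Python: sort customers by predicted score, cut the sorted list into ten buckets, and divide each bucket's response rate by the overall rate. This is a generic sketch, not the 11Ants output.

```python
# Computing lift by decile: sort by predicted score, bucket into ten groups,
# and compare each group's response rate with the overall response rate.
import pandas as pd

def lift_by_decile(y_true, y_score):
    ranked = (pd.DataFrame({"actual": y_true, "score": y_score})
                .sort_values("score", ascending=False)
                .reset_index(drop=True))
    ranked["decile"] = ranked.index * 10 // len(ranked) + 1   # 1 = highest-scoring 10%
    overall_rate = ranked["actual"].mean()
    return ranked.groupby("decile")["actual"].mean() / overall_rate

# e.g. lift_by_decile(y_test, model.predict_proba(X_test)[:, 1])
```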


The major point I would want anyone to take away from this is that advanced predictive analytics is accessible today, and to just about anyone. Consider that Orange is a phone company with 52 billion euros of revenue in 2009, and you can be pretty sure that they have a talented and expensive predictive analytics department. We can also deduce that some of the data scientists in this competition may have managed to do better than Orange (speculation, but quite likely to be the fact).

We should all be quite encouraged by the fact that over the course of a weekend (actually only 15 minutes of that was work, the balance was my laptop processing away) a single individual with some inexpensive software can build a predictive model that ranks him in the top 20% of data scientists in the world - even though he is not a data-scientist!

Thursday, February 3, 2011

Predictive Analytics at a Global Computer Manufacturer

Imagine you are a global computer manufacturer. You may or may not own the factories that produce your hardware, but one thing that you do own for sure - which is a significantly more complex beast - is effectively one massive data factory.

Consider for a moment all the data being produced by your organization, for your organization, and being dumped into your organization's prolific data warehouse system(s):

  • Marketing Data
  • Quality Data
  • Efficiency Data
  • Sales Data
  • Returns Data
  • Web Generated Data
  • Call Center Generated Data
  • Analyst Data
  • Investor Relations Data
  • Social Media Data
  • Maintenance Plan Data

And so it goes on...this barely scratches the surface of the data generated, but let's use it as the starting basis to make some sort of a point, which has been a long time coming...

As this is a blog on predictive analytics, the question returns, as always, to how predictive analytics could be useful to such an organization, and why.

The overarching answer to the question 'why?' is likely something approximating: because our organization has developed a keen interest in looking forward, to what is about to happen, rather than just reporting what has happened in the past. This idea probably has general appeal to just about any sober executive; conceptually it is a great thought, after all.

But grandiose as it is, it doesn't really solve any specific business problem; rather, it is just the theme that unites the solving of hundreds, thousands, or millions of business problems. Therefore, it is worthwhile to consider 'why?' on a more micro level. So we will address this relative to the data types listed above (very superficially) to gain a flavor for what predictive analytics can achieve:

  • Marketing Data - Which prospects are most likely to respond to an offer for product x? Which prospects are likely to respond favorably to an up-sell? How can I predict which segment a customer falls into with only partial information about them?
  • Quality Data - What are the big predictors to quality problems? Given scenario a,b,c,d - what is the quality level I can expect at the end of the production line?
  • Efficiency Data - What are the big predictors of more efficient manufacturing lines? Given scenario a,b,c,d - what is the efficiency level I can expect at the end of the production line?
  • Sales Data - How much are we likely to sell this day, week, month, quarter? What are the big predictors of sales volume?
  • Returns Data - Exactly which order to which customer is most likely to get returned? (A scored list for each daily shipment, with a statistical probability of there being a return issue attributed to each shipment, so that interventions can be placed on those classified as high risk.) What are the big predictors of goods being returned to us?
  • Web Generated Data - Which behaviors on the website are most likely to lead to a purchase? Therefore which customers are more likely to purchase? Do we treat them differently?
  • Call Center Generated Data - Given variables a,b,c,d and e - what are my staffing requirements going to be?
  • Analyst Data - Which analysts are responsible for giving us the most coverage? How do I score analysts in terms of influence/importance?
  • Investor Relations Data - How do I score investor relations inquiries in terms of importance, so that the appropriate level of support can be provided to each inquiry?
  • Social Media Data - Are there any correlations between the fire hose of social media data being generated and business objectives? Can I predict future events from specific occurrences in the blogosphere?
  • Maintenance Plan Data - Which customers are most at risk of not renewing their maintenance plans? Which prospects are most likely to purchase maintenance plans? What are the biggest drivers of people not renewing? Which customers should I not aggressively offer maintenance plans to, as they are statistically unprofitable?

Most large organizations, and global computer manufacturers are no exception, have a dedicated predictive analytics department. Most frequently it is tucked away in a deep dark corner of the organization. They are usually tooled up with software from companies like IBM/SPSS or SAS - both great companies. The people in this department are clever people, and well trained in the software (which is what it requires - this software cannot be learned in a quick tutorial; try days to weeks). These people are also constantly troubled by the morons in the business who want to try some new predictive analytics project. Like any problem of resource allocation, if your issue is not considered an organizational priority it will likely not get done - even if it would help your specific department immeasurably. That is if you even know that this predictive analytics capability exists within your organization - trust me, most people don't.

Generally it's fair to say that the performance of predictive analytics takes on the aura of alchemy - something you most certainly don't want to try at home, and most probably are too intimidated to even try at work... A regular business analyst dare not even consider it. So you either convince the predictive analytics department to do it, or put it in the 'too hard' basket.

At 11Ants Analytics we've been busy painting a new future, it looks something like this:

'Put the analytics capabilities into the hands of the people most motivated to solve the problem'

Imagine a world where anyone who can use Excel can perform advanced predictive analytics. The capability becomes widely deployed to subject matter experts (e.g. marketing, quality, etc.) throughout an organization, even if they know nothing about data mining and statistics.

That is what 11Ants Model Builder delivers - it is an incredibly easy-to-use Excel add-in that allows anyone to practice advanced predictive analytics, and be up and running within minutes. Science fiction no more - go check it out.

Wednesday, February 2, 2011

Predictive Analytics World - March 14-15, 2011

A great conference for anyone interested in predictive analytics is Predictive Analytics World. The next one is coming up in San Francisco, March 14-15, 2011. The organizers have been good enough to offer a 15% discount on the two-day conference pass to readers of my blog: http://www.predictiveanalyticsworld.com/sanfrancisco/2011/ To take this offer up, use the following code when registering: TF11ABP.

Tuesday, February 1, 2011

Predictive Analytics for Border Protection

Imagine that you’ve been given responsibility for protecting a nation’s borders.  Your job may be any or all of the following: stop bad people from entering, stop illegal substances from entering, stop weapons from entering, or it may be to ensure that importers pay their fair share of customs duties. 

Whatever the objective, you will face a similar problem: how to block the most bad guys, with minimum disruption to the good guys. The minimum disruption to the good guys is important – both from a goodwill and from a cost standpoint – no government in the world has unlimited resources. Blocking the bad guys is obviously also very important - the more time you waste on good guys, the less likely you are to get an extra bad guy.

Predictive analytics hold massive potential to assist customs services, border protection agencies, and homeland security agencies to more efficiently sort through the ever increasing amount of noise thrown at them and isolate those individuals/shipments that should be given special attention.

Let's look at an example of determining which incoming shipments are at risk of having under-declared customs duties. You can equally apply this to any other type of inspection task:

How is this done now? Depending upon where in the world you are, the answer is largely a combination of 'gut feel', rules-based instructions, and/or random checking. Depending upon the experience of the staff, an implicit understanding develops as to which shipments are perceived to be of greater risk. However, this experience is not evenly distributed – an inspector with 20 years of experience is by definition likely to have a better 'gut feel' than one in her second week on the job.

Fortunately, what many customs services around the world do have is a massive database recording every shipment that has ever entered their jurisdiction, along with every known case of under-declared customs duties. Accordingly, they possess a very valuable (yet under-utilized) asset: 1) lots of examples of 'events' (in this case, under-declared customs duties) and 2) lots of potential inputs (or predictors) to help predict that event - the potential inputs are all the other information we have about the shipment; the bill of lading alone provides a plethora of such information (e.g. country of origin, ship name, port of loading, etc.).

Though not many people automatically think of it like this, if correctly analyzed, this database holds the cumulative knowledge of every single customs officer that has ever found an under-declared shipment in the history of data gathering within that organization. The problem is that this is literally impossible for a human to synthesize and therefore exploit - which is why not many people think of it in this way; the capability to do so is relatively new.

So imagine if we could take all this historical data, throw some sophisticated predictive machine learning algorithms at it, and start analyzing that data for patterns which can help predict which shipments are likely to be under-declared. The emergent patterns are described in a 'predictive model'.

The next step is to run our database of today's incoming shipments against this model. The model will apply a score to every shipment; the score is a measure of risk – the higher the score, the more statistically likely the shipment is to have under-declared its customs duties. The output would be a list sorted by score, ranking the incoming shipments from most at risk to least at risk. Our inspection officers start at the top of the list (not the bottom, or the middle) and work their way down. We can even statistically generate the optimal point in the list to stop.
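In code, the daily scoring step might look something like the sketch below, assuming a previously trained classifier and a file of today's bill-of-lading records; the model, file, and column names are all hypothetical.

```python
# Scoring today's shipments with a previously trained risk model and handing
# inspectors a ranked work list. 'model' and all names here are hypothetical.
import pandas as pd

todays_shipments = pd.read_csv("shipments_today.csv")     # one row per shipment, bill-of-lading fields

features = todays_shipments.drop(columns=["shipment_id"])
todays_shipments["risk_score"] = model.predict_proba(features)[:, 1]

work_list = todays_shipments.sort_values("risk_score", ascending=False)
to_inspect = work_list.head(200)    # or cut off at a statistically chosen score threshold
```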

This is not science fiction, but quite achievable. In some work along these lines that I was involved with at 11Ants Analytics, a computer-picked candidate for inspection was three times more likely to require inspection than a randomly selected one. The ultimate implementation of such a system would continue learning all the time, and continue having inspectors' knowledge fed into it.

We can equally apply this principle to the profiling of individuals, dangerous shipments, or many other things – the opportunities are massive. All we require is: 1) an understanding of what would be useful for us to predict and 2) ancillary data relating to the examples, which we can interrogate for some form of correlation.

If you are in a customs agency anywhere in the world and interested in learning more about this type of application, please don't hesitate to drop me an email.

Thursday, January 20, 2011

Increasing Customer Retention at Health Clubs with Predictive Analytics

A major problem in the health club industry is customer retention - it may well be the industry's single largest issue. Hence the constant aggressive push to get customers signed up and in the front door at a rate faster than they are exiting out the back door. I have seen figures showing that as many as 40% of customers churn at the average health club; regardless of the exact numbers, it is a known fact in the industry that the figure is higher than any health club manager wants it to be, and obviously if it can be reduced, it adds directly to the club's bottom line.

Equally, plenty of members renew their memberships year in, year out. Accordingly, any customer retention strategy should involve two key components: 1) identifying those customers at risk of leaving and 2) targeting those at risk with appropriate interventions.

It is beyond the scope of this blog (or my expertise!) to go into intervention methods. However, I would like to briefly discuss the identification of at-risk customers - which is where predictive analytics comes in.

Like all businesses, health clubs have limited resources, and it is absolutely pointless for them to invest resources to try and retain each and every customer when a good deal of them are not at risk in the first place. If a customer is identified as 'at risk' there is a strong business case to be built around investing resources in trying to retain that specific customer (theoretically you could afford to invest up to $1 less than the cost of acquiring a new customer, and still be ahead of the game); conversely, if they are not 'at risk' and are going to re-sign anyway, you may just as well burn the money as hand it over to that specific customer in the form of an incentive or time invested in chasing them.

The other consideration is that it is far easier to actively try to retain 2,000 customers than 4,000 customers, so segmenting, and making the size of the job more manageable, makes it more likely that a health club will do something - and if we know nothing else, we know that doing something is usually better than doing nothing.

So we have a clear business case for identifying which customers are most at risk of churning. Our next mission then, would be to take our database of current members and identify which ones are 'at risk' and which ones are 'loyal'. Ideally we would take it one step further than this, and be able to rank our whole customer database in rank order from those statistically 'most at risk' to those 'least at risk'. The benefit of doing this, is that it provides our sales/retention staff with a sequenced work list, which they would start at the top of and work their way down. This simple act in itself would give us comfort that our resources are being focused on those that most require them. This can even be taken one step further, and we can - again using statistical methods - determine the statistically optimal place in the list to stop.

Though we have a business case, and a reasonably clear vision of what would be useful, the problem is that for most health clubs the scenario I have outlined above is closer to science fiction than something they perceive they can practically deploy within their club. So the status quo prevails: 1) do nothing, 2) treat all customers as equally at risk, or 3) do some random, haphazard interventions with no real science behind who is targeted and who is not.

In conclusion, let's discuss how we can take this utopian vision and turn it into an actionable reality. Ironically, for many health clubs this vision can be actualized in less time than it took me to write this blog post - literally.

Most health clubs have a reasonable amount of data about their members. Let's imagine that we have all the data about every member of our club for the last five years, lined up in an Excel spreadsheet. Every row is a unique member, and every column is information we know about that member. We call these input columns, and they would be things like: her age, her marital status, number of visits in January 2010, number of visits in January 2009, payment method, number of address changes, average time she spends in the health club, and so on - it would be no problem to have 100 or even 500 columns. In the very last column (our target column) we add a label, 'loyal' or 'at risk'. Anybody who previously left our club is labeled 'at risk' and anybody who re-signed is labeled 'loyal'. We would eliminate from the spreadsheet anyone who had not been with us a year yet, as we don't know what they are likely to do.

Now I will skip over the math here, which nobody would want to try at home, but you can take it on good authority that there are patterns within all the input columns that can help to predict a customer's propensity to churn (as you would well expect). A human cannot detect these patterns, but there are software applications that can. Once the patterns are defined, the software can look for them in this year's (or this month's) customers and output exactly the ranked list mentioned previously, complete with the optimal point in the list to stop making interventions.
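For the technically curious, here is a rough sketch of what that pattern-finding looks like under the hood, using the spreadsheet layout described above (one row per member, input columns, and a final 'status' column labelled 'loyal' or 'at risk'). The file and column names are hypothetical, and this is a generic scikit-learn sketch rather than what 11Ants Model Builder does internally.

```python
# Training a churn model from the member spreadsheet described above and
# ranking current members by risk. File and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

history = pd.read_excel("member_history.xlsx")            # past members, labelled 'loyal' / 'at risk'
X = pd.get_dummies(history.drop(columns="status"))        # one-hot encode categorical inputs
y = (history["status"] == "at risk").astype(int)

model = GradientBoostingClassifier().fit(X, y)

current = pd.read_excel("current_members.xlsx")           # members we want to score today
X_current = pd.get_dummies(current).reindex(columns=X.columns, fill_value=0)
ranked = (current.assign(churn_risk=model.predict_proba(X_current)[:, 1])
                 .sort_values("churn_risk", ascending=False))   # work list: highest risk first
```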

I would encourage anybody that is interested to visit www.11AntsAnalytics.com and take a look at the QuickStart tutorial video about 11Ants Model Builder on the home page, which will better show the process (the data is different, but it won't require much imagination for it all to make perfect sense). Feel free to email me if you have questions about this - doing this sort of thing is ten times easier than most people imagine.

Although this post has been focused on customer retention in health clubs, identical principles apply to any subscription-based or long-term service business (telcos, cable television companies, banks, insurance companies, etc.).