I did a series of blog posts on different machine learning techniques recently, which sparked a lot of interest. You can see part 1, part 2, and part 3 if you want to learn about classification, clustering, regression, and so on.
In that series I was careful to differentiate between a general technique and a specific algorithm like decision trees. Classification, for example, is a general technique used to identify members of a known class like fraudulent transactions, bananas, or high value customers. Read this machine learning post if you need a refresher or are wondering quite what bananas have to do with machine learning.
But once you’ve decided that your problem is addressable with classification, you still need to pick the right algorithm. And there are many to choose from. So I’m going to take a few of them and give a basic explanation, just like for techniques in the previous series. Starting with … decision trees.
What is a Decision Tree?
Have you played Twenty Questions? The goal is to guess what object the “answerer” is thinking of by asking a series of questions that can only be answered by “yes” or “no”. If you’re playing well, then each answer helps you ask a more specific question until you get the right answer. You can think of that set of questions as a decision tree that guides you to more specific questions and ultimately the answer.
Or imagine you are calling a rather large company and end up talking to their “intelligent computerized assistant,” pressing 1 then 6, then 7, then entering your account number, mother’s maiden name, the number of your house before pressing 3, 5 and 2 and reaching a harried human being. You may think that you were caught in voicemail hell, but the company you called was just using a decision tree to get you to the right person.
So that’s what decision trees look like in real life. Let’s look at them in a machine learning context. Let’s imagine that we are going to use a decision tree algorithm to classify our customers by likelihood of churning (leaving us for a competitor).
How Does a Decision Tree Work in Machine Learning, Exactly?
Since we’re working with a specific algorithm we need to have some understanding of what the data looks like. We’ll start with something simple: a small table of customer data like the one described below, and train a model on that.
We’ll work with just five columns of data, also referred to as attributes or variables:
- The customer number identifies the customer and is unique to that customer.
- Age is an integer.
- Income is an integer rounded to the nearest hundred.
- Gender is a single letter.
- Churned identifies class members; in the example row, the customer has churned, meaning she has recently stopped being a customer.
I only show one row, but presumably we’re a large company and there are many thousands or even millions of rows to work with. Initially we’ll work with those middle three columns as predictors and see if we can use age, income, and gender to predict churn. Simplifying only slightly, we can now just point an algorithm at this data set and it will churn out an answer. Let’s look behind the scenes at how it does that.
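The post doesn’t name a specific library, but as a minimal sketch of “pointing an algorithm at this data set,” here is what that might look like with scikit-learn’s `DecisionTreeClassifier` on a handful of made-up customer records (the values and encoding are purely illustrative):

```python
# A minimal sketch (with made-up customer records) of training a
# decision tree on age, income, and gender to predict churn.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: [age, income, gender], with gender
# encoded as 0 = F, 1 = M (tree implementations want numeric inputs).
X = [
    [25, 28000, 0],
    [52, 94000, 1],
    [31, 41500, 0],
    [47, 129500, 1],
    [22, 30000, 0],
    [58, 88000, 1],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = churned, 0 = stayed

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Score a new, unseen customer.
print(model.predict([[29, 33000, 0]]))
```

Note that the customer ID column is deliberately left out of the predictors, for reasons covered in the data preparation section below.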
The first step is to find the predictor that will give the best initial split of the data. Finding this involves testing all the predictors and their different values. Imagine that after this calculation, it’s discovered that age is the best initial predictor, with individuals 37 and under being more likely to churn than individuals over 37. So now the algorithm can split the data set into two parts, ideally two roughly equal parts. We now have three nodes on the tree: one root node and two leaf nodes.
At this point we can repeat the process on each of these leaf nodes. Perhaps age is again the best predictor for the under-37s (under 27 more likely to churn than over 27) while gender or income becomes a better predictor for those 37 and older. Now we have seven nodes on the tree, four of which are leaf nodes. And associated with each node is a percentage: 65% of the customers in this group churned, 13% in this group churned and so on.
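How does the algorithm decide that one split is “best”? A common approach (one of several; the post doesn’t specify which its tool uses) is to pick the split that minimizes weighted Gini impurity. Here is a small, self-contained sketch of scoring candidate age thresholds on hypothetical data:

```python
# How an algorithm scores a candidate split: lower weighted Gini
# impurity means the split separates churners from non-churners better.
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = pure, 0.5 = mixed 50/50)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of churners (label 1)
    return 2 * p * (1 - p)

def split_score(ages, labels, threshold):
    """Weighted Gini impurity after splitting on age <= threshold."""
    left = [l for a, l in zip(ages, labels) if a <= threshold]
    right = [l for a, l in zip(ages, labels) if a > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

ages   = [22, 25, 31, 47, 52, 58]
labels = [1,  1,  1,  0,  0,  0]   # 1 = churned

# Trying each candidate threshold: 37 scores best here because it
# separates the two classes perfectly (weighted impurity 0.0).
for t in (25, 37, 50):
    print(t, split_score(ages, labels, t))
```

The algorithm repeats this scoring for every predictor and every candidate threshold, then splits on the winner.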
In fact, here is an actual diagram of part of a decision tree. (We cut off some of the deeper nodes.) Take a look at node nine on the right-hand side of the third row. It contains customers with income > 129,500, though the actual income levels are not shown on this particular diagram. The model predicts that members of this group will not churn.
The confidence in this prediction is 85.2% and it’s supported by 223 customers or 17.5% of the original 1274 customers. On the other side of the diagram, node four shows the first prediction of churn, in this case for customers with incomes between 41,500 and 89,500. One more split to node 14 shows a prediction of churn with 100% confidence.
We could repeat this process until we run out of data to split, but it’s useful to set some criteria for stopping. One or more of the following would do:
- We reach a predetermined number of levels in the tree like seven
- A node has fewer than 10 (pick a number) records
- A node has fewer than 5% (pick a number) of the original data set’s records
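Stopping criteria like the three above usually map directly onto tree hyperparameters. As an illustration (the parameter names are scikit-learn’s; other libraries expose similar knobs), the same three rules might look like this:

```python
# The stopping rules above expressed as tree hyperparameters
# (scikit-learn names; the specific numbers are arbitrary choices).
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=7,           # stop after a predetermined number of levels
    min_samples_leaf=10,   # every leaf must keep at least 10 records
    min_samples_split=20,  # don't split a node with fewer than 20 records
    random_state=0,
)
# min_samples_leaf also accepts a fraction, e.g. 0.05 for "at least 5%
# of the original data set's records".
print(model.get_params()["max_depth"])
```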
Interpreting the Results from a Decision Tree
Now that we’ve finished building the decision tree, let’s see what we’ve got. Each split gives us two nodes, so if we stopped at seven levels, then we’ve got 128 leaf nodes in the tree. (This assumes, of course, that no nodes terminate prematurely due to insufficient data, which in fact did happen in that diagram above.)
If we look at one of those nodes in detail we might find a leaf node that groups women aged 27 to 29 with incomes between 28,000 and 34,000 as being 70% likely to churn. And we’d have 127 more nodes like that with different characteristics. Think of these leaf nodes as segmenting our customer base in 128 different ways. If the head of marketing has a limited budget to prevent churn, then this exercise tells them where to focus scarce funds with a targeted message to maximize impact.
Making a More Accurate Decision Tree
But before we spend any money, we’d like to know how good this model is. If you recall the process from that previous machine learning techniques article, you’ll know that I didn’t use all of the training data to create the initial model. I kept back 20% to 40% that I can use to test it.
Now I can use that “held-aside” data to classify customers the model hasn’t seen before to assess how accurate the model actually is. (If you’d like a little more insight on this process, go search for “confusion matrix”). Assuming it’s accurate enough, now the marketing group can start spending money.
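A sketch of that hold-out evaluation, using synthetic data and scikit-learn’s `train_test_split` (both the data and the 70/30 split ratio are illustrative assumptions):

```python
# Hold back a portion of the data, train on the rest, then measure
# accuracy on the records the model has never seen.
import random
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic customers: [age, income]; churn here is driven by age < 37.
random.seed(0)
X = [[random.randint(20, 60), random.randint(200, 1500) * 100]
     for _ in range(200)]
y = [1 if age < 37 else 0 for age, _ in X]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("held-aside accuracy:", accuracy)
# The confusion matrix breaks the errors down by type.
print(confusion_matrix(y_test, model.predict(X_test)))
```

Because the synthetic churn rule is so simple, accuracy comes out near perfect here; real customer data would be far messier.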
But what if it’s not accurate? What can we do? There are lots of possibilities and lots of things to look out for, including picking a different algorithm. But assuming that a decision tree is appropriate here, let’s look at the data the algorithm worked on.
First, age, income, and gender may not be particularly good predictors of churn. I just picked those three to make the example simple. Machine learning isn’t magic. If you’ve got data that’s weakly correlated with what you are trying to predict, then you need better data. Perhaps you should run a graph analysis on your customer base to flag if a given customer is connected closely to people who churn, and then go back and build a new model. Maybe there are other predictors you can add: number of calls to support, dropped service, changed usage patterns, time to renewal and so on.
In fact, it would not be unusual to have thousands of columns potentially available instead of the five I started with above in this simplified example. And just as the algorithm could find the best predictor with three columns of data, it could do the same with thousands, though of course that will take more time and CPU resources.
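A useful side effect of fitting a tree over many columns is that the model reports how much each predictor actually contributed to the splits, which helps separate strong predictors from noise. A sketch with scikit-learn’s `feature_importances_` on hypothetical data where only one of eleven columns matters:

```python
# Fit a tree over many columns and rank them by importance
# (hypothetical data: ten noise columns plus one predictive one).
import random
from sklearn.tree import DecisionTreeClassifier

random.seed(1)
# Eleven columns per row; only column 10 actually determines churn.
X = [[random.random() for _ in range(11)] for _ in range(300)]
y = [1 if row[10] > 0.5 else 0 for row in X]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

ranked = sorted(enumerate(model.feature_importances_),
                key=lambda pair: -pair[1])
print("most important column:", ranked[0][0])
```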
Data Preparation for Decision Trees
Preparing data could be a whole topic in its own right, so I’ll just pull out a few items of interest that arise from this particular example. Notice that I didn’t suggest training the model using customer ID, even though it was the first attribute in the table of data. Those identifiers are unique to each individual. I could create a model that was very accurate at identifying churn but that would be based on the specific identifiers in the training data.
Introduce new customers with different IDs and the model would be useless because it had never seen any of those IDs before. It would be like training a model to predict people’s incomes given social security number. I’ll get 100% accuracy for the social security numbers in the training data, and 0% accuracy on any other numbers. Such models don’t generalize.
The opposite problem is a predictor that is more or less constant, with the same value in nearly every row. If I’m working with women’s clothing, for example, you’d expect most customers to be female. (That was most, not all, making that column mostly constant.) In that case, gender may be a less useful attribute for this model.
If you’re a data engineer helping to prepare datasets for machine learning, you need to work closely with the data science team to understand what they need. In addition to the points above, you’d also want to make sure that your sample is random and not biased (don’t pick all the new customers from July that were captured via a highly targeted campaign), unless that’s the specific requirement. Also look out for attributes with too many missing values (suppose we didn’t ask customers for income, so only have that information for 5% of them).
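The checks above can be automated. Here is a small, self-contained sketch of a per-column report that flags the three problems discussed in this section: near-unique ID-like columns, near-constant columns, and mostly-missing columns (the threshold values are arbitrary choices, not established rules):

```python
# Quick data-quality checks before handing a column to the model.
def column_report(values, name):
    """Return (name, flags) for a column, where flags lists problems."""
    present = [v for v in values if v is not None]
    missing_ratio = 1 - len(present) / len(values)
    distinct_ratio = len(set(present)) / len(present) if present else 0.0
    flags = []
    if distinct_ratio > 0.95:
        flags.append("near-unique (likely an ID; won't generalize)")
    if distinct_ratio < 0.05:
        flags.append("near-constant (little predictive value)")
    if missing_ratio > 0.5:
        flags.append("mostly missing")
    return name, flags

# Every ID is distinct -> flagged as near-unique.
print(column_report([101, 102, 103, 104], "customer_id"))
# 100 values but only two distinct ones -> flagged as near-constant.
print(column_report(["F", "F", "F", "M"] * 25, "gender"))
# Three of four values missing -> flagged as mostly missing.
print(column_report([28000, None, None, None], "income"))
```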
Where Did That Machine Learning Answer Come From?
In closing, there’s one last characteristic of this algorithm that might be useful. Somebody could look at the model and ask: “These customers are 80% likely to churn? Why is that?” You can ask the scoring engine to return the decision tree “rule” and provide a good answer to that question. That’s not always the case with other algorithms.
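As an illustration of extracting such a rule, scikit-learn can print a fitted tree as human-readable if/else conditions via `export_text` (toy data again; a real scoring engine would expose something similar):

```python
# Sketch of asking the model "why?": print the fitted tree as rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy customers: [age, income]; 1 = churned.
X = [[25, 30000], [29, 33000], [31, 41000],
     [45, 90000], [52, 94000], [58, 88000]]
y = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
rules = export_text(model, feature_names=["age", "income"])
print(rules)
```

The output is an indented set of threshold tests ending in class predictions, which is exactly the kind of answer a “why did you predict churn for this group?” question needs.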
As use of ML and AI becomes more widespread, knowing why a particular prediction or answer was generated can be important, if not essential. For example, regulators might want to know why a particular loan application was rejected, and an organization would want to understand and correct a pattern of discrimination that crept in due to the operation of a particular algorithm.
Decision trees are relatively transparent and humans can interpret their “thought process”. The same cannot be said about neural networks, for example, which from a human perspective operate in a pretty opaque manner.
If you'd like to try out building a data lake and using machine learning on the data, Oracle offers a free trial. Register today to see what you can do.
Source: Oracle Big Data Blog posts