I recently attended a SAS course called “Data Mining Techniques: Theory and Practice”. This course was taught by Michael Berry of Data Miners, Inc. Michael and his colleague Gordon Linoff wrote some well-known books, including the one I received a copy of as part of the training, Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. This was my first introduction to actual data mining theory, and the course was a great place to start. The rest of this post is an overview of the key takeaways that I expect to apply to my work, though there is a lot more information than I can include here. I have not put any of these techniques into action yet, so feel free to add a comment if I say something incorrect. Also, look forward to more posts about my real experiences in prepping data for data mining and, hopefully, creating some models, at least for my own practice.
The techniques discussed in the training were Regression, Decision Trees, Neural Networks, Memory-Based Reasoning, Clustering, Survival Analysis, Market-Basket Analysis, Link Analysis, and Genetic Algorithms. One of the really interesting components of the training was discussing the difficulty in defining the right business question and then getting the correct transformed data set together. I hope to post more in the future on some common transformations that happen in prepping data, but for now I will discuss the techniques I expect to be most common in my type of work: Regression and Decision Trees, along with some data exploration done by Clustering.
Regression models are one of the most common techniques, and the concept is familiar to people who have worked with statistics and analytics. Basically, you are building a model that forms a best-fit line for the initial data set used in development (called the “training” data). Once this best-fit line is created, you can predict future values by passing the known attributes (such as month and product type) into the linear equation and reading off the target value at that point on the line. With regression models you want a small number of input variables. You can determine these by choosing known key factors, by adding variables one at a time based on which one performs best on its own, or by using a decision tree to determine which variables are most predictive. Regression models work really well with a continuous target variable, such as a dollar amount or percentage, rather than a binary result, such as ‘Returning Customer’ vs ‘One Time Purchaser’. Some of the advantages of this method over others are that regression is usually the easiest model to understand, can usually be modeled without complex data mining software, and finds global patterns based on only a few inputs.
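To make the best-fit line idea concrete, here is a minimal sketch in plain Python of a one-variable least-squares fit. The function names and the monthly sales numbers are made up for illustration and are not from the course; a real model would usually have more than one input and would be built with a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Read the target value off the fitted line at input x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: month number vs. sales (in thousands).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.1, 15.9, 18.2, 19.8]

model = fit_line(months, sales)
forecast = predict(model, 6)  # predicted sales for month 6
print(model, forecast)
```

The continuous target (a dollar amount) is what makes a straight line a natural fit here; a yes/no target would call for a different kind of model.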
Decision trees are rules set up in a tree structure, similar to a data flow diagram. One major advantage of decision trees is that you can visualize which attributes are used to determine the target value. Decision trees can also be used for data exploration, even if the final model does not use a decision tree for predictive purposes. This technique works very well if there are multiple paths to the same target value. A key thing to remember when using decision trees is that once you determine the best model on your training data, you need to run a separate validation data set against the model. The validation set will let you find specific rules that were strong in your training data but not strong in the validation set. Once you have run the model with your validation data, you want to prune the tree so that the final decision point is a good fit for both the training and validation data. This way you aren’t building a predictive model that really only works for the initial data set. Also be careful to normalize values that might change frequently so that your model will continue to work in the future (this concept applies to all techniques). It’s also important to make sure each leaf of your tree is correct for a reasonable number of records. In most cases, don’t allow leaves that represent only 5 records in your training data, since there is probably a less granular level that can give you better predictive results. Decision trees are best when the target variable is a classification, and they have the benefit of being able to work with a large number of input variables.
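The validation and pruning check described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the three-leaf tree, the field names, and the thresholds (`min_records`, `max_drop`) are hypothetical, and real tools prune with more careful criteria. The idea is simply to flag leaves that are too small or that look much weaker on validation data than on training data.

```python
from collections import defaultdict


# Hypothetical three-leaf tree; rules and labels are invented for illustration.
def assign_leaf(customer):
    if customer["visits"] >= 3:
        return "frequent"
    if customer["spend"] > 100:
        return "big_spender"
    return "other"


LEAF_PREDICTION = {
    "frequent": "returning",
    "big_spender": "returning",
    "other": "one_time",
}


def leaf_stats(records):
    """Return {leaf: [record_count, correct_count]} for labeled records."""
    stats = defaultdict(lambda: [0, 0])
    for customer, label in records:
        leaf = assign_leaf(customer)
        stats[leaf][0] += 1
        stats[leaf][1] += int(LEAF_PREDICTION[leaf] == label)
    return stats


def leaves_to_prune(train, valid, min_records=5, max_drop=0.2):
    """Flag leaves that are tiny or much less accurate on validation data."""
    train_stats = leaf_stats(train)
    valid_stats = leaf_stats(valid)
    flagged = []
    for leaf, (n, correct) in train_stats.items():
        train_acc = correct / n
        v_n, v_correct = valid_stats.get(leaf, (0, 0))
        valid_acc = v_correct / v_n if v_n else 0.0
        if n < min_records or train_acc - valid_acc > max_drop:
            flagged.append(leaf)
    return sorted(flagged)


frequent = {"visits": 5, "spend": 40}
big = {"visits": 1, "spend": 150}
other = {"visits": 1, "spend": 20}

# The "big_spender" rule looks strong in training but not in validation.
train = ([(frequent, "returning")] * 6
         + [(big, "returning")] * 5 + [(big, "one_time")]
         + [(other, "one_time")] * 6)
valid = ([(frequent, "returning")] * 4
         + [(big, "returning")] + [(big, "one_time")] * 3
         + [(other, "one_time")] * 4)

flagged = leaves_to_prune(train, valid)
print(flagged)  # ['big_spender']
```

A flagged leaf is a candidate for being merged back into its parent, so the tree ends at a decision point that holds up on both data sets.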
Another useful technique is clustering. There are a variety of algorithms that can be used, but the concept is fairly simple. Clustering is a form of undirected data mining where you are not specifying a target variable but simply looking for groups of records with similar characteristics. It is a good way to take a large data set and break it down into relevant segments. Sometimes the segments that are created are best used as input variables for a different type of model (such as a regression or decision tree model), but you can tell a lot just by looking at the means of the numeric values in each segment. So while clustering is a very useful technique, it doesn’t normally lead directly to a predictive model. It will, however, tend to lead to a lot more questions about how the segments vary.
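As one example of a clustering algorithm, here is a small k-means sketch in plain Python. K-means is my choice for illustration, not necessarily what the course emphasized, and the customer data and starting centroids are made up. The resulting centroids are the per-segment means mentioned above.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest squared distance to p.
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # New centroid = mean of each dimension; keep the old one if empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (orders per year, average order value).
customers = [(2, 20.0), (3, 25.0), (1, 22.0),
             (20, 200.0), (22, 210.0), (18, 190.0)]

# Starting centroids are picked by hand here to keep the run deterministic.
centroids, clusters = kmeans(customers, [customers[0], customers[3]])
for center, members in zip(centroids, clusters):
    print(center, len(members))
```

Note that there is no target variable anywhere in this code. In practice the inputs would usually be scaled first so that one large-valued column (like order value here) does not dominate the distance.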
A couple of other techniques that I recommend reading about further (and that I plan to try out in the future) are survival analysis and market-basket analysis. Survival analysis is a way to model time-to-event problems, usually customer-focused questions such as how long it will take for a customer to leave or stop making purchases. Market-basket analysis (or association rules) relates to which behaviors tend to occur together. The basic concept is answering the retail question “Which items are purchased in the same transaction?” One way association rules may be useful in my work is to help define data quality checks based on pairings that are highly unlikely.
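A rough sketch of the market-basket idea, using plain Python and a made-up set of transactions: for each pair of items, count how often they appear together (support) and how often the second item shows up given the first (confidence). The function name and data are hypothetical, and real tools handle larger item sets and much bigger data.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions):
    """Return {(a, b): (support, confidence)} for the rule 'a implies b'."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]

rules = pair_rules(baskets)
print(rules[("butter", "bread")])  # (0.5, 1.0)
```

For the data quality idea, the same counts could be read the other way around: a pairing with very low support in historical data might be worth flagging when it shows up in new records.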
Hopefully this is a good overview of what I think will be useful. In the future I should be posting with a more detailed understanding after I have created a few practice models, most likely with Microsoft Data Mining tools.