A Gentle Introduction to Random Forests

I can still picture exactly where I was the first time I heard someone suggest implementing a Random Forest as a modeling technique. We had hired a contractor to help with a particular modeling problem, and I literally laughed out loud and asked the individual if he was serious. Hearing it for the first time, the word “random” certainly does not do much to instill confidence in predictive accuracy. “How would something random add any sort of value?” I wondered. However, I later learned that Random Forests have proven to be an extremely effective modeling technique, able to protect against correlated variables and bias.

In this post, I will provide context around what Random Forests are and what value they bring to business.

 

Overview of Random Forests

Random Forest models are based on the Decision Tree modeling technique, which relies on splits of the data rather than on linear correlation. Developed by Breiman (2001), the algorithm follows a Bagging technique (incidentally, also developed by Breiman several years prior) except that, in addition to randomizing bootstrapped samples of the data, Random Forest also takes random samples of the predictors (Figure 1).

 

Figure 1: Evolution of Random Forests

 

In Figure 1, notice how there is a single tree for the CART model. The next evolution, Bagging, employs multiple trees, each based on a bootstrapped sample of the data (James et al., 2014). We refer to this as ensemble modeling because we use multiple models simultaneously to determine a prediction for a single observation (Seni & Elder, 2010). Ensemble modeling has often proven to yield more accurate results, at the expense of model interpretability. In Bagging, notice how the top of each tree is generally the same “Important Predictor.” This leads to correlated trees. The correlation can be addressed by introducing a random factor (called perturbation) that selects only a subset of the predictors with each bootstrap sample. Random Forest, another ensemble method, employs this approach. In the end, these ensemble techniques combine all of their models and produce a composite score or prediction (for example, through voting) for each observation.
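To make the progression in Figure 1 concrete, here is a minimal sketch in Python using scikit-learn on a synthetic dataset; the data, tree counts, and other settings are purely illustrative, not a recipe. The only new ingredient as we move from Bagging to Random Forest is max_features, which limits each tree to a random subset of the predictors when it searches for splits.

```python
# A minimal sketch (Python / scikit-learn, synthetic data) of the evolution in
# Figure 1: a single tree (CART) -> bagged trees -> a Random Forest that also
# randomly samples the predictors. All settings here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real modeling dataset: 20 predictors, 5 informative.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=42)

# 1) CART: one decision tree fit to the full data.
cart = DecisionTreeClassifier(random_state=42)

# 2) Bagging: many trees, each fit to a bootstrapped sample of the rows
#    (BaggingClassifier uses a decision tree as its default base model).
bagging = BaggingClassifier(n_estimators=200, random_state=42)

# 3) Random Forest: bootstrapped rows AND a random subset of predictors
#    considered for each split (max_features).
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)

for name, model in [("CART", cart), ("Bagging", bagging), ("Random Forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:13s} cross-validated accuracy: {scores.mean():.3f}")
```

On a toy dataset the accuracy gap between the three may be small, but the structure of the comparison mirrors Figure 1.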

While they do operate under the guise of a black box, Random Forests do leave us a few minor clues as to what’s going on under the hood. Most statistics packages can produce “variable importance” plots once a model is fit. These plots allow us to see which variables are most “interesting,” but they don’t explain why those variables are interesting, nor do they give a correlation sign. Also, if needed, we can generally extract a few of the actual “trees” (or splits) from within the model, but since there are typically so many trees, closely reviewing a handful of them would not be helpful and, in fact, may be misleading.
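For the curious, here is one possible sketch of those clues in Python with scikit-learn on synthetic data; the feature names are hypothetical and the importances will of course differ on real data.

```python
# Sketch of the "minor clues" a fitted Random Forest leaves behind
# (Python / scikit-learn, synthetic data; feature names are hypothetical).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Variable importance: which predictors were "interesting" -- but not why,
# and with no correlation sign attached.
ranked = sorted(zip(feature_names, forest.feature_importances_), key=lambda p: -p[1])
for name, score in ranked[:5]:
    print(f"{name}: importance {score:.3f}")

# Any single tree can be pulled out and examined, but with hundreds of trees,
# reading a handful of them closely says little about the ensemble as a whole.
one_tree = forest.estimators_[0]
print(f"Tree 1 of {len(forest.estimators_)} has depth {one_tree.get_depth()}")
```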

 

Value of Random Forests

 

The value we realize from Random Forests is that they protect against correlated variables and give each predictor more of a chance to be recognized in the model, rather than being overshadowed by a few strong or greedy predictor variables. This is awesome when there is high multicollinearity or a large number of predictors. Overall, these additions lead to greater predictive accuracy (Seni & Elder, 2010). The downside of the Random Forest model is that it is not readily interpretable to the analyst or the business. It is very difficult to peel back the covers and determine “why” a particular observation was classified in a particular manner. The business must learn to “trust” the model through cross-validation and constant monitoring of model performance.
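As a sketch of what “trusting through validation” can look like in practice, here is one possible check in Python with scikit-learn; the synthetic data, the split, and the metric (AUC) are assumptions for illustration, not a prescribed monitoring plan.

```python
# One possible "trust but verify" check (Python / scikit-learn, synthetic data):
# score the forest on data it never saw, and repeat the check as new results arrive.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, n_informative=8, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Holdout performance stands in for "will it generalize?"; in production, the
# same score would be tracked campaign over campaign to catch drift.
auc = roc_auc_score(y_holdout, forest.predict_proba(X_holdout)[:, 1])
print(f"Holdout AUC: {auc:.3f}")
```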

 

 

For more Random Forest fun (I can tell you are hardly able to contain your excitement), head on over to this other author’s blog post or this one for more “gentle” conversations about your new favorite landmark in Statisticsland, the Random Forest.

 

 

 

References

Breiman, L. (2001). Random Forests, random features. Berkeley: University of California.

Chapman, C. & Feit, E. (2015). R for Marketing Research and Analytics. Switzerland: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning. New York: Springer.

Seni, G. & Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions. Morgan & Claypool.

Economics of CRM Modeling

Here are several conversations I have had with many marketing leaders; maybe you can relate?

 

Conversation 1

Marketing Manager: If we get $2 in incremental sales per customer mailed, we should be able to increase our list by 10,000 and then make an additional $20,000 in incremental sales, right?

Me: Customer response and incremental sales are not linear!

Conversation 2

Marketing Manager: If you build a model, it will drive incremental sales, right?

Me: We’re already mailing nearly all of our customers.

Marketing Manager: Right, but if you build a model, it will drive more sales, though, right?

Me: No, that’s not quite how it works…just a sec…here (comes back with handy-dandy economic CRM diagram)

Figure: Net contribution economic diagram

 

Here’s the PDF, for people who like PDFs.

I built this little diagram to bring all the key metrics into a single place, which I can then use to demonstrate the relationships between cost, file size, incremental sales, ROI, and, even more importantly, the NET CONTRIBUTION. Here, I define [NET CONTRIBUTION] = [Incremental Sales] – [Cost].

Many leaders get caught up in ROI, but again, ROI is not a linear thing. Once the model is built, if we invest more by going deeper into the mail file, we cannot expect to maintain the same level of ROI. ROI might be fantastic for the first decile, but the marketing manager also needs to consider, “Will I get enough volume of incremental sales from this tactic at this depth to even make the ROI worth it?”

I like this graph because it also emphasizes that there exists some “arbitrary incremental sales ceiling.” In other words, if we mail EVERYONE in our entire CRM database, we will probably generate a lot of incremental sales, but it will probably come at an even GREATER expense…which is why net contribution (and ROI) is nearly zero at the full size of the file.
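To make the shape of the diagram concrete, here is a quick back-of-the-envelope sketch in Python. Every number in it (cost per piece, decile-level response) is made up purely for illustration; the point is the pattern: ROI keeps falling as we mail deeper, while net contribution rises, peaks, and drifts back toward zero at the full file.

```python
# Back-of-the-envelope illustration (all numbers are made up): cumulative
# incremental sales, cost, net contribution, and ROI as we mail deeper into
# a hypothetical 100,000-customer file, best responders first.
cost_per_piece = 1.00
decile_size = 10_000
sales_per_customer = [3.50, 2.20, 1.40, 1.05, 0.80, 0.50, 0.30, 0.15, 0.07, 0.03]

cum_sales = cum_cost = 0.0
print("Depth   Incr. sales       Cost   Net contrib     ROI")
for decile, s in enumerate(sales_per_customer, start=1):
    cum_sales += s * decile_size
    cum_cost += cost_per_piece * decile_size
    net = cum_sales - cum_cost          # [Net Contribution] = [Incremental Sales] - [Cost]
    roi = net / cum_cost
    print(f"{decile * 10:4d}%   ${cum_sales:>10,.0f}   ${cum_cost:>7,.0f}   ${net:>9,.0f}   {roi:5.0%}")
```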

The goal of marketing analytics, then, is to optimize for maximum net contribution. Maximum ROI will likely come from mailing only a few records, which translates to minimal incremental sales, and maximum incremental sales will come from mailing the whole file (which will probably also carry a very high cost). So, maximum net contribution is where the focus should be.
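Continuing with the same made-up response curve from the sketch above, a few more lines show why those three “optima” land at very different circulation depths.

```python
# Same made-up response curve as above: find the depth that maximizes ROI,
# incremental sales, and net contribution, respectively.
cost_per_piece = 1.00
decile_size = 10_000
sales_per_customer = [3.50, 2.20, 1.40, 1.05, 0.80, 0.50, 0.30, 0.15, 0.07, 0.03]

rows, cum_sales, cum_cost = [], 0.0, 0.0
for decile, s in enumerate(sales_per_customer, start=1):
    cum_sales += s * decile_size
    cum_cost += cost_per_piece * decile_size
    rows.append({"depth": decile * 10, "sales": cum_sales,
                 "net": cum_sales - cum_cost, "roi": (cum_sales - cum_cost) / cum_cost})

print("Max ROI at depth:              ", max(rows, key=lambda r: r["roi"])["depth"], "% (tiny volume)")
print("Max incremental sales at depth:", max(rows, key=lambda r: r["sales"])["depth"], "% (whole file, high cost)")
print("Max net contribution at depth: ", max(rows, key=lambda r: r["net"])["depth"], "% (the target)")
```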

Once our “customer selection model” is established, we can use other models that “choose offers, choose times, or choose creative” to send to our targeted customers in order to improve response or incremental sales for that group. For example, these types of models could include offer or product propensity models, lifecycle modeling, or creative/behavioral segmentation. In other words, a customer selection model that chooses “who” to target won’t necessarily increase incremental sales from one campaign to the next (if circulation depth is held constant),* but an offer, timing, or creative model might be able to improve incremental sales because the “right message” and “right time” principles are addressed.

 

*Hey, I know there’s an argument to be made for improving models for customer selection and uplift modeling, etc. (which would boost incremental sales slightly on the same circulation), but that’s another discussion, mm-kay?