A Gentle Introduction to Random Forests

I can still picture exactly where I was the first time I heard someone suggest implementing a Random Forest as a modeling technique. We had hired a contractor to help with a particular modeling problem, and I literally laughed out loud and asked the individual if he was serious. Hearing it for the first time, the word “random” certainly does not do much to instill confidence in predictive accuracy. “How would something random add any sort of value?” I wondered. However, I later learned that Random Forests have proven to be an extremely effective modeling technique, able to protect against correlated variables and bias.

In this post, I will provide some context around what Random Forests are and the value they bring to a business.

 

Overview of Random Forests

Random Forest models are based on the Decision Tree modeling technique, which relies on splits of the data rather than linear correlation. Developed by Breiman (2001), the algorithm follows the Bagging technique (also developed by Breiman several years prior), except that in addition to drawing bootstrapped samples of the data, Random Forest also randomly samples a subset of the predictors to consider at each split (Figure 1).

 

Figure 1: Evolution of Random Forests

In Figure 1, notice how there is a single tree for the CART model. The next evolution, Bagging, employs multiple trees, each built on a bootstrapped sample of the data (James et al., 2014). We refer to this as ensemble modeling, because we use multiple models simultaneously to determine a prediction for a single observation (Seni & Elder, 2010). Ensemble modeling has proven to often yield more accurate results, though at the expense of model interpretability. In Bagging, notice how the top of each tree is generally the same “Important Predictor.” This leads to correlated trees. The correlation can be addressed by introducing a random factor (called perturbation) that considers only a subset of the predictors at each split. Random Forest, another ensemble method, employs this approach. In the end, these ensemble techniques combine all of their models and produce a composite score or prediction for each observation (for example, through voting).
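To make those two layers of randomness concrete, here is a minimal, hand-rolled sketch of the idea (not the canonical implementation): each tree gets a bootstrapped sample of the rows, each split considers only a random subset of the predictors, and the trees vote. The dataset, the number of trees, and the use of scikit-learn's DecisionTreeClassifier for the actual splitting are all illustrative assumptions.

```python
# A hand-rolled sketch of the Random Forest recipe: bootstrapped rows per tree,
# a random subset of predictors at each split, and a vote across all trees.
# Dataset, tree count, and settings are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)

trees = []
for i in range(100):
    # Randomness #1: a bootstrapped sample of the rows (drawn with replacement).
    rows = rng.integers(0, len(X), size=len(X))
    # Randomness #2: max_features="sqrt" makes the tree consider only a random
    # subset of the predictors when choosing each split (the "perturbation").
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[rows], y[rows]))

# The composite prediction for each observation is a simple majority vote.
votes = np.stack([tree.predict(X) for tree in trees])
ensemble_prediction = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble training accuracy:", (ensemble_prediction == y).mean())
```

In practice, of course, you would reach for a library implementation such as scikit-learn's RandomForestClassifier, which wraps this whole recipe (and much more) in a single estimator.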

While they do operate under the guise of a black box, Random Forests do leave us a few clues as to what’s going on under the hood. Most statistics packages offer “variable importance” plots that can be conjured once a model is fit. These plots let us see which variables are most “interesting,” but they don’t explain why those variables are interesting or even give the sign of the relationship. Also, if needed, we can generally extract a few of the actual “trees,” or splits, from within the model, but since there are usually hundreds of trees, closely reviewing a handful of them would not be helpful and, in fact, may be misleading.
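As a small sketch of those clues, the snippet below fits a forest and pulls out both a variable-importance ranking and the top of one individual tree; the choice of scikit-learn and its built-in breast-cancer dataset is purely an assumption for illustration.

```python
# Peeking under the hood: variable-importance scores and one individual tree.
# The dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(data.data, data.target)

# Variable importance: which predictors the forest found "interesting".
# Note that these scores carry no sign or direction of effect.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")

# One of the 300 underlying trees, truncated to its first two levels --
# interesting to glance at, but not representative of the forest as a whole.
print(export_text(forest.estimators_[0],
                  feature_names=list(data.feature_names), max_depth=2))
```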

 

Value of Random Forests

 

The value we realize from Random Forests is that they protect against correlated variables and give each predictor more of a chance to be recognized in the model, rather than being overshadowed by a few strong or greedy predictor variables. This is awesome when there is high multicollinearity or a large number of predictors. Overall, these additions lead to greater predictive accuracy (Seni & Elder, 2010). The downside of the Random Forest model is that it is not readily interpretable to the analyst or the business. It will be very difficult to peel back the covers and determine why a particular observation was classified in such a manner. The business must learn to “trust” the model through cross-validation and constant monitoring of model performance.
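As a rough illustration of that trade-off, the sketch below cross-validates a single tree, a Bagging ensemble, and a Random Forest on a synthetic dataset that deliberately contains redundant (highly correlated) predictors. The data, hyperparameters, and any accuracy ordering you observe are illustrative assumptions rather than a guaranteed result.

```python
# Comparing CART, Bagging, and Random Forest by cross-validation on data with
# deliberately correlated (redundant) predictors. Settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 5 informative predictors plus 10 redundant ones that are linear
# combinations of them (i.e., strong multicollinearity).
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           n_redundant=10, random_state=1)

models = {
    "CART (single tree)": DecisionTreeClassifier(random_state=1),
    "Bagging (of trees)": BaggingClassifier(n_estimators=200, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=200,
                                            max_features="sqrt",
                                            random_state=1),
}

# Cross-validation is also how the business learns to "trust" the black box.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean CV accuracy: {scores.mean():.3f}")
```

Random Forest implementations also typically expose an out-of-bag estimate (for example, oob_score=True in scikit-learn), which scores each tree on the rows left out of its bootstrap sample and provides a convenient ongoing performance check.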


For more Random Forest fun (I can tell you are hardly able to contain your excitement), head on over to this other author’s blog post, and then one more, for further “gentle” conversations regarding your new favorite landmark in Statisticsland, the Random Forest.


References

Breiman, L. (2001). Random Forests, random features. Berkeley: University of California.

Chapman, C. & Feit, E. (2015). R for Marketing Research and Analytics. Switzerland: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning. New York: Springer.

Seni, G. & Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions. Morgan & Claypool.