A Gentle Introduction to Random Forests

I can still picture exactly where I was the first time I heard someone suggest implementing a Random Forest as a modeling technique. We had hired a contractor to help with a particular modeling problem, and I literally laughed out loud and asked the individual if he was serious. Hearing it for the first time, the word “random” certainly does not do much to instill confidence in predictive accuracy. “How would something random add any sort of value?” I wondered. However, I later learned that Random Forests have proven to be an extremely effective modeling technique, able to protect against correlated variables and bias.

In this post, I will provide context around what Random Forests are and what value they bring to business.


Overview of Random Forests

Random Forest models are based on the Decision Tree modeling technique, which relies on splits of the data rather than linear correlation. Developed by Breiman (2001), the algorithm follows the Bagging technique (coincidentally, also developed by Breiman several years prior), except that in addition to drawing bootstrapped samples of the data, Random Forest also draws a random subset of the predictors to consider at each split (Figure 1).


Figure 1: Evolution of Random Forests


In Figure 1, notice how there is a single tree for the CART model. The next evolution, Bagging, employs multiple trees based on bootstrapped samples of data (James et al., 2014). We refer to this as ensemble modeling, because we use multiple models simultaneously to determine a prediction for a single observation (Seni & Elder, 2010). Ensemble modeling often yields more accurate results, at the expense of model interpretability. In Bagging, notice how the top of each tree is generally the same “Important Predictor.” This leads to correlated trees. The correlation can be addressed by implementing a random factor (called perturbation), which considers only a random subset of predictors at each split. Random Forest, another ensemble method, employs this approach. In the end, these ensemble techniques combine all of their models and produce a composite score or prediction (for example, through voting) for each observation.
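To make the two sources of randomness concrete, here is a minimal sketch in plain Python. It is a toy, not how any statistics package implements Random Forests: each “tree” is a one-split stump, so picking a predictor subset per tree is the same as picking one per split. The names fit_stump, fit_forest, and predict_forest, and the tiny data set, are my own assumptions for illustration.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label (the 'vote')."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """Find the single best split among the allowed features,
    scored by the number of misclassified training rows."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: predict one constant label
        m = majority(labels)
        return (None, None, m, m)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:
        return l_lab
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        # Randomness 1 (Bagging): bootstrap sample of rows, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Randomness 2 (Random Forest): random subset of predictors.
        feats = rng.sample(all_features, n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Every stump votes; the majority wins."""
    return majority(predict_stump(s, row) for s in forest)

# Toy data: two predictors, both related to the label.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50),
        (6, 60), (7, 70), (8, 80), (9, 90), (10, 100)]
labels = ["a"] * 5 + ["b"] * 5
forest = fit_forest(rows, labels)
print(predict_forest(forest, (1, 10)), predict_forest(forest, (10, 100)))
```

Setting n_features to the full number of predictors turns this back into plain Bagging, which is a handy way to see that the predictor subset is the only thing Random Forest adds.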

While they do operate as something of a black box, Random Forests leave us a few clues as to what’s going on under the hood. In statistics packages, there are generally some “variable importance” plots that can be produced once a model is fit. These plots allow us to see which variables are most “interesting” but don’t necessarily explain why they’re interesting or even give the direction of the relationship. Also, if needed, we can generally extract a few actual “trees” or splits from within the model, but since there are generally so many trees, reviewing only a handful of them closely would not be helpful and, in fact, may be misleading.
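One common way to compute a “variable importance” score is permutation importance: shuffle one predictor at a time and see how much accuracy drops. The sketch below is a plain-Python illustration of that idea under my own assumptions; the function names and the stand-in model (a hard-coded rule rather than a fitted forest) are hypothetical, and real packages may compute importance differently.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows the model classifies correctly."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Average drop in accuracy when one predictor column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with the shuffled value in position j.
            shuffled = [r[:j] + (v,) + r[j + 1:]
                        for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Stand-in "model" that only looks at predictor 0; predictor 1 is ignored.
def predict(row):
    return "b" if row[0] > 5 else "a"

rows = [(1, 7), (2, 3), (3, 9), (4, 1), (5, 6),
        (6, 2), (7, 8), (8, 4), (9, 10), (10, 5)]
labels = ["a"] * 5 + ["b"] * 5
imps = permutation_importance(predict, rows, labels)
print(imps)
```

Notice that the score says predictor 0 matters and predictor 1 does not, but it says nothing about whether larger values push toward “a” or “b”. That is exactly the limitation described above.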


Value of Random Forests


The value we realize from Random Forests is that they protect against correlated variables and give each predictor more of a chance to be recognized in the model, rather than being overshadowed by a few strong or greedy predictor variables. This is awesome when there is high multicollinearity or a large number of predictors. Overall, these additions tend to lead to greater predictive accuracy (Seni & Elder, 2010). The downside of the Random Forest model is that it is not readily interpretable to the analyst or the business. It will be very difficult to peel back the covers and determine “why” a particular observation was classified the way it was. The business must learn to “trust” the model through cross-validation and constant monitoring of model performance.
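For readers who have not seen cross-validation before, here is a rough sketch of the k-fold version in plain Python. The fit and predict callables are placeholders I made up for illustration (a majority-class baseline stands in for a real Random Forest); in practice you would plug in your package's own model and likely use its built-in cross-validation helper.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, predict, rows, labels, k=5, seed=0):
    """Train on k-1 folds, score accuracy on the held-out fold, repeat."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Placeholder model: always predict the most common training label.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

rows = [(i,) for i in range(10)]
labels = ["a"] * 8 + ["b"] * 2
scores = cross_validate(fit_majority, predict_majority, rows, labels)
print(scores, sum(scores) / len(scores))
```

The point of the exercise is that every score comes from rows the model never saw during training, which is what lets the business build trust in a model it cannot read directly.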



For more Random Forest fun (I can tell you are hardly able to contain your excitement), head on over to this other author’s blog post and one more for more “gentle” conversations regarding your new favorite landmark in Statisticsland, the Random Forest.





Breiman, L. (2001). Random Forests, random features. Berkeley: University of California. 1.1, 4.4.

Chapman, C. & Feit, E. (2015). R for Marketing Research and Analytics. Switzerland: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning. New York: Springer.

Seni, G. & Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions. Morgan & Claypool.

Machine Learning is not Artificial Intelligence

Remember the first time you heard the words “Big Data?” Well, there’s a new buzzword in town — “Machine Learning.”

Ok, when I say “Machine Learning,” what happens in your mind? What images have I conjured by saying “Machine Learning?” Maybe, you saw a brief shadow of a floating, intelligent, robotic metal squid, or a flying Keanu Reeves? Maybe, you heard the name “Ah-nold” or “I’ll be back” with occasional lasers flashing in the distance.

Well, I’m sorry to say that I’m here to burst your bubble. Pop! There it goes… When we discuss it within the context of statistics and analytics, Machine Learning is NOT the same thing as Artificial Intelligence.

Machine Learning isn’t even a super simple, intuitive approach to data modeling and analytics. Machine Learning basically has to do with the fact that technology has finally come far enough to allow computers to apply brute-force methods and build predictive models that were not possible 30 or even 15 years ago. You may have already heard of many Machine Learning algorithms, for example: Decision Trees, Neural Networks, Gradient Boosting, GenIQ, and even K-means clustering. Many analytical tools, such as Python and R, already support these modeling techniques. The scikit-learn package in Python offers a great tutorial on Decision Trees.

Ultimately, what I want you to walk away with is that, when we talk about statistics and analytics, Machine Learning isn’t some super-fancy, futuristic process that will enlighten all of your analytics capabilities. It is actually a set of functionality that already exists and can be drawn upon to create predictive models using heavy computer processing.

If you’re interested in learning more, I recommend the book “Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data” by Bruce Ratner. He talks about many of these techniques, what they are used for, and how to avoid pitfalls.