Statistical & Financial Consulting by Stanford PhD

Random Forest is a machine learning method for regression and classification. It combines many decision trees to produce an overall verdict. As such, random forest qualifies as an ensemble method. In the regression setting regression trees are used, while in the classification setting classification trees are used.

Random forest utilizes the idea of bagging: create many bootstrap samples of the data, grow a tree on each of them and average the results. There is one nuance: executing bagging in a straightforward manner, with no tweaks, may lead to high correlation among the trees. For example, imagine a large data set where classification is to be performed. The data set contains one strong binary predictor, one medium-effect binary predictor and one weak binary predictor. Any two bootstrap samples share a substantial fraction of the observations and, in that case, are likely to generate exactly the same tree: every tree splits on the same variables in the same order, each terminal node is classified to its most prevalent class, and by the law of large numbers that class is the same in every tree. The example is idealized, of course. In practice we would deal with many more variables.

Averaging highly correlated trees may result in high variance of the average and, therefore, high overall predictive error. To mitigate the issue, random forests go one step beyond bagging. They de-correlate the trees by considering only a subset of the variables at each split of a node. Typically, the number of candidate predictors m is much smaller than the total number of variables J. It can be viewed as a parameter to be fine-tuned via cross-validation or other procedures. The starting values recommended by several references are m = Int(J / 3) for regression and m = Int(sqrt(J)) for classification. As Hastie et al. (2008) point out, the suggested starting values are not always optimal, since the optimal m tends to decrease with the fraction of relevant predictors in the sample.
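These starting values are easy to encode. A minimal sketch (the function name starting_m is ours, not from any library):

```python
import math

def starting_m(J, task):
    """Recommended starting value for m, the number of candidate
    predictors considered at each split, given J total variables."""
    if task == "regression":
        return max(1, int(J / 3))         # m = Int(J / 3)
    return max(1, int(math.sqrt(J)))      # m = Int(sqrt(J))
```

For J = 12 variables this gives m = 4 in regression and m = 3 in classification; either value would then be fine-tuned, for example by cross-validation.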

Let n be the sample size. The random forest algorithm is simple:

        1. For i = 1, ..., B:

             1.1 Draw a sample with replacement of size n (bootstrap sample).

             1.2 Using the generated sample, grow a tree by repeating the following steps until
                   the minimum node size has been reached.

                  1.2.1 Randomly select m variables out of the complete list of J variables.

                  1.2.2 Regression setting: find the variable / split-point which allows for the biggest
                          decrease in the root mean square error, mean absolute error or a similar metric.
                          Classification setting: find the variable / split-point which allows for the biggest
                          decrease in the misclassification rate, Gini index or a similar metric.

                  1.2.3 Split the node into two daughter nodes.

        2. Regression setting: calculate the model as the average of the grown trees.
            Classification setting: calculate the model as
               A) the vote based on the average class probabilities implied by the grown trees, or
               B) the majority vote of the grown trees (equivalent to averaging the 0/1 vote indicators
                    of the grown trees).
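The steps above can be sketched from scratch in the regression setting. This is a toy illustration under our own naming, not production code: squared error plays the role of the split criterion in step 1.2.2, and the trees are grown down to the minimum node size without pruning.

```python
import numpy as np

def grow_tree(X, y, m, min_size, rng):
    # Grow a regression tree recursively; at each node consider only
    # m randomly chosen variables (step 1.2.1) and pick the split
    # that most reduces the squared error (step 1.2.2).
    if len(y) <= min_size or np.all(y == y[0]):
        return float(y.mean())                  # terminal node: predict the mean
    best = None
    for j in rng.choice(X.shape[1], size=m, replace=False):
        for s in np.unique(X[:, j])[:-1]:       # candidate split points
            left = X[:, j] <= s
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, s, left)
    if best is None:
        return float(y.mean())
    _, j, s, left = best                        # step 1.2.3: split the node
    return (j, s,
            grow_tree(X[left], y[left], m, min_size, rng),
            grow_tree(X[~left], y[~left], m, min_size, rng))

def predict_tree(tree, x):
    while not isinstance(tree, float):
        j, s, lo, hi = tree
        tree = lo if x[j] <= s else hi
    return tree

def random_forest(X, y, B=25, m=None, min_size=2, seed=0):
    rng = np.random.default_rng(seed)
    n, J = X.shape
    m = m or max(1, J // 3)                     # regression default m = Int(J / 3)
    trees = []
    for _ in range(B):                          # step 1: one tree per bootstrap sample
        idx = rng.integers(0, n, size=n)        # step 1.1: sample with replacement
        trees.append(grow_tree(X[idx], y[idx], m, min_size, rng))
    return trees

def predict_forest(trees, x):
    # step 2 (regression): average the grown trees
    return float(np.mean([predict_tree(t, x) for t in trees]))
```

A classification forest would differ only in the split criterion (e.g. Gini index) and in step 2, where the trees vote instead of being averaged.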

Note that random forests allow for deep trees if the minimum node size is set small. This is in contrast with stand-alone decision trees, which are first grown to the fullest and then pruned to a smaller version according to a certain regularization criterion.

Random forest allows for 1) asymptotically unbiased estimation of predictive error, 2) testing the potential explanatory power of a particular variable, 3) comparing the explanatory power of different variables. First, we take each data point and predict the dependent variable using the random forest based on bootstrap samples not containing the given data point. We record the so-called out-of-bag (OOB) error as the prediction minus the truth (regression) or the indicator of discrepancy between the prediction and the truth (classification). The absolute values of OOB errors are averaged over the data points. This average converges to the true expected predictive error as B and the sample size converge to infinity.
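To make the out-of-bag bookkeeping concrete, here is a sketch in which the bootstrap-sample mean stands in for a full tree; the point is the OOB mechanics, not the tree growing.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=200)
n, B = len(y), 500

preds = np.zeros(n)
counts = np.zeros(n)
for _ in range(B):
    idx = rng.integers(0, n, size=n)        # bootstrap sample
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True
    fit = y[idx].mean()                     # the "tree" grown on this sample
    preds[~in_bag] += fit                   # contribute only to OOB points
    counts[~in_bag] += 1

oob_pred = preds / counts                   # OOB prediction for each data point
oob_error = np.abs(oob_pred - y).mean()     # average absolute OOB error
```

Each observation is predicted only by the "trees" whose bootstrap samples missed it, and the absolute OOB errors are averaged exactly as described above.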

Second, we assess the importance of each variable present in the estimated random forest. For each tree containing the given variable, we collect the corresponding out-of-bag observations (observations not included in the bootstrap sample used to grow the tree). We randomly permute the variable on the out-of-bag observations and calculate the resulting increase in the predictive error of the tree in question. The increase is aggregated over all the trees containing the variable; this is the out-of-bag estimate of the variable's importance. We can then test via bootstrap whether the importance is bigger than 0, and whether two variables differ in importance.
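A related computation is available in scikit-learn (assumed installed here). Note that sklearn's permutation_importance permutes each variable on whatever data it is given, in this sketch the training set, rather than on per-tree out-of-bag samples, so it is an approximation of the procedure described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)  # only column 0 matters

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Permute each column in turn and measure the drop in the score;
# permuting the relevant column should hurt far more than
# permuting the irrelevant ones.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
```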

Like gradient boosting, random forests have three important properties.

  • They draw their robustness from the robustness of the underlying decision trees: if a certain predictor has nothing to do with the dependent variable, it is unlikely to be utilized substantially in the estimated model.

  • The granular nature of random forests allows them to fit highly non-linear and discontinuous surfaces.

  • Random forests require relatively little fine-tuning. There are few specifications to vary and even those are optimized in a straightforward manner.

All of the above makes random forest one of the most competitive "off the shelf" methods. "Off the shelf" means precisely what the third property implies: random forests require little courting, flowers and holding hands. They are ready to run relatively quickly, and their accuracy is comparable to, and oftentimes better than, that of more whimsical methods like deep neural networks.


Breiman, L. (2001). Random Forests. Machine Learning, 45 (1): 5–32.

Breiman, L., & Ghahramani, Z. (2004). Consistency for a Simple Model of Random Forests. Department of Statistics, University of California Berkeley. Technical Report.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

Efron, B., & Hastie, T. (2017). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.

Zhu, R., Zeng, D., & Kosorok, M. R. (2015). Reinforcement Learning Trees. Journal of the American Statistical Association, 110 (512): 1770–1784.

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63: 3–42.

Burkov, A. (2019). The Hundred-Page Machine Learning Book. Amazon Australia Services, Inc.