Random Forest is a machine learning method for regression and classification. It combines many decision trees to produce an overall prediction and therefore qualifies as an ensemble method. In the regression setting regression trees are used, while in the classification setting classification trees are used.
Random forest builds on the idea of bagging: create many bootstrap samples of the data, grow a tree on each of them and average the results. There is one nuance: executing bagging in a straightforward manner, with no tweaks, may lead to high correlation among the trees. For example, imagine a large data set on which classification is to be performed, containing one strong binary predictor, one medium-effect binary predictor and one weak binary predictor. Any two bootstrap samples share a large fraction of the data and are therefore likely to generate exactly the same tree: since each terminal node is classified to its most prevalent class, the law of large numbers leads to the same splits and the same class assignments in every tree. The example is idealized, of course; in practice we would deal with many more variables.
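To make the bagging step concrete, here is a minimal sketch, assuming scikit-learn is available and using a synthetic data set in place of real data. It shows plain bagging only: bootstrap samples, one unconstrained tree per sample, averaged predictions, without the variable subsampling discussed below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data; in practice X, y come from the problem at hand.
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

rng = np.random.default_rng(0)
B = 100                                  # number of bootstrap samples / trees
n = X.shape[0]
trees = []

for _ in range(B):
    idx = rng.integers(0, n, size=n)     # draw n rows with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Bagged prediction: average the individual trees' predictions.
y_hat = np.mean([t.predict(X) for t in trees], axis=0)
```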
Averaging highly correlated trees may result in high variance of the average and, therefore, high overall predictive error. To mitigate the issue, random forests go one step beyond bagging: they de-correlate the trees by considering only a random subset of the variables at each split of a node. Typically, the number of candidate predictors m is much smaller than the total number of variables J. It can be viewed as a parameter to be fine-tuned via cross-validation or other procedures. The starting values recommended by several references are m = Int(J / 3) for regression and m = Int(J^0.5) for classification. As Hastie et al. (2008) point out, the suggested starting values are not always optimal, since the optimal m tends to decrease as the fraction of relevant predictors in the sample increases.
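In scikit-learn the number of candidate predictors per split is controlled by the max_features argument. The sketch below illustrates the Int(J^0.5) starting value for a classification problem and a small cross-validated search around it; the grid values and data set are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)
J = X.shape[1]

# Starting value m = Int(J^0.5) for classification; Int(J / 3) would be the
# regression analogue (max_features=J // 3 on RandomForestRegressor).
m_start = int(np.sqrt(J))

# Fine-tune m around the starting value via cross-validation.
grid = {"max_features": [max(1, m_start - 2), m_start, m_start + 2, J // 3]}
search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```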
Let n be the sample size. The random forest algorithm is simple:
1. For i = 1, ..., B:
   1.1 Draw a sample of size n with replacement (a bootstrap sample).
   1.2 Using the generated sample, grow a tree by repeating the following steps until the minimum node size has been reached:
      1.2.1 Randomly select m variables out of the complete list of J variables.
      1.2.2 Regression setting: find the variable / split point which allows for the biggest decrease in the root mean square error, mean absolute error or a similar metric. Classification setting: find the variable / split point which allows for the biggest decrease in the misclassification rate, Gini index or a similar metric.
      1.2.3 Split the node into two daughter nodes.
2. Regression setting: calculate the model as the average of the grown trees. Classification setting: calculate the model as A) the vote based on the average class probabilities implied by the grown trees, or B) the majority vote of the grown trees (equivalent to averaging the 0/1 vote indicators of the grown trees).
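For the regression setting, the steps above can be sketched as follows, assuming that growing a single tree with m candidate variables per split may be delegated to scikit-learn's DecisionTreeRegressor via its max_features argument. The function names (fit_random_forest, predict_random_forest) are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, B=200, m=None, min_node_size=5, seed=0):
    """Grow B trees on bootstrap samples, using m candidate variables per split."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    m = m if m is not None else max(1, J // 3)  # regression starting value Int(J / 3)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # step 1.1: bootstrap sample of size n
        tree = DecisionTreeRegressor(           # steps 1.2.1-1.2.3 are handled by the tree:
            max_features=m,                     #   m variables considered at each split,
            min_samples_leaf=min_node_size,     #   splitting stops at the minimum node size
        ).fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    # Step 2, regression setting: average of the grown trees.
    return np.mean([t.predict(X) for t in trees], axis=0)
```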
Note that random forests allow the trees to grow deep if the minimum node size is set small; the trees are not pruned. This is in contrast with stand-alone decision trees, which are typically grown fully and then pruned back to a smaller version according to a regularization criterion.
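As a small illustration of the contrast (parameter values are arbitrary): in scikit-learn the forest's trees are governed only by stopping rules such as min_samples_leaf, with no pruning step, whereas a stand-alone tree would typically be regularized, for example via cost-complexity pruning (ccp_alpha).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

# Random forest: deep, unpruned trees (minimum node size of 1, no pruning step).
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=1, random_state=0).fit(X, y)

# Stand-alone tree: grown fully, then cut back via a cost-complexity penalty.
tree = DecisionTreeRegressor(ccp_alpha=5.0, random_state=0).fit(X, y)

print(max(t.get_depth() for t in forest.estimators_), tree.get_depth())
```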
Random forest allows for 1) asymptotically unbiased estimation of the predictive error, 2) testing the potential explanatory power of a particular variable, and 3) comparing the explanatory power of different variables. First, we take each data point and predict the dependent variable using only the trees grown on bootstrap samples not containing that data point. We record the so-called out-of-bag (OOB) error as the prediction minus the truth (regression) or the indicator of discrepancy between the prediction and the truth (classification). The absolute values of the OOB errors are averaged over the data points. This average converges to the true expected predictive error as B and the sample size converge to infinity.
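A sketch of the first point, assuming scikit-learn's built-in out-of-bag bookkeeping: with oob_score=True, the attribute oob_prediction_ holds, for each observation, the average over only those trees whose bootstrap samples did not contain it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=0)

forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

# OOB errors: prediction minus truth, where each prediction uses only trees
# grown on bootstrap samples not containing the given data point.
oob_errors = forest.oob_prediction_ - y
print("mean absolute OOB error:", np.mean(np.abs(oob_errors)))
```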
Second, we assess the importance of each variable present in the estimated random forest. For each tree containing the given variable, we collect the corresponding out-of-bag observations (observations not included in the bootstrap sample used to grow the tree). We randomly permute the variable on the out-of-bag observations and calculate the resulting increase in the predictive error of the tree in question. The increase in predictive error is then averaged over all the trees containing the variable; this is the out-of-bag estimate of the variable importance. We can test whether it is bigger than zero via the bootstrap, and we can likewise compare the estimated importance of two different variables, testing the difference via the bootstrap.
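The sketch below uses scikit-learn's permutation_importance, which permutes each variable on a held-out set and records the drop in the forest's score. This is a close relative of, not identical to, the per-tree out-of-bag scheme described above, and a bootstrap test of whether an importance exceeds zero would be layered on top of these estimates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Permute each variable on held-out data and record the drop in accuracy.
result = permutation_importance(forest, X_test, y_test, n_repeats=30, random_state=0)
for j in result.importances_mean.argsort()[::-1][:5]:
    print(f"x{j}: {result.importances_mean[j]:.4f} +/- {result.importances_std[j]:.4f}")
```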
In common with gradient boosting, random forests have three important properties.