Statistical & Financial Consulting by Stanford PhD

Bagging (bootstrap aggregating) is a protocol for building ensemble methods for regression and classification. It says: estimate a given model on bootstrap samples of the given data and "average" the estimates. In the regression setting "averaging" is understood in the direct sense of this word. In the classification setting "averaging" is understood as averaging the estimated probabilities of the classes or averaging the 0/1 vote indicators of the classes (which is the same as trusting the majority vote). In any case we build an ensemble, where verdicts from many models are combined into one.
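The protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner `fit_1nn` (one-nearest-neighbour on a single predictor), the helper names, and the simulated data are all hypothetical choices made here for brevity. Any function that maps a data set to a predictor could be passed in place of `fit_1nn`.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]


def bag(fit, data, n_models, seed=0):
    """Fit one model per bootstrap sample; `fit` maps a data set to a predictor."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict_regression(models, x):
    """Regression: average the numeric predictions of the ensemble."""
    return mean(m(x) for m in models)


def predict_classification(models, x):
    """Classification: majority vote over the predicted labels."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on (x, y) pairs."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict


# Simulated data (an assumption for illustration only).
rng = random.Random(1)
xs = [i / 10 for i in range(30)]
reg_data = [(x, 2.0 * x + rng.gauss(0, 0.3)) for x in xs]    # noisy line
clf_data = [(x, "high" if x > 1.5 else "low") for x in xs]   # two classes

reg_models = bag(fit_1nn, reg_data, n_models=200)
clf_models = bag(fit_1nn, clf_data, n_models=200)

reg_pred = predict_regression(reg_models, 1.5)       # averaged prediction
clf_pred = predict_classification(clf_models, 2.5)   # majority vote
print(reg_pred, clf_pred)
```

Averaging estimated class probabilities instead of votes would only require the base learner to return probabilities and `predict_regression` to be applied per class.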

The idea of bagging is to reduce the variance of the estimated model through averaging while keeping the bias the same. Let $f$ denote the true function, $\hat{f}$ the model estimated on the full sample, and $\hat{f}^{*b}$, $b = 1, \dots, B$, the models estimated on $B$ bootstrap samples. Since the bootstrapped random models are i.i.d. (conditionally on the observed data), the variance of the average model

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$

is $1/B$ of the variance of each bootstrapped model. Further, the expectation of the average model is the same as the expectation of each bootstrapped model. Since the overall estimation error of any estimator $\hat{f}$ is given by

$$\mathrm{E}\big[(\hat{f}(x) - f(x))^2\big] = \big(\mathrm{E}[\hat{f}(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big), \qquad (1)$$

the error goes down with $B$. One might ask: why do we not do this every day? Don't we want the second term in (1) to always go to 0 while keeping the first term constant? Why don't we bootstrap and average in any setting amenable to the bootstrap: linear regression, generalized linear models, survival analysis, etc.? The issue is subtle. Remember that we wanted to keep "the first term constant". Well, this is not quite possible. The bias

$$\big|\mathrm{E}[\hat{f}^{*b}(x)] - f(x)\big|$$

is, generally speaking, higher for each bootstrapped model than it is for the original model estimated on the whole sample. The original sample has more information. Imagine an idealized setting where the sample has all the information necessary to make the right decision (e.g. whether the probability of default is positive), but a given bootstrap sample may not be representative of the whole set of scenarios of the universe. This does not go the other way around: whatever the bootstrap sample knows, the full sample knows as well, albeit with different weights. In the example of assessing the possibility of a default, the inference based on the full sample is false with lower probability than the inference based on a bootstrap sample. In other words, the bias of $\hat{f}$ is lower than that of $\hat{f}^{*b}$. So bagging is not always more clever than the good old methods. In fact, it can be proven that with $B$ going to infinity, bagging is equivalent to Ordinary Least Squares in the standard linear regression setting.
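The closing remark about linear regression can be explored numerically. The sketch below is only an illustration under assumptions made here: the data are simulated from a hypothetical model y = 1 + 2x + noise, there is a single predictor, and the names `ols_fit` and `bagged_ols_predict` are invented for this example. It compares the bagged prediction at one point with the full-sample OLS prediction for a growing number of bootstrap samples.

```python
import random
from statistics import mean


def ols_fit(sample):
    """Least-squares line through (x, y) pairs; returns (intercept, slope)."""
    xs = [x for x, _ in sample]
    ys = [y for _, y in sample]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in sample)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bagged_ols_predict(data, x0, n_models, rng):
    """Average the OLS predictions at x0 over n_models bootstrap samples."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]  # resample with replacement
        intercept, slope = ols_fit(boot)
        preds.append(intercept + slope * x0)
    return mean(preds)


rng = random.Random(0)
# Simulated data (an assumption for illustration): y = 1 + 2x + Gaussian noise.
xs = [rng.random() for _ in range(50)]
data = [(x, 1.0 + 2.0 * x + rng.gauss(0, 0.5)) for x in xs]

x0 = 0.5
intercept, slope = ols_fit(data)
ols_pred = intercept + slope * x0

# Gap between the bagged prediction and plain OLS for several ensemble sizes.
gaps = {b: abs(bagged_ols_predict(data, x0, b, rng) - ols_pred)
        for b in (10, 100, 2000)}
for b, gap in gaps.items():
    print(f"B={b:5d}  |bagged - OLS| = {gap:.4f}")
```

On a single run the gap need not shrink monotonically, since each bagged prediction is itself random; the expectation is only that it becomes small for large ensembles.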

Bagging lends itself naturally to asymptotically unbiased estimation of the predictive error. To ensure unbiasedness we have to test the predictive performance of the model out of sample, not in sample. By design we can do this on the fly, while running the bagging algorithm. For each data point, we predict the dependent variable using the aggregate of the models estimated on the bootstrap samples not containing the given data point. We record the so-called out-of-bag (OOB) error as the prediction minus the truth (regression) or the indicator of discrepancy between the prediction and the truth (classification). The absolute values of the OOB errors are averaged over the data points. This average converges to the true expected predictive error as the number of bootstrap samples $B$ and the sample size converge to infinity.
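The out-of-bag procedure for regression can be sketched as follows. This is a minimal illustration, assuming a hypothetical base learner `fit_line` (a least-squares line on one predictor) and simulated data; the function name `oob_error` and its signature are choices made here, not part of any library.

```python
import random
from statistics import mean


def fit_line(sample):
    """Hypothetical base learner: least-squares line, returned as a predictor."""
    xs = [x for x, _ in sample]
    ys = [y for _, y in sample]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in sample) / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def oob_error(data, fit, n_models, rng):
    """Mean absolute out-of-bag error of a bagged regression model."""
    n = len(data)
    # oob_preds[i] collects predictions from models that never saw point i.
    oob_preds = [[] for _ in range(n)]
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        model = fit([data[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                oob_preds[i].append(model(data[i][0]))
    # Aggregate per point, then average |prediction - truth| over the points.
    # Points that landed in every bootstrap sample have no OOB prediction
    # and are skipped (unlikely when n_models is large).
    errors = [abs(mean(p) - data[i][1]) for i, p in enumerate(oob_preds) if p]
    return mean(errors)


rng = random.Random(0)
# Simulated data (an assumption for illustration): y = 1 + 2x + Gaussian noise.
xs = [rng.random() for _ in range(100)]
data = [(x, 1.0 + 2.0 * x + rng.gauss(0, 0.5)) for x in xs]

err = oob_error(data, fit_line, n_models=200, rng=rng)
print(f"OOB mean absolute error: {err:.3f}")
```

For classification the same loop applies, with the per-point aggregate replaced by a majority vote and the absolute error replaced by a 0/1 mismatch indicator.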

