Statistical & Financial Consulting by Stanford PhD

When analyzing a particular phenomenon, one rarely has a clear picture of what the true model is. Even when the structure and distributions are clear, one may be unsure about how many predictors must be used in the model. This leads to uncertainty about the right number of parameters, as typically each predictor has one or more parameters associated with it. If model A is a particular case of model B then, on any particular data set used for estimation, the estimate of model A will have an average prediction error at least as big as that of the estimate of model B, provided both models are estimated by minimizing that error. Here the prediction error for any data point in the data set is the distance between the true value of the phenomenon and the predicted value of the phenomenon. Model B has extra parameters and hence extra flexibility compared to model A, so it is easier for model B to match the data points. In fact, if model B has as many parameters as there are data points, then almost always model B can match all data points perfectly, so the average prediction error is 0. It does not get better than that. So should we prefer model B to model A?
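The nesting argument above can be checked numerically. The following is a minimal sketch, assuming polynomial regression fitted by least squares as the family of nested models and squared distance as the prediction error; the data set, the seed and the function name training_error are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 8 points scattered around a straight line.
n = 8
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)


def training_error(degree):
    """Average squared prediction error, on the training set itself,
    of a least-squares polynomial fit of the given degree."""
    design = np.vander(x, degree + 1)  # columns: x^degree, ..., x, 1
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((design @ coef - y) ** 2))


# A polynomial of degree d is a particular case of one of degree d + 1,
# so the training error cannot increase as the degree grows.
errors = [training_error(d) for d in range(n)]
for d, e in enumerate(errors):
    print(f"degree {d} ({d + 1} parameters): training error = {e:.6f}")
```

With 8 data points, the degree 7 polynomial has 8 parameters and passes through every point, so its training error is zero up to rounding.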

The answer is "no", of course. The estimate of model B is simply complacent, indulging every whim of the given data set (the training set). On a new data set (the testing set) this estimate may perform very badly, as it will have to predict a completely new collection of data points. The model has already been estimated, so its parameters cannot be adjusted to please the new set of data points... The only way for a model to be successful is to develop a character, to be strong. Instead of trying to match every single point, the model must try to identify some inner structure of the data points. If the whole structure cannot be estimated accurately because of the small size of the training set, the model must describe only the main frame of the structure, its most important part. This part will have a smaller number of parameters, amenable to relatively accurate estimation. Having gone through the estimation stage, we now have something solid, something of value. The estimated structure will help us in performing prediction on each data point of the new set. A way of choosing a model with character is called a model selection method.
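The gap between training and testing performance can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true signal (a sine curve), the noise level, the sample sizes and the helper train_and_test_error are hypothetical and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)


def true_signal(x):
    # Hypothetical "inner structure" of the phenomenon.
    return np.sin(3.0 * x)


# Small training set and a larger, independent testing set.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_signal(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=1000)
y_test = true_signal(x_test) + rng.normal(scale=0.3, size=x_test.size)


def train_and_test_error(degree):
    """Fit a polynomial on the training set only, then report the average
    squared prediction error on the training and on the testing set."""
    coef, *_ = np.linalg.lstsq(np.vander(x_train, degree + 1), y_train, rcond=None)
    train_mse = np.mean((np.vander(x_train, degree + 1) @ coef - y_train) ** 2)
    test_mse = np.mean((np.vander(x_test, degree + 1) @ coef - y_test) ** 2)
    return float(train_mse), float(test_mse)


results = {d: train_and_test_error(d) for d in (1, 3, 9)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree {d}: training error = {train_mse:.4f}, testing error = {test_mse:.4f}")
```

The degree 9 fit has as many parameters as there are training points, so it matches the training set exactly. Its testing error cannot benefit from that, because the parameters were fixed before the new points were seen. In runs of this kind, a moderate degree typically predicts the new points better.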

In general, model B need not have very many parameters. It may have just several parameters more than model A. It will perform better on the training set for purely algebraic reasons, as discussed. But will it perform substantially better? Is the discrepancy between model A and model B large enough to justify a preference for model B, or is it just an artifact of the difference in complexity? A typical model selection method helps us answer the question by applying one or more of the following ideas:

1] for each candidate model, compute the performance score according to a formula which turns into an estimate of expected prediction error on a new data point for a known, standard family of models;

2] for each candidate model, compute the performance score which has a "penalty" term increasing with the number of parameters;

3] split the training data set into subsets and use some of them for estimation and the others for independent prediction diagnostics.
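The ideas listed above can be sketched in code. The example below assumes Gaussian linear regression with polynomial candidate models, for which the Akaike and Schwarz (Bayesian) criteria take the familiar form n*log(RSS/n) plus a penalty of 2k or k*log(n), up to a constant shared by all candidates. It also assumes 5-fold cross-validation for idea 3. The data, the fold count and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training set generated from a quadratic trend plus noise.
n = 40
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.4, size=n)

# Idea 3: fix one random split of the training set into 5 subsets (folds).
folds = np.array_split(rng.permutation(n), 5)


def fit(x_fit, y_fit, degree):
    """Least-squares polynomial coefficients, highest power first."""
    coef, *_ = np.linalg.lstsq(np.vander(x_fit, degree + 1), y_fit, rcond=None)
    return coef


def penalized_scores(degree):
    """Ideas 1 and 2: goodness of fit plus a penalty growing with the
    number of parameters k. Returns (AIC, BIC) up to a common constant."""
    k = degree + 1
    rss = np.sum((np.vander(x, degree + 1) @ fit(x, y, degree) - y) ** 2)
    fit_term = n * np.log(rss / n)
    return float(fit_term + 2 * k), float(fit_term + k * np.log(n))


def cv_error(degree):
    """Idea 3: estimate on four folds, predict the held-out fold, and
    average the squared prediction errors over all held-out points."""
    squared_errors = []
    for held_out in folds:
        kept = np.setdiff1d(np.arange(n), held_out)
        coef = fit(x[kept], y[kept], degree)
        pred = np.vander(x[held_out], degree + 1) @ coef
        squared_errors.append((pred - y[held_out]) ** 2)
    return float(np.mean(np.concatenate(squared_errors)))


degrees = range(6)
aic = {d: penalized_scores(d)[0] for d in degrees}
bic = {d: penalized_scores(d)[1] for d in degrees}
cv = {d: cv_error(d) for d in degrees}
for d in degrees:
    print(f"degree {d}: AIC = {aic[d]:8.2f}  BIC = {bic[d]:8.2f}  CV error = {cv[d]:.4f}")
```

For each criterion the candidate with the smallest score is preferred. The three criteria need not agree, since the penalty k*log(n) is heavier than 2k once n exceeds about 7.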

All these ideas underlie the model selection criteria described in the "Subcategories" section. In the overview above, the word "prediction" is used in a very general sense. Everything said is true for both regression and classification problems. In the regression setting, the phenomenon we are trying to predict is a numerical variable. In the classification setting, the phenomenon of interest is a categorical variable, not a number.
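The two settings differ mainly in how the prediction error of a single data point is measured. A minimal sketch, assuming squared distance for regression and a 0/1 mismatch indicator for classification (both common choices, though not the only ones); the function names and the toy values are hypothetical.

```python
import numpy as np


def regression_error(y_true, y_pred):
    """Average squared distance between numerical outcomes and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def classification_error(labels_true, labels_pred):
    """Fraction of data points whose predicted category is wrong."""
    return float(np.mean(np.asarray(labels_true) != np.asarray(labels_pred)))


# Regression: squared errors are 0, 0.25 and 1, so the average is 1.25 / 3.
reg = regression_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

# Classification: one of four labels is wrong, so the error rate is 0.25.
cls = classification_error(["a", "b", "a", "c"], ["a", "b", "c", "c"])

print(reg, cls)
```

Either function can play the role of the average prediction error in the discussion above, on the training set or on a testing set.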


