Statistical & Financial Consulting by Stanford PhD
CLASSIFICATION

In a classification setting we need to solve the following problem. We observe N objects. For each object we know the values of the variables X1, ..., Xp. We also know that the objects are split into classes 1, 2, ..., K, and for each observed object we know its class membership. We need to develop a statistical method that allows us to identify the class membership of a new object for which only the values of X1, ..., Xp are known.
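The setting can be made concrete with a minimal sketch in Python. It uses k-nearest neighbor purely as an illustration; the function name knn_classify, the choice k = 3, the Euclidean distance and the toy data (N = 6, p = 2, K = 2) are all assumptions made for this example and are not part of the problem statement.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, new_x, k=3):
    """Predict the class of new_x by majority vote among its k nearest observed objects."""
    # Euclidean distance from the new object to every observed object
    distances = [(math.dist(x, new_x), label) for x, label in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    # Count the class labels of the k closest objects and return the most frequent one
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up data: N = 6 objects, p = 2 variables (X1, X2), K = 2 classes
train_x = [(1.0, 1.2), (0.8, 0.9), (1.1, 0.7), (3.0, 3.2), (3.1, 2.9), (2.8, 3.3)]
train_y = [1, 1, 1, 2, 2, 2]

# New objects for which only X1 and X2 are known
print(knn_classify(train_x, train_y, (0.9, 1.0)))  # prints 1
print(knn_classify(train_x, train_y, (3.0, 3.0)))  # prints 2
```

The observed objects play the role of training data, and the function returns a class label for a new object from its X values alone.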

There are many approaches to classification. They can themselves be classified according to the following criteria.

1] Parametric (logistic regression, linear discriminant analysis, naive Bayes) versus nonparametric (k-nearest neighbor, classification trees / CART, boosted classification trees / MART, random forests, support vector machines, neural networks, genetic algorithms).

2] Simple (logistic regression, linear discriminant analysis, naive Bayes, k-nearest neighbor) versus complex (classification trees / CART, boosted classification trees / MART, random forests, support vector machines, neural networks, genetic algorithms).

3] Robust (logistic regression, k-nearest neighbor, classification trees / CART, boosted classification trees / MART, random forests) versus non-robust but statistically efficient (linear discriminant analysis, naive Bayes, support vector machines, neural networks, genetic algorithms).
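As an illustration of what "parametric" means in criterion 1, the sketch below fits a Gaussian naive Bayes classifier: each class is summarized by a handful of parameters (a class proportion plus a mean and a variance per variable), and the training objects are no longer needed once those parameters are estimated. The function names, the Gaussian assumption for each variable, the small variance floor and the toy data are assumptions made for this example only.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(train_x, train_y):
    """Estimate the class proportion and per-variable mean/variance for each class."""
    groups = defaultdict(list)
    for x, y in zip(train_x, train_y):
        groups[y].append(x)
    n = len(train_y)
    model = {}
    for label, rows in groups.items():
        cols = list(zip(*rows))  # one tuple of values per variable
        means = [sum(c) / len(c) for c in cols]
        # The small constant keeps each variance strictly positive (assumption of this sketch)
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / n, means, variances)
    return model

def predict_gaussian_nb(model, new_x):
    """Return the class with the largest log posterior, up to an additive constant."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        # Variables are treated as independent within a class (the "naive" assumption)
        for v, m, s2 in zip(new_x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))

# Made-up data: N = 6 objects, p = 2 variables, K = 2 classes
train_x = [(1.0, 1.2), (0.8, 0.9), (1.1, 0.7), (3.0, 3.2), (3.1, 2.9), (2.8, 3.3)]
train_y = [1, 1, 1, 2, 2, 2]

model = fit_gaussian_nb(train_x, train_y)
print(predict_gaussian_nb(model, (0.9, 1.0)))  # prints 1
print(predict_gaussian_nb(model, (3.0, 3.0)))  # prints 2
```

A nonparametric method such as k-nearest neighbor, by contrast, keeps all N observed objects and makes no distributional assumption about X1, ..., Xp.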

All major approaches are defined within the frequentist framework (no prior distribution on the parameters) but also allow for Bayesian modifications (a prior distribution is placed on the parameters). Simple approaches such as logistic regression, linear discriminant analysis or k-nearest neighbor should not be looked down upon. Because they are less flexible, they are less prone to overfitting than the complex techniques and serve as decent competition to them in many settings.
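The difference between the frequentist and Bayesian versions of a method can be written out for one case. The block below is a sketch for binary logistic regression only; the coding y_i in {0, 1}, the vector x_i of the values of X1, ..., Xp for object i, and the Gaussian prior with variance tau^2 are assumptions chosen for illustration.

```latex
% Frequentist fit: maximize the log-likelihood, no prior on \beta
\hat{\beta}_{\mathrm{ML}}
  = \arg\max_{\beta} \ell(\beta),
\qquad
\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \, x_i^{\top}\beta
  - \log\!\left(1 + e^{x_i^{\top}\beta}\right) \right]

% Bayesian modification: prior \beta \sim N(0, \tau^2 I);
% the posterior mode maximizes log-likelihood plus log-prior
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta} \left[ \ell(\beta)
  - \frac{1}{2\tau^{2}} \lVert \beta \rVert^{2} \right]
```

Under these assumptions the posterior mode coincides with ridge-penalized logistic regression: a smaller tau^2 shrinks the coefficients more strongly toward zero, and letting tau^2 grow without bound recovers the frequentist estimate.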


CLASSIFICATION REFERENCES

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification (2nd ed). New York: Wiley-Interscience.

Agresti, A. (2002). Categorical Data Analysis. New York: Wiley-Interscience.

McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley-Interscience.

Edelstein, H. A. (1999). Introduction to Data Mining and Knowledge Discovery (3rd ed). Potomac, MD: Two Crows Corp.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery & data mining. Cambridge, MA: MIT Press.

Bishop, C. M. (1996). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2017). Data Mining: Practical Machine Learning Tools and Techniques (4th ed). New York: Morgan Kaufmann.

Hilbe, J. M. (2009). Logistic Regression Models. Boca Raton, FL: Chapman & Hall / CRC Press.

Greene, W. H. (2011). Econometric Analysis (7th ed). Upper Saddle River, NJ: Prentice Hall.

Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed). New York: Springer-Verlag.


