In a classification setting we need to solve the following problem. We observe N objects. For each object we know the values of variables X1, ..., Xp. We also know that the objects are split into classes 1, 2, ..., K, and for each object we know its class membership. We need to develop a statistical method that identifies the class membership of a new object for which only the values of X1, ..., Xp are known.
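As a minimal illustration of this setup, the sketch below classifies a new object from N labeled training objects using k-nearest neighbor (one of the methods discussed later); the toy data, variable values, and k = 3 are assumptions chosen only for the example.

```python
from collections import Counter
import math

def knn_classify(train, new_x, k=3):
    """Classify new_x by majority vote among its k nearest training objects.

    train: list of (features, label) pairs; features are numeric tuples (X1, ..., Xp).
    """
    # Euclidean distance from new_x to every training object, sorted ascending
    dists = sorted((math.dist(x, new_x), label) for x, label in train)
    # Majority class among the k closest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: p = 2 variables (X1, X2), K = 2 classes, N = 6 objects
train = [((1.0, 1.0), 1), ((1.2, 0.8), 1), ((0.9, 1.1), 1),
         ((3.0, 3.0), 2), ((3.2, 2.9), 2), ((2.8, 3.1), 2)]
print(knn_classify(train, (1.1, 0.9)))  # -> 1 (all three nearest neighbors are class 1)
```

The only ingredients are the known (X1, ..., Xp) values and class labels of the N observed objects, exactly matching the problem statement above.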
There are many approaches to classification; they can themselves be classified according to the following criteria.
1] Parametric (logistic regression, linear discriminant analysis, naive Bayes) versus nonparametric (k-nearest neighbor, classification trees / CART, boosted classification trees / MART, random forests, support vector machines, neural networks, genetic algorithms).
2] Simple (logistic regression, linear discriminant analysis, naive Bayes, k-nearest neighbor) versus complex (classification trees / CART, boosted classification trees / MART, random forests, support vector machines, neural networks, genetic algorithms).
3] Robust (logistic regression, k-nearest neighbor, classification trees / CART, boosted classification trees / MART, random forests) versus non-robust but statistically efficient (linear discriminant analysis, naive Bayes, support vector machines, neural networks, genetic algorithms).
All major approaches are defined within the frequentist framework (no prior distribution on the parameters) but also admit Bayesian modifications (a prior distribution is placed on the parameters). The simple approaches such as logistic regression, linear discriminant analysis, or k-nearest neighbor should not be looked down upon: by design they avoid the problem of overfitting and remain decent competitors to advanced techniques in many settings.
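To make the parametric side of the taxonomy concrete, here is a sketch of logistic regression for K = 2 classes with a single variable X1, fit by gradient ascent on the log-likelihood; the toy data, learning rate, and iteration count are assumptions for illustration, not a recommended implementation.

```python
import math

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit P(Y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)        # gradient of log-likelihood w.r.t. b0
            g1 += (y - p) * x    # gradient of log-likelihood w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict(b0, b1, x):
    """Assign class 1 when the fitted probability exceeds 1/2, else class 0."""
    return 1 if 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) >= 0.5 else 0

# Toy data: one variable, two classes
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(predict(b0, b1, 1.0), predict(b0, b1, 3.8))
```

Note the contrast with a nonparametric method such as k-nearest neighbor: here the whole training set is summarized by two fitted parameters (b0, b1), which is what makes the method parametric, simple, and hard to overfit.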
CLASSIFICATION SUBCATEGORIES