Statistical & Financial Consulting by Stanford PhD

Identifying groups of individuals or objects that are similar to each other but different from individuals in other groups can be intellectually satisfying, profitable, or sometimes both. Using your customer base, you may be able to form clusters of customers who have similar buying habits or demographics. You can take advantage of these similarities to target offers to subgroups that are most likely to be receptive to them. Based on scores on psychological inventories, you can cluster patients into subgroups that have similar response patterns. This may help you in targeting appropriate treatment and studying typologies of diseases. By analyzing the mineral contents of excavated materials, you can study their origins and spread.

Although both cluster analysis and discriminant analysis classify objects (or cases) into categories, discriminant analysis requires you to know group membership for the cases used to derive the classification rule. The goal of cluster analysis is to identify the actual groups. For example, if you are interested in distinguishing between several disease groups using discriminant analysis, cases with known diagnoses must be available. Based on these cases, you derive a rule for classifying undiagnosed patients. In cluster analysis, you don’t know who or what belongs in which group. You often don’t even know the number of groups.

You start out with a number of cases and want to subdivide them into homogeneous groups. First, you choose the variables on which you want the groups to be similar. Next, you must decide whether to standardize the variables in some way so that they all contribute equally to the distance or similarity between cases. Finally, you have to decide which clustering procedure to use, based on the number of cases and types of variables that you want to use for forming clusters.
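To illustrate why the standardization decision matters, here is a minimal sketch in Python. The customer data are hypothetical, and z-scores (subtract the column mean, divide by the column standard deviation) are only one of several possible standardizations. The helper names `standardize` and `euclidean` are made up for this example.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(rows):
    """Convert every column to z-scores (mean 0, standard deviation 1)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; a constant column gives 0, so fall back to 1.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two cases."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical customers: (annual income in dollars, purchases per month)
customers = [
    [30000, 2],
    [32000, 9],
    [90000, 3],
]

# On the raw scale, income dominates the distance simply because its units are larger.
raw_01 = euclidean(customers[0], customers[1])
raw_02 = euclidean(customers[0], customers[2])

# After standardization, both variables contribute on a comparable scale.
z = standardize(customers)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(f"raw distances:          {raw_01:.1f} vs {raw_02:.1f}")
print(f"standardized distances: {z_01:.2f} vs {z_02:.2f}")
```

On the raw scale the first two customers look nearly identical because the difference in purchase frequency is swamped by income. After standardization that difference carries about as much weight as the income gap.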

For hierarchical clustering, you choose a statistic that quantifies how far apart (or how similar) two cases are. Then you select a method for forming the groups. Because you can have as many clusters as you do cases, your last step is to determine how many clusters you need to represent your data. You do this by looking at how similar clusters are when you create additional clusters or collapse existing ones.

In k-means clustering, you select the number of clusters you want in advance. The algorithm iteratively estimates the cluster means and assigns each case to the cluster whose mean is closest to it.
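The k-means iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. It assumes Euclidean distance and takes the first k cases as initial means so that the example is deterministic; real software typically uses more careful initialization. The function name `k_means` and the sample points are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two cases."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100):
    """Alternate between assigning cases to the nearest mean and updating the means.

    Assumption: the first k points serve as the initial cluster means.
    """
    means = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case goes to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means would not change either
        labels = new_labels
        # Update step: recompute each mean from the cases currently assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means


# Hypothetical two-dimensional cases forming two visibly separate groups
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]
labels, means = k_means(points, k=2)
print(labels)
print(means)
```

Because the number of clusters is fixed up front, it is common to run the procedure for several values of k and compare the resulting solutions.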

Two-step clustering is designed to make large problems tractable. In the first step, cases are assigned to “preclusters.” In the second step, the preclusters, rather than the individual cases, are clustered using the hierarchical clustering algorithm. You can specify the number of clusters you want or let the algorithm decide based on preselected criteria.
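The two-step idea can be illustrated with a deliberately simplified Python sketch. The preclustering rule used here (a single pass that joins a case to the nearest precluster if it lies within a distance threshold) and the merging rule (repeatedly combine the two clusters with the closest centroids) are assumptions chosen for brevity, not the procedure of any particular package. The names `precluster`, `merge_hierarchically`, and `threshold`, as well as the data, are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Mean of a cluster, recovered from its running sum and count."""
    return [s / cluster["n"] for s in cluster["sum"]]


def precluster(cases, threshold):
    """Step 1: one pass over the cases, building small, tight preclusters."""
    pres = []
    for idx, case in enumerate(cases):
        nearest = min(
            pres, key=lambda p: euclidean(case, centroid(p)), default=None
        )
        if nearest is not None and euclidean(case, centroid(nearest)) <= threshold:
            # Close enough: fold the case into the existing precluster summary.
            nearest["n"] += 1
            nearest["sum"] = [s + x for s, x in zip(nearest["sum"], case)]
            nearest["members"].append(idx)
        else:
            # Too far from everything seen so far: start a new precluster.
            pres.append({"n": 1, "sum": list(case), "members": [idx]})
    return pres


def merge_hierarchically(pres, n_clusters):
    """Step 2: agglomerative clustering applied to the preclusters only."""
    clusters = [
        {"n": p["n"], "sum": list(p["sum"]), "members": list(p["members"])}
        for p in pres
    ]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose centroids are closest.
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ab: euclidean(
                centroid(clusters[ab[0]]), centroid(clusters[ab[1]])
            ),
        )
        merged = {
            "n": clusters[i]["n"] + clusters[j]["n"],
            "sum": [x + y for x, y in zip(clusters[i]["sum"], clusters[j]["sum"])],
            "members": clusters[i]["members"] + clusters[j]["members"],
        }
        clusters = [c for pos, c in enumerate(clusters) if pos not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical cases; indices 0-2, 3-4 and 7, and 5-6 form three tight groups
cases = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.9),
    (6.5, 5.0), (6.7, 5.1),
    (5.2, 5.1),
]
preclusters = precluster(cases, threshold=1.0)
final = merge_hierarchically(preclusters, n_clusters=2)
for cluster in final:
    print(sorted(cluster["members"]), centroid(cluster))
```

The saving comes from the second step: the costly pairwise comparisons are made between a handful of preclusters instead of between all the original cases.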

The term "cluster analysis" does not identify a particular statistical method or model, as do discriminant analysis, factor analysis, and regression. You often don’t have to make any assumptions about the underlying distribution of the data. Using cluster analysis, you can also form groups of related variables, similar to what you do in factor analysis.

There are numerous ways you can sort cases into groups. The choice of a method depends on, among other things, the size of the data file. Methods commonly used for small data sets are impractical for data files with thousands of cases. For a list and descriptions of clustering methods, see the references below.


Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011), Cluster Analysis, Wiley Series in Probability and Statistics.

Michie, D., Spiegelhalter, D. and Taylor, C., eds (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood series in Artificial Intelligence, Ellis Horwood.

Dasarathy, B. (1991), Nearest Neighbor Pattern Classification Techniques, IEEE Computer Society Press.

Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.