Statistical & Financial Consulting by Stanford PhD
NEAREST NEIGHBOR

The k-nearest neighbor algorithm is one of the simplest methods for classification and prediction. Its simplicity, however, does not undermine its competitiveness. Across a wide range of classification problems, k-nearest neighbor places among the top three or four performers, often beating more sophisticated off-the-shelf methods. Nearest neighbor is widely used in medicine, biology, marketing and finance.

We distinguish two settings: classification and regression. In the classification setting we work with a set of training data points of the form (X, Y), where X is a vector of attributes and Y is a categorical variable of particular importance. Example: Y is healthy/sick and X comprises blood pressure, hemoglobin A1C level, temperature, age, gender, race, etc. Suppose we observe a new data point (X0, ?), where the attributes are known but variable Y has to be predicted. Since Y is categorical, predicting it amounts to choosing the likeliest category. We find points (X1, Y1), ... , (Xk, Yk) such that X1, ... , Xk are the k closest vectors to X0 in the training data set. As the measure of distance between points we choose some standard distance function; for example, if all the attributes are numeric we can use Euclidean distance on the standardized attributes. Then we predict Y for data point (X0, ?) as

Y0 = most frequent category among values Y1, ... , Yk.

If there are two or more most frequent categories, randomization is used.
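The classification rule above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the data layout, and the details of standardization and tie-breaking are assumptions made for the example.

```python
import math
import random
from collections import Counter

def column_scales(X):
    """Standard deviation of each attribute (column), used to standardize.

    A zero standard deviation is replaced by 1 so constant attributes
    simply drop out of the distance."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(p)]

def knn_classify(X_train, y_train, x0, k=3):
    """Predict the category of x0 by majority vote among its k nearest
    training points, using Euclidean distance on standardized attributes.
    Ties among the most frequent categories are broken at random."""
    scales = column_scales(X_train)

    def dist(x):
        # Dividing each coordinate difference by its column's standard
        # deviation is equivalent to Euclidean distance on standardized data.
        return math.sqrt(sum(((a - b) / s) ** 2
                             for a, b, s in zip(x, x0, scales)))

    nearest = sorted(range(len(X_train)), key=lambda i: dist(X_train[i]))[:k]
    counts = Counter(y_train[i] for i in nearest)
    top = max(counts.values())
    winners = [c for c, n in counts.items() if n == top]
    return random.choice(winners)
```

In practice k is a tuning parameter; in two-class problems an odd k avoids voting ties altogether.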

The regression setting is quite similar. We have a set of training data points of the form (X, Y), where X is a vector of attributes and Y is a variable of particular importance. Now, however, Y is numeric rather than categorical. Example: Y is the return on a stock in a given year and X comprises the company's capitalization, number of employees, average past returns, industry type, geographic location, return on the S&P, return on oil futures, etc. We observe a new data point (X0, ?), where the attributes are known but Y has to be predicted. Since Y is numeric, averaging relevant observations is fine: the prediction does not have to fall into any particular category. Therefore, we find points (X1, Y1), ... , (Xk, Yk) such that X1, ... , Xk are the k closest vectors to X0 in the training data set. We predict variable Y as

Y0 = weighted average of values Y1, ... , Yk.

Weights can be the inverse distances from X0 to X1, ... , Xk, or something else. Equal weights are not advisable, as they may produce large discontinuities in the prediction surface.
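Under the same assumptions as before, the regression prediction with inverse-distance weights, one of the weighting schemes mentioned above, might be sketched as:

```python
import math

def knn_regress(X_train, y_train, x0, k=3, eps=1e-12):
    """Predict a numeric Y for x0 as the inverse-distance-weighted average
    of the Y values of the k nearest training points.

    Attributes are assumed to be on comparable scales; otherwise they
    should be standardized first, as in the classification setting."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x0)))
             for x in X_train]
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # eps guards against division by zero when x0 coincides with a
    # training point
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    return sum(w * y_train[i] for w, i in zip(weights, nearest)) / sum(weights)
```

With inverse-distance weights the prediction changes smoothly as X0 moves, whereas with equal weights it jumps whenever a point enters or leaves the neighbor set.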

NEAREST NEIGHBOR REFERENCES

Dasarathy, B. V., ed. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

Shakhnarovich, G., Darrell, T., & Indyk, P., eds. (2006). Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Cambridge, MA: MIT Press.

Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1): 21–27.

Hall, P., Park, B. U., & Samworth, R. J. (2008). Choice of neighbor order in nearest-neighbor classification. Annals of Statistics 36 (5): 2135–2152.

