Nearest Neighbor in R, Matlab, SPSS, SAS - Data Mining Tutor in New York, Boston, Chicago, London, Toronto, Sydney, Melbourne, San Francisco, Los Angeles

Statistical & Financial Consulting by Stanford PhD

Home Page

NEAREST NEIGHBOR

The K-nearest Neighbor Algorithm is one of the simplest methods for classification and prediction. Its simplicity does not undermine its competitiveness, nonetheless. Over a wide range of classification problems k-nearest neighbor gets into top 3-4 performers, often beating more sophisticated off-the-shelf methods. Nearest neighbor has widespread use in medicine, biology, marketing and finance.

We distinguish two settings: classification and regression. In the classification setting we work with a set of training data points of the form (X,Y). X is a vector of attributes and Y is a categorical variable of particular importance. Example: Y is healthy/sick and X is the blood pressure, hemoglobin A1C level, temperature, age, gender, race, etc. Suppose we observe a new data point (X₀,?). Here the attributes are known but variable Y has to be predicted. Since Y is categorical, predicting it is essentially choosing the likeliest category. We find points (X₁,Y₁), ... , (X_k,Y_k) such that each X_i is among k closest vectors to X₀ in the training data set. As the measure of distance between the points we choose some standard loss function. For example, if all the attributes are numeric we can use Euclidean distance on the standardized attributes. Then, we predict Y for data point (X₀,?) as

Y₀ = most frequent category among values Y₁, ... , Y_k.

If there are two or more most frequent categories, randomization is used.

The regression setting is quite similar. We have a set of training data points of the form (X,Y), where X is a vector of attributes and Y is a variable of particular importance. Now however, Y is numeric and not categorical. Example: Y is the return on a stock in a given year and X is the company capitalization, number of employees, average past returns, industry type, geographic location, return on S&P, return on oil futures, etc. We observe a new data point (X₀,?), where the attributes are known but Y has to be predicted. Since Y is numeric, averaging relevant observations is fine - the prediction does not have to fall into any particular category. Therefore, we find points (X₁,Y₁), ... , (X_k,Y_k) such that each X_i is among k closest vectors to X₀ in the training data set. We predict variable Y as

Y₀ = weighted average of values Y₁, ... , Y_k.

Weights can be inverse distances of X₀ to X₁, ... , X_k, or something else. Equal weights are not advisable, as this may result in large discontinuities in the prediction surface.

NEAREST NEIGHBOR REFERENCES

Dasarathy, B. V., ed (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

Shakhnarovich G., Darrell T., & Indyk P., ed (2006). Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Cambridge, MA: MIT Press.

Cover T. M., & Hart P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1): pp. 21–27.

Hall, P., Park, B. U., & Samworth, R. J. (2008). Choice of neighbor order in nearest-neighbor classification. Annals of Statistics 36 (5): pp. 2135–2152.

NEAREST NEIGHBOR RESOURCES

Videolectures on k-nearest neighbor classification

BACK TO THE STATISTICAL ANALYSES DIRECTORY

IMPORTANT LINKS ON THIS SITE

Detailed description of the services offered in the areas of statistical consulting and financial consulting: home page, types of service, experience, case studies, payment options and statistics tutoring
Directory of financial topics

consulting@stanfordphd.com