Statistical & Financial Consulting by Stanford PhD

The k-nearest neighbor algorithm is one of the simplest methods for classification and prediction. Nonetheless, its simplicity does not undermine its competitiveness: over a wide range of classification problems, k-nearest neighbor ranks among the top 3-4 performers, often beating more sophisticated off-the-shelf methods. Nearest neighbor methods are widely used in medicine, biology, marketing and finance.

We distinguish two settings: classification and regression. In the classification setting we work with a set of training data points of the form (X, Y), where X is a vector of attributes and Y is a categorical variable of particular importance. Example: Y is healthy/sick and X comprises blood pressure, hemoglobin A1C level, temperature, age, gender, race, etc. Suppose we observe a new data point (X_{0}, ?), where the attributes are known but the variable Y has to be predicted. Since Y is categorical, predicting it amounts to choosing the likeliest category. We find points (X_{1},Y_{1}), ... , (X_{k},Y_{k}) such that each X_{i} is among the k vectors closest to X_{0} in the training data set. As the measure of distance between points we choose some standard distance metric; for example, if all the attributes are numeric, we can use Euclidean distance on the standardized attributes. Then we predict Y for data point (X_{0}, ?) as

Y_{0} = most frequent category among values Y_{1}, ... , Y_{k}.

If there are two or more most frequent categories, randomization is used.
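The voting rule above can be sketched in Python. The function name `knn_classify` and the toy data are illustrative, and the attributes are assumed to be already standardized, so plain Euclidean distance is an appropriate metric:

```python
import math
import random
from collections import Counter

def knn_classify(train_X, train_y, x0, k):
    """Predict the category of x0 by majority vote among its k nearest
    training points under Euclidean distance. Ties among the most
    frequent categories are broken at random."""
    # Indices of the k training points closest to x0.
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], x0)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    # Randomize among equally frequent categories.
    return random.choice([c for c, n in votes.items() if n == top])

# Toy example (illustrative data): two attributes per patient.
X = [[170, 70], [180, 85], [160, 55], [175, 80]]
y = ["healthy", "sick", "healthy", "sick"]
knn_classify(X, y, [172, 75], k=3)  # → "sick" (2 of the 3 neighbors are sick)
```

With k = 3 the neighbors of [172, 75] are the first, fourth and second points, so "sick" wins the vote 2 to 1 and no tie-breaking is needed.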

The regression setting is quite similar. We have a set of training data points of the form (X, Y), where X is a vector of attributes and Y is a variable of particular importance. Now, however, Y is numeric rather than categorical. Example: Y is the return on a stock in a given year and X comprises the company's capitalization, number of employees, average past returns, industry type, geographic location, return on the S&P, return on oil futures, etc. We observe a new data point (X_{0}, ?), where the attributes are known but Y has to be predicted. Since Y is numeric, averaging relevant observations is fine: the prediction does not have to fall into any particular category. Therefore, we find points (X_{1},Y_{1}), ... , (X_{k},Y_{k}) such that each X_{i} is among the k vectors closest to X_{0} in the training data set. We predict variable Y as

Y_{0} = weighted average of values Y_{1}, ... , Y_{k}.

Weights can be the inverse distances of X_{0} to X_{1}, ... , X_{k}, or something else. Equal weights are not advisable, as they may result in large discontinuities in the prediction surface.
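The inverse-distance weighting scheme can be sketched as follows. The function name `knn_regress` and the small epsilon guard are illustrative choices, not part of the method's definition:

```python
import math

def knn_regress(train_X, train_y, x0, k, eps=1e-9):
    """Predict Y at x0 as the inverse-distance-weighted average of the
    Y values of the k nearest training points (Euclidean distance)."""
    # Indices of the k training points closest to x0.
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], x0)
    )[:k]
    # Inverse-distance weights; eps guards against division by zero
    # when x0 coincides with a training point.
    w = [1.0 / (math.dist(train_X[i], x0) + eps) for i in nearest]
    return sum(wi * train_y[i] for wi, i in zip(w, nearest)) / sum(w)

# Toy example: x0 = [1.5] sits halfway between the points at 1 and 2,
# so with k = 2 the equally weighted average is 1.5.
knn_regress([[0.0], [1.0], [2.0], [10.0]], [0.0, 1.0, 2.0, 10.0], [1.5], k=2)
```

Note that a query point lying exactly on a training point receives nearly all of the weight, so the prediction surface interpolates the training data.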

