Statistical & Financial Consulting by Stanford PhD


**1. Introduction**

Support Vector Machine (SVM) is a classification and regression method making use of two ideas: *kernelization* and *slack variables*. We will go over these ideas in detail in what follows. Suppose we have predictors X_1, ..., X_p and dependent variable Y. In a classification setting Y is categorical, taking two or more possible values. In a regression setting Y is real-valued (continuous, discrete or a mixture of those two types). Any predictive solution, that is, any rule telling us what Y is likely or expected to be given X_1, ..., X_p, is characterized by one or several multidimensional surfaces. In a classification setting the surfaces separate the Y-classes in R^p, the p-dimensional Euclidean space composed of possible values of (X_1, ..., X_p). In a regression setting there is one surface in R^p, giving the expected value of Y conditional on particular values of X_1, ..., X_p. Solving the predictive problem is equivalent to finding the aforementioned surface(s). If the distributional properties of the data are complex, the surfaces may be non-linear, wiggly, locally explosive, discontinuous or even harder to imagine. One thing is for sure: they will not be characterized by linear functions of X_1, ..., X_p.

A standard approach is augmenting the original list of predictors with various non-linear and, potentially, discontinuous transformations: each input x is mapped into a feature vector Phi(x) = (phi_1(x), ..., phi_M(x)), where M may be much larger than p and is allowed to be infinite. This new feature space is rich enough for linear methods to succeed: a surface that is linear in phi_1(x), ..., phi_M(x) may be highly non-linear in the original predictors.

By a property of inner product, the bivariate function K(x, x') = <Phi(x), Phi(x')> is symmetric and non-negative definite. For some mappings Phi this inner product collapses into a simple closed-form expression of x and x'. For example, if Phi(x) lists all monomials of degree 2 in the coordinates of x, with suitable constant factors, then <Phi(x), Phi(x')> = <x, x'>^2, which takes only O(p) operations to evaluate no matter how many features are involved.

But can we always expect simplicity? We might get carried away with some interesting, non-trivial transformations. Yet the inner products <Phi(x), Phi(x')> may no longer reduce to a cheap formula, and summing over the features directly is expensive or, when M is infinite, outright impossible.

The solution lies in looking at the problem from the opposite end. We first identify a kernel K(x, x'): a symmetric, non-negative definite bivariate function that we know how to evaluate directly. Suppose K is continuous. Then, by Mercer's theorem, it admits the expansion K(x, x') = sum_i lambda_i e_i(x) e_i(x'), where the eigenvalues lambda_i are non-negative and the eigenfunctions e_i are orthonormal. Setting Phi(x) = (sqrt(lambda_1) e_1(x), sqrt(lambda_2) e_2(x), ...), we see that K(x, x') is an inner product of features in some, possibly infinite-dimensional, space. There are extensions to discontinuous cases as well... Mercer's theorem indicates that our choice of the kernel leads to an implicit choice of the feature space, yet we are not obliged to hear all the details. We are not obliged to work with the features explicitly: every quantity the optimization needs can be written in terms of kernel evaluations on pairs of data points.
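To make this concrete, here is a small numerical sketch in Python (NumPy assumed, not part of the original exposition) verifying, for the standard degree-2 polynomial example, that the explicit feature-space inner product and the direct kernel evaluation agree:

```python
import numpy as np

# Explicit feature map for the homogeneous polynomial kernel of degree 2
# in p = 2 dimensions: Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# The same quantity computed via the kernel, with no explicit features.
def K(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(z))   # inner product in the feature space
rhs = K(x, z)                  # kernel evaluation in the input space
assert np.isclose(lhs, rhs)    # both equal (x . z)^2 = 1
```

The kernel route costs O(p) operations; the explicit route costs O(p^2) features for degree 2, and grows combinatorially with the degree.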

Support vector machine is a particular optimization engine phrased in terms of kernels. In the classification setting, SVM aims at making the separating surfaces smooth but also penalizes data points that find themselves on the wrong side of the surface. For each such data point a slack variable is defined as its distance to the surface. Specific aggregate functions of the slacks are entered into the objective function and the constraints of the minimization problem. The exact definitions of these functions determine the version of SVM.
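As an illustration of the role of the slack penalty, the following sketch assumes scikit-learn is available (not part of the original text); its parameter C is the weight on the aggregate slack, so a small C tolerates many margin violations while a large C punishes them:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn is an assumed dependency

rng = np.random.default_rng(0)
# Two noisy classes in the plane, centered at (-1, -1) and (1, 1).
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Small C: smooth boundary, many slack points.
# Large C: fewer margin violations, wigglier boundary.
for C in (0.01, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(C, clf.support_.size)  # number of support vectors
```

Points that end up on the wrong side of the margin become support vectors, so the count printed above typically shrinks as C grows.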

In the regression setting, SVM aims at making the predictive surface smooth but also penalizes data points that fall far from the surface. For each data point below the surface a negative slack is defined as the distance to the surface. For each data point above the surface a positive slack is defined as the distance to the surface. Specific aggregate functions of negative and positive slacks are entered into the objective function and the constraints of the minimization problem. The exact definitions of these functions determine the version of SVM: eps-SVR, nu-SVR, bound-constraint SVR or something else.
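A sketch of the two most common regression variants, again assuming scikit-learn (not part of the original text): its SVR class implements eps-SVR, where residuals inside the epsilon-tube are not penalized, and NuSVR implements nu-SVR, where nu controls the fraction of points allowed outside the tube:

```python
import numpy as np
from sklearn.svm import SVR, NuSVR  # scikit-learn is an assumed dependency

rng = np.random.default_rng(1)
X = np.linspace(0, 4, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)  # noisy sine curve

# eps-SVR: residuals smaller than epsilon incur no slack penalty.
eps_svr = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, y)

# nu-SVR: nu bounds the fraction of training points outside the tube.
nu_svr = NuSVR(kernel="rbf", nu=0.5, C=10.0).fit(X, y)

print(eps_svr.score(X, y), nu_svr.score(X, y))  # R^2 on training data
```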

In what follows, ||x - x'|| is the Euclidean norm of the difference x - x' and <x, x'> is the Euclidean inner product of vectors x and x'. Popular kernels are listed below.

1] Linear kernel: K(x, x') = <x, x'>.

Used on classification problems where the classes are linearly separable (for the most part) and regression problems where the relationship between the dependent and independent variables is mostly linear. Also used, quite successfully, in situations where the number of dimensions is high and the data are sparse as a result. Popular in text mining.
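A sketch of the text-mining use case, assuming scikit-learn (not part of the original text); the toy corpus and labels are invented purely for illustration:

```python
# Linear-kernel SVM on sparse, high-dimensional text features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC  # scikit-learn is an assumed dependency

docs = ["low risk bond fund", "government bond yield",
        "deep learning model", "neural network training"]
labels = [0, 0, 1, 1]  # 0 = finance, 1 = machine learning (made-up tags)

X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```

With thousands of TF-IDF features and few informative interactions between them, the linear kernel is usually both the fastest and the hardest to beat.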

2] Homogeneous polynomial kernel: K(x, x') = <x, x'>^d, where the degree d is a positive integer.

A direct generalization of Euclidean inner product and, therefore, relatively intuitive to interpret. Works well on certain image recognition problems.

3] Inhomogeneous polynomial kernel: K(x, x') = (<x, x'> + c)^d, where the offset c > 0 and the degree d is a positive integer.

4] Gaussian kernel: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), where sigma > 0 is the bandwidth.

Works well on a variety of non-linear surfaces in high-dimensional problems. Despite its resemblance to the Laplace kernel, the Gaussian kernel oftentimes delivers completely different answers.

5] Laplace kernel: K(x, x') = exp(-||x - x'|| / sigma), where sigma > 0 is the bandwidth.

Closely related to the Gaussian kernel, but notice the absence of the square in the exponent.
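The two kernels are easy to compare numerically. A minimal sketch in Python (NumPy assumed), with the bandwidth sigma fixed at 1:

```python
import numpy as np

# Gaussian and Laplace kernels side by side; sigma is the bandwidth.
def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def laplace(x, z, sigma=1.0):
    return np.exp(-np.sqrt(np.sum((x - z) ** 2)) / sigma)

x = np.zeros(3)
z = np.full(3, 2.0)          # distance ||x - z|| = 2 * sqrt(3)
print(gaussian(x, z))        # exp(-6): the squared exponent decays fast
print(laplace(x, z))         # exp(-2 sqrt(3)): a much heavier tail
```

The squared distance in the Gaussian exponent makes it far more tolerant of nearby points and far harsher on distant ones, which is one source of the "completely different answers" noted above.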

6] Sigmoid kernel: K(x, x') = tanh(kappa <x, x'> + c), where kappa > 0 and c are parameters.

Borrowed from the field of neural networks. Works well on certain image recognition problems.

7]

Works well on certain image recognition problems.

8]

Tends to perform well on regression tasks.

9]

where sigma > 0 is the bandwidth. Tends to perform well on regression tasks.

10] Kernel combinations: for any collection of non-negative definite kernels K_1, ..., K_m, the weighted sum K(x, x') = w_1 K_1(x, x') + ... + w_m K_m(x, x'), where w_1, ..., w_m >= 0, is a non-negative definite kernel as well. So is the product K_1(x, x') * ... * K_m(x, x'). This gives a simple recipe for constructing new kernels from the dictionary above.
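A sketch of this closure property in practice, assuming scikit-learn (not part of the original text), which accepts a user-defined kernel as a callable returning the Gram matrix; the weights 0.7 and 0.3 below are arbitrary non-negative choices:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn is an assumed dependency
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

def combined_kernel(X, Z, w1=0.7, w2=0.3):
    # Non-negative weights keep the sum non-negative definite.
    return w1 * rbf_kernel(X, Z, gamma=0.5) + w2 * linear_kernel(X, Z)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like, non-linear labels

# SVC calls combined_kernel(X, Z) to build the Gram matrix itself.
clf = SVC(kernel=combined_kernel).fit(X, y)
print(clf.score(X, y))
```

The RBF component handles the non-linear structure while the linear component preserves any global linear trend; the weights can be tuned by cross-validation like any other hyperparameter.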


- Support Vector Machines Tutorial (covers implementations in R and C#)
- Introductory Tutorial on Support Vector Machines (focus on classification problems)
- Data Mining Resources, Dept. of Computer Science, Purdue University

- Kernel Functions for Machine Learning Applications (an extensive dictionary, albeit with slightly different conventions than usual)
