Statistical & Financial Consulting by Stanford PhD
SUPPORT VECTOR MACHINE

1. Introduction

Support Vector Machine (SVM) is a machine learning method for regression and classification that makes use of two ideas: kernelization and slack variables. We will go over these ideas in detail in what follows. Suppose we have $p$ predictors $X_1, ..., X_p$ and a dependent variable $Y$. In a classification setting $Y$ is categorical, taking two or more possible values. In a regression setting $Y$ is real-valued (continuous, discrete or a mixture of the two types). Any predictive solution, that is, any rule telling us what $Y$ is likely or expected to be given $X_1, ..., X_p,$ is characterized by one or several multidimensional surfaces. In a classification setting the surfaces separate the $Y$-classes in $R^p,$ the $p$-dimensional Euclidean space of possible values of $X_1, ..., X_p.$ In a regression setting there is one surface in $R^{p+1},$ giving the expected value of $Y$ conditional on particular values of $X_1, ..., X_p.$ Solving the predictive problem is equivalent to finding the aforementioned surface(s). If the distributional properties of the data are complex, the surfaces may be non-linear, wiggly, locally explosive, discontinuous or even harder to imagine. One thing is certain: they will not be characterized by linear functions of $x_1, ..., x_p.$

A standard approach is augmenting the original list of predictors with various non-linear and, potentially, discontinuous transformations:

$x_i = (x_{i1}, ..., x_{ip})\ \ \rightarrow\ \ \Phi(x_i) = (\varphi_1(x_i), ..., \varphi_m(x_i)),\ \ \ \ \ i = 1, ..., n.$

This new feature space is much larger ($m \gg p$). A linear surface in this space may correspond to something complex in the original space of predictors, something capable of capturing intricate dependencies containing interactions, non-linearities, localized behavior, and so on. Further, it has been noticed that, with the right choice of the loss function, a classification or regression problem leads to an optimization problem where the features $\Phi(x_1), ..., \Phi(x_n)$ enter the equations only as inner products of the form $\Phi(x_i)^T \Phi(x_j).$ So we need to know only these values for the purpose of all calculations. Next, let us define

$K(x_i,x_j) = \Phi(x_i)^T \Phi(x_j).$

By a property of the inner product, the bivariate function $K(x,y)$ is a non-negative definite kernel, i.e., for any set of vectors $z_1, ..., z_N$ and any scalars $c_1, ..., c_N$

$\sum_{i=1,...,N} \sum_{j=1,...,N} c_i c_j K(z_i,z_j) \geq 0.$
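As a quick numerical illustration, this quadratic form can be checked directly (a minimal sketch in Python with NumPy; the Gaussian kernel from Section 2 and the random test vectors are illustrative choices, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian RBF kernel: exp(-||x - y||^2 / (2 sigma^2))
    d = x - y
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

N, p = 8, 3
z = rng.normal(size=(N, p))   # arbitrary vectors z_1, ..., z_N
c = rng.normal(size=N)        # arbitrary scalars c_1, ..., c_N

# Quadratic form sum_i sum_j c_i c_j K(z_i, z_j)
quad = sum(c[i] * c[j] * gaussian_kernel(z[i], z[j])
           for i in range(N) for j in range(N))
print(quad >= -1e-12)         # non-negative, up to rounding error
```

Repeating this with fresh random draws never produces a negative value, which is exactly what non-negative definiteness guarantees.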

For some mappings $\Phi()$ kernel $K(x_i,x_j)$ is nice and simple. For example, if

$x_i = (x_{i1}, x_{i2})\ \ \rightarrow\ \ \Phi(x_i) = (x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2),$

then

$K(x_i,x_j) = x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 = (x_i^T x_j)^2.$
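This identity is easy to verify numerically (a sketch in Python with NumPy; note the $\sqrt{2}$ scaling of the cross term in the feature map, which is what makes the expansion match $(x_i^T x_j)^2$ exactly):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel
    # in two dimensions: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    # The same kernel evaluated directly, without building features
    return (x @ y) ** 2

xi = np.array([1.5, -0.7])
xj = np.array([0.3, 2.0])

print(phi(xi) @ phi(xj))   # 0.9025, inner product in feature space
print(k(xi, xj))           # 0.9025, identical value via the kernel
```

The kernel evaluation costs one inner product in $R^2$, while the explicit route builds three features per point; for higher degrees and dimensions the gap widens dramatically.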

But can we always expect simplicity? We might get carried away with some interesting, non-trivial transformations. Yet inner products $K(x_i,x_j) = \Phi(x_i)^T \Phi(x_j)$ must always be tractable. More importantly, we may break out into feature spaces of infinite dimensionality. Can we handle infinite sums of the form

$\Phi(x_i)^T \Phi(x_j) = \varphi_1(x_i) \varphi_1(x_j) + ... + \varphi_m(x_i) \varphi_m(x_j) + ... \ \ \ ?$

The solution lies in looking at the problem from the opposite end. We first identify a kernel $K(x,y)$ we want to work with and then ask whether it corresponds to some non-linear set of features $\Phi(x) = (\varphi_1(x), ..., \varphi_m(x)).$ Our search is facilitated by the following, rather famous result.

Theorem (Mercer 1909, version for real-valued functions):
Suppose $K(x,y)$ is a continuous symmetric non-negative definite kernel on $[a,b]^2$. Then there is an orthonormal basis $\bigl\{e_i(x)\bigr\}_{i \geq 1}$ of the Hilbert space $L^2([a,b])$ and a sequence of non-negative eigenvalues $\lambda_i$ such that

$K(x,y) = \sum_{i \geq 1} \lambda_i e_i(x) e_i(y) = \Bigl(\lambda_1^{1/2} e_1(x), ..., \lambda_i^{1/2} e_i(x), ...\Bigr)^T \Bigl(\lambda_1^{1/2} e_1(y), ..., \lambda_i^{1/2} e_i(y), ...\Bigr)$

and the eigenfunctions $e_i(x)$ corresponding to non-zero eigenvalues $\lambda_i$ are continuous.

There are extensions to discontinuous cases as well. Mercer's theorem indicates that our choice of the kernel leads to an implicit choice of the feature space, yet we are not obliged to know all the details. We are not obliged to work with the features $\Phi(x_i)$ directly. Rather, we evaluate the inner products $\Phi(x_i)^T \Phi(x_j)$ as the algebraic form $K(x_i,x_j)$, which was chosen for its convenience and simplicity. This is known as the kernel trick. It lies at the foundation of the kernelization technique: taking good old statistical methods and changing the $x_i^T x_j$ terms in the solution to $K(x_i,x_j)$, which is equivalent to enlarging the universe to the feature space corresponding to $K()$ and performing all the calculations there.
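To make the kernelization recipe concrete, here is a sketch in Python with NumPy. Ridge regression stands in as the "good old" method: its solution depends on the data only through inner products, so replacing them with Gaussian kernel evaluations yields kernel ridge regression. The toy sine data, bandwidth and ridge penalty are illustrative choices.

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    # Gram matrix G[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(40, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)

# Ridge regression in the implicit feature space: only K(x_i, x_j) is needed.
lam = 1e-3
K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Prediction at a new point is a kernel-weighted sum over training points.
X_new = np.array([[0.0], [1.0]])
y_hat = gaussian_gram(X_new, X) @ alpha
print(y_hat)   # approximately sin(0) and sin(1)
```

The fitted surface is non-linear in $x$ even though the algebra is that of linear ridge regression, carried out in the feature space induced by the Gaussian kernel.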

Support vector machine is a particular optimization engine phrased in terms of kernels. In the classification setting, SVM aims to make the separating surfaces smooth while penalizing data points that find themselves on the wrong side of a surface. For each such data point a slack variable is defined, which is the distance to the relevant separating surface. Then certain aggregate functions of all the slacks are added to the objective function and the constraints of the minimization problem. There are several ways of doing this, which gives rise to different flavors of SVM: C-SVC, nu-SVC, bound-constraint SVC, and so on.
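For intuition, the way slacks enter the C-SVC objective can be sketched as follows (Python with NumPy; the decision function $f(x) = w^T x + b$ and the toy points are fixed by hand rather than optimized, so the numbers only illustrate the bookkeeping, not a trained classifier):

```python
import numpy as np

def c_svc_objective(w, b, X, y, C=1.0):
    # Decision values f(x_i) = w^T x_i + b, with labels y_i in {-1, +1}
    f = X @ w + b
    # Slack xi_i = max(0, 1 - y_i f(x_i)): zero for points safely on the
    # correct side of the margin, positive for margin violations
    slack = np.maximum(0.0, 1.0 - y * f)
    # C-SVC objective: smoothness term plus C times the total slack
    return 0.5 * (w @ w) + C * slack.sum(), slack

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.1, 0.0]])
y = np.array([1.0, -1.0, -1.0])   # third point sits on the wrong side
w = np.array([1.0, 0.0])
b = 0.0

obj, slack = c_svc_objective(w, b, X, y)
print(slack)   # slacks: 0, 0 and 1.1
print(obj)     # 0.5 * ||w||^2 + C * 1.1 = 1.6
```

Training the C-SVC amounts to minimizing this objective over $w$ and $b$; the constant $C$ trades off smoothness of the surface against tolerance of misclassified points.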

In the regression setting, SVM aims at making the predictive surface smooth but also penalizes data points that fall far from the surface. For each data point below the surface a negative slack is defined as the distance to the surface. For each data point above the surface a positive slack is defined as the distance to the surface. Specific aggregate functions of negative and positive slacks are entered into the objective function and the constraints of the minimization problem. The exact definitions of these functions determine the version of SVM: eps-SVR, nu-SVR, bound-constraint SVR or something else.
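The regression slacks can be sketched the same way (Python with NumPy; the fitted surface values, responses and tube width $\varepsilon$ are illustrative, following the eps-SVR convention in which residuals inside the $\varepsilon$-tube incur no penalty):

```python
import numpy as np

def svr_slacks(y, f, eps=0.5):
    # eps-SVR ignores residuals inside the epsilon-tube around the surface.
    # xi_plus: how far the observation sits above the tube;
    # xi_minus: how far it sits below the tube.
    xi_plus = np.maximum(0.0, (y - f) - eps)
    xi_minus = np.maximum(0.0, (f - y) - eps)
    return xi_plus, xi_minus

f = np.array([1.0, 1.0, 1.0, 1.0])    # fitted surface values
y = np.array([1.2, 2.0, 0.9, -0.5])   # observed responses

xi_plus, xi_minus = svr_slacks(y, f)
print(xi_plus)    # second point: 0.5 above the tube
print(xi_minus)   # fourth point: 1.0 below the tube
```

Points one and three fall inside the tube and contribute nothing; only the aggregate of the remaining slacks enters the eps-SVR objective.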

2. Popular Kernels

In what follows $x_i^Tx_j$ is the regular inner product in Euclidean space,

$||x_i - x_j|| = [(x_i - x_j)^T(x_i - x_j)]^{1/2}$

is the Euclidean norm of the difference and

$||x_i - x_j||_1 = \sum_{k=1,...,p} |x_{ik} - x_{jk}|$

is the $L_1$ norm of the difference. We will present the kernels most utilized in the area of support vector machines.

1] Linear kernel:

$K(x_i,x_j) = x_i^Tx_j.$

Used on classification problems where the classes are linearly separable (for the most part) and regression problems where the relationship between the dependent and independent variables is mostly linear. Also used, quite successfully, in situations where the number of dimensions is high and the data are sparse as a result. Popular in text mining.

2] Non-homogeneous polynomial kernel:

$K(x_i,x_j) = (x_i^Tx_j + c)^d.$

Here $c \geq 0$ and $d$ is a positive integer. A direct generalization of the Euclidean inner product and, therefore, relatively intuitive to interpret. Works well on certain image recognition problems.

3] Laplace radial basis function (RBF) kernel:

$K(x_i,x_j) = \exp\Bigl\{-\frac{||x_i - x_j||_1}{\gamma} \Bigr\},$

where $\gamma$ is referred to as the bandwidth. Suitable for wiggly and non-differentiable surfaces.

4] Gaussian radial basis function kernel:

$K(x_i,x_j) = \exp\Bigl\{-\frac{||x_i - x_j||^2}{2\sigma^2} \Bigr\},$

where $\sigma$ is the bandwidth. Works well on a variety of non-linear surfaces in high-dimensional problems. Despite its resemblance to the Laplace kernel, the Gaussian kernel oftentimes delivers completely different answers.

5] Exponential kernel:

$K(x_i,x_j) = \exp\Bigl\{-\frac{||x_i - x_j||}{\sigma} \Bigr\},$

where $\sigma$ is the bandwidth. Closely related to the Gaussian kernel, but notice the absence of the square in the exponent.

6] Hyperbolic tangent kernel (sigmoid kernel):

$K(x_i,x_j) = \tanh(a x_i^Tx_j + b).$

Borrowed from the field of neural networks. Strictly speaking, it is not non-negative definite for all choices of $a$ and $b$, but it is used in practice nonetheless. Works well on certain image recognition problems.

7] Thin plate spline kernel:

$K(x_i,x_j) = ||x_i - x_j||^2\log||x_i - x_j||.$

Works well on certain image recognition problems.

8] Spline kernel:

$K(x_i,x_j) = \prod_{k=1,...,p} \Bigl( 1 + x_{ik}x_{jk} + x_{ik}x_{jk}\min(x_{ik},x_{jk}) -$

$\ - \frac{x_{ik} + x_{jk}}{2}\min(x_{ik},x_{jk})^2 + \frac{1}{3}\min(x_{ik},x_{jk})^3 \Bigr).$

Tends to perform well on regression tasks.

9] ANOVA kernel:

$K(x_i,x_j) = \sum_{k=1,...,p} \Bigl(\exp\Bigl\{-\frac{(x_{ik} - x_{jk})^2}{\sigma^2} \Bigr\}\Bigr)^d,$

where $\sigma$ is the bandwidth. Tends to perform well on regression tasks.

10] Cauchy kernel:

$K(x_i,x_j) = \frac{1}{\pi} \frac{1}{1 + ||x_i - x_j||^2}.$
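Several of the kernels above take only a line each to implement (a sketch in Python with NumPy; the parameter values and random test points are illustrative, and non-negative definiteness is checked numerically via the smallest eigenvalue of a Gram matrix):

```python
import numpy as np

def gram(kernel, X):
    # Gram matrix G[i, j] = kernel(X_i, X_j)
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

linear  = lambda x, y: x @ y
poly    = lambda x, y, c=1.0, d=3: (x @ y + c) ** d
gauss   = lambda x, y, s=1.0: np.exp(-((x - y) @ (x - y)) / (2 * s ** 2))
laplace = lambda x, y, g=1.0: np.exp(-np.abs(x - y).sum() / g)

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 4))

for kern in (linear, poly, gauss, laplace):
    G = gram(kern, X)
    # A non-negative definite kernel yields eigenvalues >= 0 (up to rounding)
    print(np.linalg.eigvalsh(G).min() >= -1e-6)
```

The same eigenvalue check is a useful sanity test when composing new kernels from the rules below.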

For any collection of non-negative definite kernels $K_1, K_2, ..., K_\nu, ...$ and non-negative scalars $\alpha_1, \alpha_2, ..., \alpha_\nu, ...$ the following expressions are also non-negative definite kernels:

$K_1(x_i,x_j) + \alpha_1 ,$

$\alpha_1 K_1(x_i,x_j) + \alpha_2 K_2(x_i,x_j),$

$K_1(x_i,x_j) K_2(x_i,x_j),$

$\frac{K_1(x_i,x_j)}{(K_1(x_i,x_i) K_1(x_j,x_j))^{1/2}}\ \ \ \ {\rm{(if\ the\ denominator\ \neq\ 0)}},$

$\lim_{\nu\rightarrow\infty} K_\nu(x_i,x_j)\ \ \ \ {\rm{(if\ there\ is\ convergence)}},$

$\sum_{\nu \geq 1} \alpha_\nu \langle x_i, x_j \rangle^\nu\ \ \ \ {\rm{(if\ there\ is\ convergence)}},$

where $\langle x_i, x_j \rangle$ is the inner product in some Hilbert space $H$ (for example, it can be $x_i^Tx_j,$ the inner product in Euclidean space).
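These closure rules are easy to exercise numerically (a sketch in Python with NumPy; the Gaussian and linear kernels and the weights are illustrative choices, and the kernel product is the elementwise, or Schur, product of Gram matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))

def gram(kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

K1 = gram(lambda x, y: np.exp(-((x - y) @ (x - y)) / 2))  # Gaussian
K2 = gram(lambda x, y: x @ y)                             # linear

combos = {
    "shifted":  K1 + 0.5,             # K1 + alpha_1
    "weighted": 2.0 * K1 + 3.0 * K2,  # alpha_1 K1 + alpha_2 K2
    "product":  K1 * K2,              # elementwise (Schur) product
}
for name, K in combos.items():
    # Each combination remains non-negative definite
    print(name, np.linalg.eigvalsh(K).min() >= -1e-8)

# Normalized kernel K1 / sqrt(K1(x,x) K1(y,y)) has ones on the diagonal
d = np.sqrt(np.diag(K1))
K_norm = K1 / np.outer(d, d)
print(np.allclose(np.diag(K_norm), 1.0))
```

The normalization rule is particularly handy in practice: it rescales any kernel so that every point has unit "self-similarity", which puts heterogeneous features on a common footing.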

SUPPORT VECTOR MACHINES REFERENCES

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

Efron, B., & Hastie, T. (2017). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.

Barber, D. (2014). Bayesian Reasoning and Machine Learning. Cambridge University Press.

Hsieh, C., Chang, K., Lin, C., Keerthi, S. S., & Sundararajan, S. (2008). A Dual Coordinate Descent Method for Large-scale Linear SVM. Proceedings of the 25th International Conference on Machine Learning. ICML '08. New York, NY, USA: ACM. pp. 408–415.

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
