SUPPORT VECTOR MACHINE
1. Introduction
Support Vector Machine (SVM) is a machine learning method for regression and classification making use of two ideas: kernelization and slack variables. We will go over these ideas in detail in what follows. Suppose we have predictors $x = (x_1, \dots, x_p)$ and dependent variable $y$. In a classification setting $y$ is categorical, taking two or more possible values. In a regression setting $y$ is real-valued (continuous, discrete or a mixture of those two types). Any predictive solution, any rule telling us what $y$ is likely or expected to be given the $x$, is characterized by one or several multidimensional surfaces. In a classification setting the surfaces separate the $y$-classes in $\mathbb{R}^p$, which is the $p$-dimensional Euclidean space composed of possible values of $x$. In a regression setting there is one surface in $\mathbb{R}^{p+1}$ giving the expected value of $y$ conditional on particular values of $x$. Solving the predictive problem is equivalent to finding the aforementioned surface(s). If the distributional properties of the data are complex, the surfaces may be non-linear, wiggly, locally explosive, discontinuous or even harder to imagine. One thing for sure: they will not be characterized by linear functions in $x$.
A standard approach is augmenting the original list of predictors with various non-linear and, potentially, discontinuous transformations: $\varphi(x) = (\varphi_1(x), \varphi_2(x), \dots, \varphi_M(x))$. This new feature space is much larger ($M \gg p$).
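As a quick illustration of how fast the augmented feature space can grow, here is a minimal sketch using scikit-learn's PolynomialFeatures; the library choice, the degree and the data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: n = 5 observations of p = 2 predictors.
X = np.random.default_rng(0).normal(size=(5, 2))

# Degree-3 polynomial expansion: the feature space grows from p = 2
# columns to M = 9 columns (x1, x2, x1^2, x1*x2, x2^2, x1^3, ...).
phi = PolynomialFeatures(degree=3, include_bias=False)
Z = phi.fit_transform(X)
print(X.shape, "->", Z.shape)   # (5, 2) -> (5, 9)
```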
A linear surface in this space may correspond to something complex in the original space of predictors, something capable of capturing intricate dependencies containing interactions, non-linearities, localized behavior, and so on. Further, it has been noticed that, with the right choice of the loss function, a classification or regression problem leads to an optimization problem where the features $\varphi(x)$ enter the equations only as inner products of the form $\langle \varphi(x), \varphi(x') \rangle$. So we need to know only these values for the purpose of all calculations. Next, let us define $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$. By a property of the inner product, the bivariate function $K$ is a non-negative definite kernel, i.e. for any set of vectors $x_1, \dots, x_n$ and any scalars $c_1, \dots, c_n$

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0.$$

For some mappings $\varphi$ the kernel $K$ is nice and simple. For example, if $\varphi(x) = x$ (the identity mapping) then $K(x, x') = \langle x, x' \rangle$.
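This non-negative definiteness can be checked numerically on any finite sample: the Gram matrix with entries $K(x_i, x_j)$ must have no negative eigenvalues. A minimal sketch for the linear kernel, with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))      # twenty vectors x_i in R^3

# Gram matrix of the linear kernel K(x, x') = <x, x'>.
K = X @ X.T

# Non-negative definiteness: sum_i sum_j c_i c_j K(x_i, x_j) >= 0 for all c,
# which for a symmetric matrix is equivalent to non-negative eigenvalues.
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True up to round-off
```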
But can we always expect simplicity? We might get carried away with some interesting, non-trivial transformations. Yet the inner products $\langle \varphi(x), \varphi(x') \rangle$ must always be tractable. More importantly, we may break out into feature spaces of infinite dimensionality. Can we handle infinite sums of the form $\sum_{m=1}^{\infty} \varphi_m(x)\,\varphi_m(x')$? The solution lies in looking at the problem from the opposite end. We first identify a kernel $K(x, x')$ we want to work with and then worry if it corresponds to some non-linear set of features $\varphi(x)$. Our search is facilitated by the following, rather famous result.
Theorem (Mercer 1909, version for real-valued functions):
Suppose $K(s, t)$ is a continuous symmetric non-negative definite kernel on $[a, b] \times [a, b]$. Then there is an orthonormal basis $\{e_i\}_{i \ge 1}$ of the Hilbert space $L^2[a, b]$ and a sequence of non-negative eigenvalues $\{\lambda_i\}_{i \ge 1}$ such that

$$K(s, t) = \sum_{i=1}^{\infty} \lambda_i\, e_i(s)\, e_i(t),$$

and the eigenfunctions $e_i$ corresponding to non-zero eigenvalues $\lambda_i$ are continuous.
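The sketch below is a finite-sample analogue of the theorem rather than the theorem itself: on a grid of sampled points, the spectral decomposition of a Gaussian-kernel Gram matrix mirrors the expansion of $K(s, t)$ into non-negative eigenvalues and eigenfunctions (all parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(size=(50, 1))                 # sample points in [0, 1]

# Gram matrix of a Gaussian kernel evaluated on the sampled points.
sigma = 0.5
K = np.exp(-(s - s.T) ** 2 / (2 * sigma ** 2))

# The spectral decomposition K = sum_i lam_i e_i e_i^T is the discrete
# counterpart of K(s, t) = sum_i lam_i e_i(s) e_i(t); eigenvalues are >= 0.
lam, E = np.linalg.eigh(K)
print(np.allclose((E * lam) @ E.T, K), lam.min() > -1e-8)
```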
There are extensions to discontinuous cases as well. Mercer's theorem indicates that our choice of the kernel leads to an implicit choice of the feature space, yet we are not obliged to hear all the details. We are not obliged to work with the features $\varphi(x)$ directly. Rather, we evaluate the inner products $\langle \varphi(x), \varphi(x') \rangle$ as the algebraic form $K(x, x')$, which was chosen for its convenience and simplicity. This is known as the kernel trick. It lies at the foundation of the kernelization technique: taking good old statistical methods and changing the $\langle x_i, x_j \rangle$ terms in the solution to $K(x_i, x_j)$, which is equivalent to enlarging the universe onto the feature space corresponding to $K$ and performing all the calculations there.
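To make the kernel trick concrete, the following minimal sketch compares an explicit feature map with a direct kernel evaluation; the degree-2 homogeneous polynomial kernel on $\mathbb{R}^2$ is used purely as an example.

```python
import numpy as np

# Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), for which
# <phi(x), phi(x')> = <x, x'>^2 (a degree-2 polynomial kernel).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # inner product computed in the feature space
trick = (x @ z) ** 2         # kernel evaluated in the original space
print(explicit, trick)       # both equal 1.0
```

The two numbers coincide, and only the second computation needs to remain tractable when the feature space is very large or infinite-dimensional.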
Support vector machine is a particular optimization engine phrased in terms of kernels. In the classification setting, SVM aims at making the separating surfaces smooth but also penalizes data points that find themselves on the wrong side of, or too close to, the separating surface. For each such data point a slack variable is defined, measuring how far the point sits beyond its margin boundary. Certain aggregate functions of all the slacks are then added to the objective function and the constraints of the minimization problem. There are several ways of doing this, which gives rise to different flavors of SVM: C-SVC, nu-SVC, bound-constraint SVC, and so on.
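A minimal C-SVC sketch using scikit-learn; the dataset, kernel choice and parameter values are assumptions made purely for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A two-class problem that is not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# C-SVC with a Gaussian RBF kernel: C controls how heavily the slack
# variables (margin violations) are penalized relative to smoothness.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print(clf.score(X, y), len(clf.support_))   # accuracy, number of support vectors
```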
In the regression setting, SVM aims at making the predictive surface smooth but also penalizes data points that fall far from the surface. For each data point lying too far below the surface a negative slack is defined, and for each data point lying too far above the surface a positive slack is defined; in both cases the slack measures the distance to the surface beyond the allowed tolerance. Specific aggregate functions of the negative and positive slacks enter the objective function and the constraints of the minimization problem. The exact definitions of these functions determine the version of SVM: eps-SVR, nu-SVR, bound-constraint SVR or something else.
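An analogous eps-SVR sketch, again with illustrative data and parameter values:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 5.0, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# eps-SVR: points within the epsilon-tube around the predictive surface
# incur no penalty; slacks are defined only for points outside the tube.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(reg.score(X, y))   # in-sample R^2
```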
2. Popular Kernels
In what follows $\langle x, x' \rangle$ is the regular inner product in Euclidean space, $\lVert x - x' \rVert$ is the Euclidean norm of the difference and $\lVert x - x' \rVert_{1}$ is the $L^{1}$ norm of the difference. We will present the kernels which are most utilized in the area of support vector machines; a small computational sketch of several of them follows the list.
1] Linear kernel: $K(x, x') = \langle x, x' \rangle$.
Used on classification problems where the classes are linearly separable (for the most part) and regression problems where the relationship between the dependent and independent variables is mostly linear. Also used, quite successfully, in situations where the number of dimensions is high and the data are sparse as a result. Popular in text mining.
2] Non-homogeneous polynomial kernel: $K(x, x') = \left(\langle x, x' \rangle + c\right)^{d}$, with $c > 0$ and integer degree $d \ge 1$.
A direct generalization of Euclidean inner product and, therefore, relatively intuitively interpretable. Works well on certain image recognition problems.
3] Laplace radial basis function (RBF) kernel: $K(x, x') = \exp\!\left(-\lVert x - x' \rVert_{1} / \sigma\right)$, where $\sigma > 0$ is referred to as the bandwidth. Suitable for wiggly and non-differentiable surfaces.
4] Gaussian radial basis function kernel: $K(x, x') = \exp\!\left(-\lVert x - x' \rVert^{2} / (2\sigma^{2})\right)$, where $\sigma > 0$ is the bandwidth. Works well on a variety of non-linear surfaces in high-dimensional problems. Despite its resemblance to the Laplace kernel, the Gaussian kernel oftentimes delivers completely different answers.
5] Exponential kernel: $K(x, x') = \exp\!\left(-\lVert x - x' \rVert / (2\sigma^{2})\right)$, where $\sigma > 0$ is the bandwidth. Closely related to the Gaussian kernel but notice the absence of the square in the exponent.
6] Hyperbolic tangent kernel (sigmoid kernel): $K(x, x') = \tanh\!\left(\alpha \langle x, x' \rangle + c\right)$, with scale $\alpha$ and offset $c$.
Borrowed from the field of neural networks. Works well on certain image recognition problems.
7] Thin plate spline kernel: $K(x, x') = \lVert x - x' \rVert^{2} \ln \lVert x - x' \rVert$.
Works well on certain image recognition problems.
8] Spline kernel: $K(x, x') = \prod_{k=1}^{p} \left(1 + x_k x'_k + x_k x'_k \min(x_k, x'_k) - \frac{x_k + x'_k}{2} \min(x_k, x'_k)^{2} + \frac{1}{3} \min(x_k, x'_k)^{3}\right)$.
Tends to perform well on regression tasks.
9] ANOVA kernel: $K(x, x') = \sum_{k=1}^{p} \exp\!\left(-(x_k - x'_k)^{2} / (2\sigma^{2})\right)^{d}$, where $\sigma > 0$ is the bandwidth and $d$ is a degree parameter. Tends to perform well on regression tasks.
10] Cauchy kernel: $K(x, x') = \left(1 + \lVert x - x' \rVert^{2} / \sigma^{2}\right)^{-1}$, with bandwidth $\sigma > 0$.
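Below is the small computational sketch promised above: direct implementations of several of the listed kernels for a pair of vectors, with sigma playing the role of the bandwidth (function names and test values are illustrative).

```python
import numpy as np

# A few of the kernels listed above, evaluated for a pair of vectors x, z;
# sigma is the bandwidth, c and d are the polynomial kernel parameters.
def linear(x, z):            return x @ z
def polynomial(x, z, c, d):  return (x @ z + c) ** d
def laplace(x, z, sigma):    return np.exp(-np.abs(x - z).sum() / sigma)
def gaussian(x, z, sigma):   return np.exp(-((x - z) ** 2).sum() / (2 * sigma ** 2))
def cauchy(x, z, sigma):     return 1.0 / (1.0 + ((x - z) ** 2).sum() / sigma ** 2)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, z), polynomial(x, z, 1.0, 3), gaussian(x, z, 1.0))
```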
For any collection of non-negative definite kernels $K_1, \dots, K_m$ and non-negative scalars $a_1, \dots, a_m$ the following expressions are also non-negative definite kernels:

$$a_1 K_1(x, x') + \dots + a_m K_m(x, x'), \qquad K_1(x, x')\, K_2(x, x') \cdots K_m(x, x'), \qquad \langle \psi(x), \psi(x') \rangle_{H},$$

where $\langle \cdot, \cdot \rangle_{H}$ is the inner product in some Hilbert space $H$ (for example, it can be $\langle x, x' \rangle$, the inner product in Euclidean space) and $\psi$ is any mapping of the predictors into $H$.
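These closure properties can be sanity-checked numerically as well: Gram matrices built from the combined kernels should again have no negative eigenvalues. A minimal sketch with illustrative data and weights:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

K1 = X @ X.T              # Gram matrix of the linear kernel
K2 = np.exp(-D2 / 2.0)    # Gram matrix of a Gaussian kernel (sigma = 1)

# Weighted sums and elementwise products of non-negative definite Gram
# matrices remain non-negative definite, as claimed above.
for K in (2.0 * K1 + 0.5 * K2, K1 * K2):
    print(np.linalg.eigvalsh(K).min() >= -1e-8)
```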
SUPPORT VECTOR MACHINES REFERENCES
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
Scholkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Efron, B., & Hastie, T. (2017). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.
Barber, D. (2014). Bayesian Reasoning and Machine Learning. Cambridge University Press.
Hsieh, C., Chang, K., Lin, C., Keerthi, S. S., & Sundararajan, S. (2008). A Dual Coordinate Descent Method for Large-scale Linear SVM. Proceedings of the 25th International Conference on Machine Learning. ICML '08. New York, NY, USA: ACM. pp. 408–415.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.