Another approach involves maximizing the weight-vector's `safety margin', i.e., the distance between its decision hyperplane and the nearest (most nearly mis-classified) datapoint.

This gives us the so-called **maximum margin** classifier.
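As an illustrative sketch (the toy data and candidate weight vectors here are assumptions, not from the source), the geometric margin of a separating hyperplane can be computed as the distance from the hyperplane to its nearest datapoint; the max-margin classifier is the hyperplane that makes this quantity largest:

```python
import numpy as np

# Toy linearly separable data with labels +1/-1 (illustrative only).
X = np.array([[2.0, 2.0], [1.5, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])

def margin(w, b, X, y):
    # Geometric margin: signed distance from the hyperplane w.x + b = 0
    # to the closest (most nearly mis-classified) datapoint.
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Two candidate separating hyperplanes; the better one has the larger margin.
print(margin(np.array([1.0, 1.0]), 0.0, X, y))  # ~2.121
print(margin(np.array([1.0, 0.2]), 0.0, X, y))  # ~1.471
```

Both hyperplanes separate the data, but the first keeps every point further away, so it is the better max-margin candidate.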

We need to move to a non-linear solution, as we did in moving from delta-rule learning to MLPs.

Ideally, we'd like to map the data into a feature space in which we can form a separating hyperplane.

We'd like them to be non-linear functions (curved boundaries are needed).

But there are infinitely many such mappings to choose from.

One solution is to use the so-called **kernel trick**.

A kernel function maps pairs of datapoints onto the inner product of their images in a feature space (i.e., it works like a similarity function).
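For instance (a toy sketch; the quadratic kernel and 2D points are assumptions for illustration), a kernel can compute the feature-space inner product without ever constructing the feature vectors explicitly:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel k(x, z) = (x . z)**2
    # on 2D inputs: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # The kernel trick: this equals phi(x) . phi(z), but never builds phi.
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(k(x, z), phi(x) @ phi(z))  # 16.0 16.0 -- identical
```

For kernels like the Gaussian (RBF) kernel, the implicit feature space is infinite-dimensional, so this shortcut is essential rather than merely convenient.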

The resulting kernel (Gram) matrix has one entry for every pair of datapoints, whatever the dimensionality of the underlying feature space.

Constrained mathematical minimization (quadratic programming) can then be used to find the max-margin hyperplane in the feature space.

The effect is to identify a non-linear (curved) boundary in the original data space.

Manipulating points in the feature space then has the effect of `stretching' or `compressing' areas of the data space.

This can be a way of `pulling' differently classified datapoints apart, or `pushing' same-class points together.

But the practical value of these manipulations remains unclear at this stage.

Derivation of weights for a separating hyperplane may still be best done using iterative error-correction.
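A minimal sketch of iterative error-correction in a kernel-defined feature space is the dual-form (kernel) perceptron; the XOR data, RBF kernel, and parameter choices below are illustrative assumptions:

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    # Gaussian (RBF) kernel: an inner product in an implicit feature space.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel, epochs=20):
    # Dual-form perceptron: one coefficient per training point,
    # bumped each time that point is misclassified (error-correction).
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])
            pred = 1 if score > 0 else -1
            if pred != y[i]:
                alpha[i] += 1
    return alpha

# XOR: not separable by any hyperplane in the original 2D space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, y, rbf)
scores = [np.sum(alpha * y * np.array([rbf(x, z) for z in X])) for x in X]
preds = [1 if s > 0 else -1 for s in scores]
print(preds)  # [-1, 1, 1, -1]
```

Note this finds *a* separating boundary by error-correction, not the max-margin one; a true SVM would instead solve the quadratic program over the same kernel matrix.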

Another problem is the kernel function itself.

With primitive data (e.g., 2D points), good kernels are easy to come by.

With the forms of data we're often interested in (web pages, MRI scans etc.), finding a sensible kernel function may be much harder.

How would we go about defining a function that gives the distance between two web pages?
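One common (though crude) answer is to compare bag-of-words vectors; a minimal sketch, assuming plain-text pages and cosine dissimilarity (all example pages are invented for illustration):

```python
import math
import re
from collections import Counter

def bow(text):
    # Bag-of-words: a crude but standard representation of a page's text.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_distance(a, b):
    # 1 - cosine similarity between word-count vectors.
    va, vb = bow(a), bow(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / (na * nb)

page1 = "support vector machines find the maximum margin hyperplane"
page2 = "kernel methods and support vector machines"
page3 = "recipes for chocolate cake and biscuits"
print(cosine_distance(page1, page2) < cosine_distance(page1, page3))  # True
```

Even this simple measure raises the representation question: it ignores word order, links, and layout, all of which might matter for a given classification task.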

As usual, success depends on getting the problem into the right representation.

- Max-margin classifiers can be derived by minimization
- Kernel-based SVMs
- Complexity problems
- The difficulty of finding good kernel functions

- In what ways might we calculate the distance (dissimilarity) between web pages?
- In the SVM method, we distort the data space so as to enable simple (e.g., hyperplane-based) representation of the target function. Can the components of the distortion be viewed as genuine *features*?
- How is generalization performance likely to be affected, where the SVM produces a high degree of data-space distortion?

- www.kernel-machines.org
- www.support-vector.net
- Vapnik, V. (1995). THE NATURE OF STATISTICAL LEARNING THEORY. New
York: Springer.
- Vapnik, V. (2000). THE NATURE OF STATISTICAL LEARNING THEORY (2nd ed.). New York: Springer.
- Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. THEOR. PROBAB. APPL., 16, No. 2 (pp. 264-280).