Package org.biojava.stats.svm

Support Vector Machine classification and regression.

These classes are experimental. They implement support vector machines for classification and regression. Support vector machines are powerful artificial intelligence engines which construct hyperplanes in very high-dimensional feature spaces, using time and memory that scale with the number of training examples rather than with the size of the feature space.

A dot product <a, b> is equal to ||a|| * ||b|| * cos(theta), where ||a|| and ||b|| are scalars giving the lengths of a and b, * indicates ordinary multiplication, and theta is the angle between a and b. It should be clear from this that the dot product of two scalars is the value you would get from the normal notion of multiplication: for one-dimensional vectors theta is either 0 or 180 degrees, so cos(theta) contributes nothing but the sign.
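For example, with a = (3, 0) and b = (2, 2): <a, b> = 3*2 + 0*2 = 6, while ||a|| = 3, ||b|| = sqrt(8), theta = 45 degrees, and ||a|| * ||b|| * cos(theta) = 3 * sqrt(8) * (1 / sqrt(2)) = 6 as well.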

The equation of a plane is: W * x + b = 0, where W is a vector normal to the plane, x is any point within the plane and b is a scalar constant whose magnitude, when W has unit length, is the minimal distance of the plane from the origin. If we compute f(x) = W * x + b for an x that does not lie on the plane, then (again assuming a unit-length W) f(x) is the signed distance of x from the plane. If f(x) = y, where y is the correct label or value for x, then we have a working artificial intelligence engine. In reality, we find an f(x) that minimizes sum_i (||f(x_i) - y_i||) over the training examples (x_i, y_i).
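As a minimal sketch (the method and variable names here are illustrative, not part of this package), evaluating f(x) for a plane held as an explicit normal might look like:

    // Evaluate f(x) = W * x + b for an explicit normal W.
    // If W has unit length, the result is the signed distance of x from the plane.
    public static double planeDecision(double[] w, double b, double[] x) {
        double sum = b;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }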

The normal vector W can be represented as a sum of vectors such that sum_i (w_i) = W. We can further break each w_i down into two parts, a point (x_i) and a scaling factor (a_i), so that W = sum_i (a_i * x_i). It follows that f(x) can be written as sum_i (a_i * (x_i * x)) + b, where x_i * x is the dot product of x_i and x.
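Continuing the sketch above (the names are again illustrative), the same value can be computed without ever forming W explicitly, touching the training points only through dot products:

    // Evaluate f(x) = sum_i (a_i * (x_i * x)) + b.
    // This returns the same value as planeDecision() when W = sum_i (a_i * x_i),
    // but only ever uses the training points x_i through dot products.
    public static double dualDecision(double[] alpha, double[][] xs, double b, double[] x) {
        double sum = b;
        for (int i = 0; i < alpha.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < x.length; j++) {
                dot += xs[i][j] * x[j];
            }
            sum += alpha[i] * dot;
        }
        return sum;
    }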

It is unlikely that the observed values of your data are the best indicators upon which to classify the points. For example, when looking at heart attacks, you may measure body-mass, height and age, but the distinguishing variable may be some mixture of these three variables, and you forgot to add in the effects of stress. A support vector machine cannot tell you to go out and add a stress field to the data, but it can find the optimal mix of height, age and body-mass that pushes you over the edge into the high-risk category for heart disease. It does this by using a kernel function.

In the above situation, the data space was three-dimensional. However, the solution is embedded in a feature space containing all possible combinations of these three variables. We can define a mapping function M:X->F such that M(x) = f, where f is a point in the space of all possible combinations of the three variables.

To construct a plane in F-space, we must calculate the dot product of two points in that space. The expression <M(a), M(b)> can be re-written as k(a, b), where k implicitly maps the points a and b into F-space and returns the dot product there. To be useful to us, the function k must be expressible as k(W, x) = sum_i (a_i * k(x_i, x)) for suitable values of a_i and x_i. It so happens that to search through all combinations of our heart-attack indicators we can use k(a, b) = (s * a * b + c)^d, where s is a scaling factor, a * b is a dot product, c is a constant and d is a power. s and c let you tune the lengths of the unit vectors along each feature space axis relative to the input space. d is the number of input space dimensions that contribute to each feature space dimension. For example, d = 2 with c = 0 and s = 1 would produce a feature space of the form:


height*height, height*weight, height*age,
weight*height, weight*weight, weight*age,
age*height, age*weight, age*age

This is in effect the space input^d (the dimensions of the feature space are the d-fold product of the input space dimensions). The variable c mixes in all of the spaces of the form input^i for i = {0 .. d}. Points in this feature space represent the coefficients of all possible polynomials of order d. Since any polynomial can be constructed from a weighted sum of other polynomials, this particular kernel function is useful to us.
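A small self-contained check of this (none of it is part of the package API) is to compare the dot product in the explicit d = 2, c = 0, s = 1 feature space with the kernel shortcut:

    // For d = 2, c = 0, s = 1 the polynomial kernel (a * b)^2 equals the dot
    // product of the explicit feature vectors whose components are all
    // pairwise products a_i * a_j.
    public class PolyKernelCheck {

        static double dot(double[] a, double[] b) {
            double d = 0.0;
            for (int i = 0; i < a.length; i++) {
                d += a[i] * b[i];
            }
            return d;
        }

        static double explicitFeatureDot(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < a.length; j++) {
                    sum += (a[i] * a[j]) * (b[i] * b[j]);
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] p = { 1.8, 75.0, 40.0 };  // height, weight, age for one person
            double[] q = { 1.6, 60.0, 55.0 };  // and for another
            double viaKernel = Math.pow(dot(p, q), 2);
            double viaFeatureSpace = explicitFeatureDot(p, q);
            System.out.println(viaKernel + " == " + viaFeatureSpace); // equal up to rounding
        }
    }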

So, the equation for our classifier now becomes:

f(x) = sum_i (a_i * k(x_i, x)) + b
and the problem of training becomes setting a_i and b optimally, given training data points x_i, to minimize the cumulative error ||f(x_i) - y_i|| over all training examples.
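As a sketch of what this machine looks like at evaluation time (the Kernel interface and class below are illustrative stand-ins only, not the SVMKernel interface or classes shipped in this package), the decision function and the error being minimized are:

    // Illustrative stand-in for a kernel: the dot product in some feature space.
    interface Kernel {
        double k(double[] a, double[] b);
    }

    class KernelMachine {
        Kernel kernel;
        double[] alpha;           // the weights a_i
        double[][] supportPoints; // the points x_i
        double b;

        // f(x) = sum_i (a_i * k(x_i, x)) + b
        double f(double[] x) {
            double sum = b;
            for (int i = 0; i < alpha.length; i++) {
                sum += alpha[i] * kernel.k(supportPoints[i], x);
            }
            return sum;
        }

        // The cumulative error sum_i (||f(x_i) - y_i||) that training tries to minimize.
        double trainingError(double[][] xs, double[] ys) {
            double error = 0.0;
            for (int i = 0; i < xs.length; i++) {
                error += Math.abs(f(xs[i]) - ys[i]);
            }
            return error;
        }
    }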

It should be obvious that we can only explore the sub-space of the feature space that is spanned by linear combinations of the input data. This has the practical effect that although the feature space may be infinite dimensional, we explore the finite sub-space that the training points are embedded within. This guarantees that the model f(x) can never contain more information than the training examples x_i, y_i. Also, as the hyperplane chosen is the one with the largest margin on the training data, it is the one most likely to generalise well to new data, assuming that the new data is drawn from the same population as the training examples. In practice, usually only a small subset of the weights a_i need to be set to a value other than 0, so the solution lies in an even smaller subspace. The training examples that have a non-zero a_i are called the support vectors, as they support the hyperplane.

Many kernels, such as the polynomial kernel, calculate the dot-product in a feature space that is a combination of the input space dimensions, and do so by transforming the input space dot product in some way. Let us re-visit the polynomial kernel to illustrate one of the most striking points about kernel functions - that they can be nested to build very specific feature spaces.


kpoly(a, b | s, c, d) = (s * a * b + c)^d
but a * b is <a, b> in linear space. Rewriting using a kernel k gives:
kpoly(a, b | s, c, d, k) = (s * k(a, b) + c)^d
where k performs the dot product in the space whose dimensions the polynomial kernel then combines into order-d products.
In this way, kernels can be layered one on another, to scale, stretch and combine sub-spaces to produce the feature space that you believe is optimal for locating f(x).
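As a hedged sketch of this layering (reusing the illustrative Kernel stand-in from the snippet above; the real PolynomialKernel in this package may be organised differently), a polynomial kernel that wraps another kernel might look like:

    // A polynomial kernel over an arbitrary inner kernel: (s * k(a, b) + c)^d.
    class PolyOverKernel implements Kernel {
        private final Kernel inner;
        private final double s, c, d;

        PolyOverKernel(Kernel inner, double s, double c, double d) {
            this.inner = inner;
            this.s = s;
            this.c = c;
            this.d = d;
        }

        public double k(double[] a, double[] b) {
            return Math.pow(s * inner.k(a, b) + c, d);
        }
    }

Nesting one PolyOverKernel inside another gives a polynomial-over-polynomial feature space, and so on.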

The design philosophy for kernels is as follows. There are basic kernels that calculate a type-specific dot product. These will normally be found by looking for a static final method named Kernel of type SVMKernel within a class. Then there are permutation kernels, such as PolynomialKernel and RadialBaseKernel. These use a primary kernel to calculate the dot product in the natural space for the objects, and then perform some function on it that produces a result equivalent to doing the same dot product in a very much higher-dimensional feature space. To avoid needless recalculation, there are also caching kernels that only calculate the value of the kernel they wrap when needed, and retrieve the value from a cache otherwise.

As far as possible, kernels are JavaBeans.
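To make the wrapping, caching and bean conventions concrete, here is a hedged sketch in the same illustrative style as the earlier snippets (it reuses the stand-in Kernel interface from above and is not the caching class or SVMKernel interface actually shipped in this package):

    // Caches kernel evaluations so that the wrapped kernel is only consulted
    // for pairs it has not seen before.  Written bean-style: a no-argument
    // constructor plus a getter/setter pair for its single property.
    class CachingKernel implements Kernel {
        private Kernel wrapped;
        private final java.util.Map<String, Double> cache =
            new java.util.HashMap<String, Double>();

        public CachingKernel() {}

        public Kernel getWrapped() { return wrapped; }

        public void setWrapped(Kernel wrapped) {
            this.wrapped = wrapped;
            cache.clear();  // cached values are stale for a new kernel
        }

        public double k(double[] a, double[] b) {
            // The key is built from the point contents; a real implementation
            // would key on the objects themselves.
            String key = java.util.Arrays.toString(a) + "|" + java.util.Arrays.toString(b);
            Double hit = cache.get(key);
            if (hit == null) {
                hit = Double.valueOf(wrapped.k(a, b));
                cache.put(key, hit);
            }
            return hit.doubleValue();
        }
    }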