Precomputed Kernels with LibSVM in Python
I've been searching the net for ~3 hours but I couldn't find a solution yet. I want to give a precomputed kernel to libsvm and classify a dataset, but:
How can I generate a precomputed kernel? (for example, what is the basic precomputed kernel for Iris data?)
In the libsvm documentation, it is stated that:
For precomputed kernels, the first element of each instance must be the ID. For example,
samples = [[1, 0, 0, 0, 0], [2, 0, 1, 0, 1], [3, 0, 0, 1, 1], [4, 0, 1, 1, 2]] problem = svm_problem(labels, samples) param = svm_parameter(kernel_type=PRECOMPUTED)
What is a ID? There's no further details on that. Can I assign ID's sequentially?
Any libsvm help and an example of precomputed kernels really appreciated.
First of all, some background to kernels and SVMs...
If you want to pre-compute a kernel for n vectors (of any dimension), what need to do is calculate the kernel function between each pair of examples. The kernel function takes two vectors and gives a scalar, so you can think of a precomputed kernel as a nxn matrix of scalars. It's usually called the kernel matrix, or sometimes the Gram matrix.
There are many different kernels, the simplest is the linear kernel (also known as the dot product):
sum(x_i * y_i) for i in [1..N] where (x_1,...,x_N) (y_1,..,y_N) are vectors
Secondly, trying to answer your problem...
The documentation about precomputed kernels in libsvm is actually pretty good...
Assume the original training data has three four-feature instances and testing data has one instance: 15 1:1 2:1 3:1 4:1 45 2:3 4:3 25 3:1 15 1:1 3:1 If the linear kernel is used, we have the following new training/testing sets: 15 0:1 1:4 2:6 3:1 45 0:2 1:6 2:18 3:0 25 0:3 1:1 2:0 3:1 15 0:? 1:2 2:0 3:1
Each vector here in the second example is a row in the kernel matrix. The value at index zero is the ID value and it just seems to be a sequential count. The value at index 1 of the first vector is the value of the kernel function of the first vector from the first example with itself (i.e. (1x1)+(1x1)+(1x1)+(1x1) = 4), the second is the value of the kernel function of the first vector with the second (i.e. (1x3)+(1x3)=6). It follows on like that for the rest of the example. You can see in that the kernel matrix is symmetric, as it should be, because K(x,y) = K(y,x).
It's worth pointing out that the first set of vectors are represented in a sparse format (i.e. missing values are zero), but the kernel matrix isn't and shouldn't be sparse. I don't know why that is, it just seems to be a libsvm thing.
scikit-learn hides most of the details of libsvm when handling custom kernels. You can either just pass an arbitrary function as your kernel and it will compute the gram matrix for you or pass the precomputed Gram matrix of the kernel.
For the first one, the syntax is:
>>> from scikits.learn import svm >>> clf = svm.SVC(kernel=my_kernel)
where my_kernel is your kernel function, and then you can call clf.fit(X, y) and it will compute the kernel matrix for you. In the second case the syntax is:
>>> from scikits.learn import svm >>> clf = svm.SVC(kernel="precomputed")
And when you call clf.fit(X, y), X must be the matrix k(X, X), where k is your kernel. See also this example for more details:
Here is a simple two category 3 vector custom kernel input file that works correctly. I will explain the parts (though you should also see StompChicken's answer):
1 0:1 1:10 2:12 3:21 2 0:2 1:12 2:19 3:30 1 0:3 1:21 2:30 3:130
The first number on each line is which category it belongs to. The next entry on each line is of the form 0:n and it must be sequential, i.e. 0:1 on first entry 0:2 on second entry 0:3 on thrid entry
A possible reason for this is that libsvm returns values alpha_i that go with your vectors in the output file, but for precomputed kernels the vectors are not displayed (which could be truly huge) rather the index 0:n that went with that vector is shown to make your output easier to match up with your input. Especially since the output is not in the same order you put them in it is grouped by category. It is thus very useful for you when reading the input file to be able to match up libsvm's outputs with your own inputs to have those 0:n values. Here you can see the output
svm_type c_svc kernel_type precomputed nr_class 2 total_sv 3 rho -1.53951 label 1 2 nr_sv 2 1 SV 0.4126650675419768 0:1 0.03174528241667363 0:3 -0.4444103499586504 0:2
It is important to note that with precomputed kernels you cannot omit the zero entries like you can with all other kernels. They must be explicitly included.
I believe that the scikit-learn's python binding of libSVM should address the problem.
See the documentation at http://scikit-learn.sourceforge.net/modules/svm.html#kernel-functions , for more information.