Weird phenomenon with SVM: negative examples score higher
I am using VLFeat and LIBLINEAR for a two-class classification problem. The ratio of negative to positive examples in the training set, #(-)/#(+), is 35.01, and each feature vector has 3.6e5 dimensions. I have around 15000 examples.
I set the weight of positive examples to 35.01 and left the weight of negative examples at the default of 1. Even so, performance on the test set is extremely poor.
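For reference, the weighting described above can be sketched as follows. This is only an illustration: the function names are hypothetical, labels are assumed to be +1 and -1, and the option string assumes LIBLINEAR's command-line `-wi` per-class weight flag.

```python
def class_weights(labels):
    """Weight positives by #(-)/#(+) and leave negatives at 1."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = sum(1 for y in labels if y < 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {1: n_neg / n_pos, -1: 1.0}


def liblinear_weight_options(weights):
    """Format the weights as LIBLINEAR-style -wi options (assumed labels +1/-1)."""
    return "-w1 {:g} -w-1 {:g}".format(weights[1], weights[-1])


if __name__ == "__main__":
    toy_labels = [1] + [-1] * 35  # toy data, not the real training set
    w = class_weights(toy_labels)
    print(w, liblinear_weight_options(w))
```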
To find out why, I ran the trained model on the training examples themselves. The negative examples get slightly higher decision values than the positive ones, which is really weird. I have checked the input to make sure I did not mislabel the examples, and I have normalized the histogram vectors.
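The comparison I made can be sketched roughly as below. It assumes the decision values and the +1/-1 labels are available as plain lists; `score_summary` is a hypothetical helper name, not part of VLFeat or LIBLINEAR.

```python
from statistics import mean


def score_summary(scores, labels):
    """Compare mean decision values of positive and negative examples."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y < 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    mean_pos, mean_neg = mean(pos), mean(neg)
    return {
        "mean_pos": mean_pos,
        "mean_neg": mean_neg,
        # True when negatives score higher on average, as observed here.
        "inverted": mean_pos < mean_neg,
    }


if __name__ == "__main__":
    # Toy values for illustration only.
    print(score_summary([-0.2, 0.1, 0.3, 0.2], [1, -1, -1, -1]))
```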
Has anybody met this situation before?
Here are the parameters of the trained model. Values such as bias, regularizer and dualityGap look suspicious to me, because they are so small that precision could easily be lost.
model.info =
            solver: 'sdca'
            lambda: 0.0100
    biasMultiplier: 1
              bias: -1.6573e-14
         objective: 1.9439
       regularizer: 6.1651e-04
              loss: 1.9432
     dualObjective: 1.9439
          dualLoss: 1.9445
        dualityGap: -2.6645e-15
         iteration: 43868
             epoch: 2
       elapsedTime: 228.9374
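A rough sanity check on these numbers is sketched below. It rests on assumptions that are not confirmed anywhere in this thread: that the objective is the regularizer plus the loss, that the regularizer is (lambda / 2) * ||w||^2, and that the loss is a weighted hinge loss averaged over all examples, with weight 35.01 on positives and 1 on negatives.

```python
import sys

# Values copied from the model.info printout (rounded there to 4-5 digits).
lam = 0.01
objective = 1.9439
regularizer = 6.1651e-04
loss = 1.9432
duality_gap = -2.6645e-15
neg_pos_ratio = 35.01  # #(-)/#(+), also used as the positive-class weight

# Assumption: objective = regularizer + loss. Allow for rounding in the printout.
objective_is_consistent = abs(objective - (regularizer + loss)) < 2e-4

# The duality gap expressed in units of double-precision machine epsilon.
gap_in_eps = abs(duality_gap) / sys.float_info.epsilon

# Assumption: regularizer = (lambda / 2) * ||w||^2, so solve for ||w||^2.
w_norm_sq = 2.0 * regularizer / lam

# Assumption: weighted hinge loss averaged over examples. With w = 0 and b = 0
# each example contributes 1 times its weight, so the mean is 2r / (r + 1).
zero_model_loss = 2.0 * neg_pos_ratio / (neg_pos_ratio + 1.0)

if __name__ == "__main__":
    print("objective consistent:", objective_is_consistent)
    print("duality gap in machine epsilons:", gap_in_eps)
    print("implied ||w||^2:", w_norm_sq)
    print("loss of an all-zero model:", zero_model_loss, "reported loss:", loss)
```

If those assumptions hold, the tiny bias and dualityGap are at the level of floating-point rounding rather than a sign of lost accuracy, and the reported loss is only marginally below what an all-zero model would give. That would be consistent with decision values that barely separate the two classes, but it is a hint worth checking, not a diagnosis.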
One possibility is that LIBSVM treats the class of the first example in the data set as the positive class and the other class as the negative one. Since you have 35 times more negatives than positives, your first example may well be a negative, in which case your classes are being inverted. To check this, make sure that the first data point in the training set belongs to the positive class.
I checked the LIBLINEAR FAQ, and it seems the same thing happens in LIBLINEAR (which I am less familiar with): http://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html (search for "reversed").
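A minimal sketch of two ways to act on this, assuming the library really does treat the label of the first training example as the positive class. Both helper names are hypothetical and the data are assumed to be plain Python lists; adapt the idea to however your features and labels are stored.

```python
def move_positive_first(examples, labels, positive_label=1):
    """Reorder the training set so that a positive example comes first."""
    idx = labels.index(positive_label)  # raises ValueError if no positive exists
    order = [idx] + [i for i in range(len(labels)) if i != idx]
    return [examples[i] for i in order], [labels[i] for i in order]


def orient_scores(scores, first_label, positive_label=1):
    """Flip decision values if training started with a negative example."""
    if first_label == positive_label:
        return list(scores)
    return [-s for s in scores]


if __name__ == "__main__":
    # Toy data for illustration only.
    X = [[0.1], [0.2], [0.9]]
    y = [-1, -1, 1]
    X2, y2 = move_positive_first(X, y)
    print(y2)                                      # labels after reordering
    print(orient_scores([0.3, -0.5], first_label=y[0]))
```

Reordering before training avoids the ambiguity altogether; flipping the sign afterwards is only a quick way to test whether inversion explains the scores you are seeing.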