# Simple example using BernoulliNB (Naive Bayes classifier) in scikit-learn (Python) - cannot explain classification

Using scikit-learn 0.10

Why does the following trivial code snippet:

```python
import numpy as np
import sklearn
from sklearn.naive_bayes import BernoulliNB

print sklearn.__version__

X = np.array([ [1, 1, 1, 1, 1],
               [0, 0, 0, 0, 0] ])
print "X: ", X
Y = np.array([ 1, 2 ])
print "Y: ", Y

clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict( [0, 0, 0, 0, 0] )
```

print out an answer of "1"? Having trained the model with [0, 0, 0, 0, 0] => 2, I was expecting "2" as the answer.

And why does replacing Y with

```python
Y = np.array([ 3, 2 ])
```

give a different class, "2", as the answer (the correct one)? Isn't this just a class label?

Can someone shed some light on this?

By default, the smoothing parameter alpha is 1. As msw said, your training set is very small, and with that much smoothing no information from the training data survives. If you set alpha to a very small value, you should see the result you expected.
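A minimal sketch of that fix, using the same toy data as in the question (alpha=1e-10 here is just an arbitrarily small smoothing value, not a recommended setting):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 1, 1, 1, 1],
              [0, 0, 0, 0, 0]])
Y = np.array([1, 2])

# Nearly unsmoothed: the two training rows dominate the per-class
# feature probability estimates
clf = BernoulliNB(alpha=1e-10)
clf.fit(X, Y)

print(clf.predict(np.array([[0, 0, 0, 0, 0]])))  # -> [2]
```

With smoothing effectively disabled, P(feature=0 | class 2) is close to 1 while P(feature=0 | class 1) is close to 0, so the all-zeros query is assigned class 2.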

Your training set is too small, as can be shown by

```python
clf.predict_proba(X)
```

which yields

```
array([[ 0.5,  0.5],
       [ 0.5,  0.5]])
```

which shows that the classifier considers both classes equiprobable. Compare with the example in the documentation for BernoulliNB, for which predict_proba() yields:

```
array([[ 2.71828146,  1.00000008,  1.00000004,  1.00000002,  1.        ],
       [ 1.00000006,  2.7182802 ,  1.00000004,  1.00000042,  1.00000007],
       [ 1.00000003,  1.00000005,  2.71828149,  1.        ,  1.00000003],
       [ 1.00000371,  1.00000794,  1.00000008,  2.71824811,  1.00000068],
       [ 1.00000007,  1.0000028 ,  1.00000149,  2.71822455,  1.00001671],
       [ 1.        ,  1.00000007,  1.00000003,  1.00000027,  2.71828083]])
```

where I applied numpy.exp() to the results to make them more readable. Obviously, the probabilities are nowhere near equal, and the classifier correctly classifies its whole training set.
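To see the contrast concretely, here is a sketch along the lines of the documentation example (random binary features, one or two samples per class; the exact numbers will differ from the array above depending on the random seed):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Random binary data in the style of the BernoulliNB documentation example:
# 6 samples, 100 binary features, 5 classes
rng = np.random.RandomState(1)
X = rng.randint(2, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

clf = BernoulliNB()  # default alpha=1.0
clf.fit(X, Y)

proba = clf.predict_proba(X)
# With 100 features per sample, the default smoothing no longer swamps
# the data: each training row is assigned its own class with high probability
print(proba.round(3))
print(clf.predict(X))
```

The point is not the amount of smoothing alone but its size relative to the evidence: with 5 features and 2 samples, alpha=1 drowns everything; with 100 features it is a minor correction.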