Scikit grid search for KNN regression ValueError: Array contains NaN or infinity

I am trying to implement grid search to select the best parameters for KNN regression using scikit-learn. Specifically, this is what I am trying to do:

from sklearn import neighbors
from sklearn.grid_search import GridSearchCV

parameters = [{'weights': ['uniform', 'distance'], 'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
clf = GridSearchCV(neighbors.KNeighborsRegressor(), parameters)
clf.fit(features, rewards)

Unfortunately I am getting ValueError: Array contains NaN or infinity.

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y, **params)
705                           " The params argument will be removed in 0.15.",
706                           DeprecationWarning)
--> 707         return self._fit(X, y, ParameterGrid(self.param_grid))
708 
709 

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
491                     X, y, base_estimator, parameters, train, test,
492                     self.scorer_, self.verbose, **self.fit_params)
--> 493                 for parameters in parameter_iterable
494                 for train, test in cv)
495 

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
515         try:
516             for function, args, kwargs in iterable:
--> 517                 self.dispatch(function, args, kwargs)
518 
519             self.retrieve()

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
310         """
311         if self._pool is None:
--> 312             job = ImmediateApply(func, args, kwargs)
313             index = len(self._jobs)
314             if not _verbosity_filter(index, self.verbose):

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
134         # Don't delay the application, to avoid keeping the input
135         # arguments in memory
--> 136         self.results = func(*args, **kwargs)
137 
138     def get(self):

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit_grid_point(X, y, base_estimator, parameters, train, test, scorer, verbose, loss_func, **fit_params)
309             this_score = scorer(clf, X_test, y_test)
310         else:
--> 311             this_score = clf.score(X_test, y_test)
312     else:
313         clf.fit(X_train, **fit_params)

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
320 
321         from .metrics import r2_score
--> 322         return r2_score(y, self.predict(X))
323 
324 

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in r2_score(y_true, y_pred)
2181 
2182     """
-> 2183     y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
2184 
2185     if len(y_true) == 1:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_reg_targets(y_true, y_pred)
 59         Estimated target values.
 60     """
---> 61     y_true, y_pred = check_arrays(y_true, y_pred)
 62 
 63     if y_true.ndim == 1:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
231                 else:
232                     array = np.asarray(array, dtype=dtype)
--> 233                 _assert_all_finite(array)
234 
235         if copy and array is array_orig:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
 25     if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
 26             and not np.isfinite(X).all()):
---> 27         raise ValueError("Array contains NaN or infinity.")
 28 
 29 

ValueError: Array contains NaN or infinity.

Based on this post, I have already tried calling fit with the following line instead of the one above:

clf.fit(np.asarray(features).astype(float), np.asarray(rewards).astype(float))

Then, based on this post, I even tried this:

import numpy as np
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(np.asarray(features).astype(float))
transformed_features = scaler.transform(np.asarray(features).astype(float))
clf.fit(transformed_features, rewards)

Unfortunately, neither worked. Does anybody have an idea where the problem might be and how I can make my code work?

Thank you very much in advance.

EDIT:

I have found that I do not get this error when I use only the following parameters:

parameters = [{'weights': ['uniform'], 'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]

So it seems the problem occurs when weights='distance'. Does anybody have an idea why?

One more problem related to this has appeared, which I am asking about here.

EDIT 2:

If I run my code with logging set to debug, I get the following warning:

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/neighbors/regression.py:160: RuntimeWarning: invalid value encountered in divide
y_pred[:, j] = num / denom 

So there is clearly a division by zero happening. My question is: why does scikit-learn divide by zero on line 160 of regression.py?
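
For what it's worth, I can reproduce the same kind of warning with plain numpy when one of the neighbor distances is zero, which is what I assume the num / denom on that line computes with weights='distance':

import numpy as np

# A zero distance gives an infinite inverse-distance weight,
# and the weighted average then evaluates inf / inf = nan.
dist = np.array([0.0, 0.5, 1.0])
weights = 1.0 / dist                    # [inf, 2., 1.], divide warning
num = np.sum(weights * np.array([1., 2., 3.]))
denom = np.sum(weights)
print(num / denom)                      # nan -> "invalid value" warning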

Answers


In addition to what you have tried, you can also see whether the following helps:

import numpy as np
features = np.nan_to_num(features)
rewards = np.nan_to_num(rewards)

This replaces NaN with 0 and infinities with very large finite numbers, and should at least make your algorithm run, unless the error occurs somewhere internal to the algorithm. Make sure there aren't too many non-finite entries in your data, as replacing them all may introduce strange biases into your estimates.
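
Before zeroing anything out, it may help to count how many non-finite entries there actually are (a small sketch, assuming features and rewards are the numeric arrays from the question):

import numpy as np

# Count NaN/inf entries so you know how much data nan_to_num will alter.
f = np.asarray(features, dtype=float)
r = np.asarray(rewards, dtype=float)
print(np.sum(~np.isfinite(f)), "non-finite entries in features")
print(np.sum(~np.isfinite(r)), "non-finite entries in rewards")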

If this is not the case and you are using weights='distance', then check whether any of the training samples are identical: duplicate points are at zero distance from each other, which causes a division by zero in the inverse-distance weighting (a quick check is sketched below).
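
One way to check for exact duplicate rows, assuming features is a 2-D array (note that the axis argument of np.unique requires NumPy >= 1.13):

import numpy as np

# Compare the number of unique rows with the total number of rows.
X = np.asarray(features, dtype=float)
n_dupes = len(X) - len(np.unique(X, axis=0))
print("%d duplicate samples found" % n_dupes)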

If inverse distances are the cause of the division by zero, you can circumvent it by supplying your own weight function, e.g.

def better_inv_dist(dist):
    # Shift the distance by a positive constant so the weight
    # stays finite even at zero distance.
    c = 1.
    return 1. / (c + dist)

and then use weights=better_inv_dist. You may need to adapt the constant c to the right scale of your data. In any case it avoids division by zero as long as c > 0.
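
For instance, the callable can go straight into the parameter grid, since scikit-learn accepts a callable for weights (a sketch reusing features and rewards from the question):

from sklearn import neighbors
from sklearn.grid_search import GridSearchCV

parameters = [{'weights': ['uniform', better_inv_dist],
               'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
clf = GridSearchCV(neighbors.KNeighborsRegressor(), parameters)
clf.fit(features, rewards)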


I ran into the same problem with KNN regression in scikit-learn. I was using weights='distance', and that led to infinite values while computing the predictions (but not while fitting the KNN model, i.e. building the KD tree or ball tree). I switched to weights='uniform' and the program ran to completion, confirming that the supplied weight function was the problem. If you want distance-based weights, supply a custom weight function that doesn't explode to infinity at zero distance, as shown in eickenberg's answer.

