# Scikit grid search for KNN regression ValueError: Array contains NaN or infinity

I am trying to implement grid search for selecting the best parameters for KNN regression using scikit-learn. Specifically, this is what I am trying to do:

```python
parameters = [{'weights': ['uniform', 'distance'],
               'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
clf = GridSearchCV(neighbors.KNeighborsRegressor(), parameters)
clf.fit(features, rewards)
```

Unfortunately, I am getting `ValueError: Array contains NaN or infinity.`:

```
/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y, **params)
    705                           " The params argument will be removed in 0.15.",
    706                           DeprecationWarning)
--> 707         return self._fit(X, y, ParameterGrid(self.param_grid))
    708
    709

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
    491                 X, y, base_estimator, parameters, train, test,
    492                 self.scorer_, self.verbose, **self.fit_params)
--> 493             for parameters in parameter_iterable
    494             for train, test in cv)
    495

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    515         try:
    516             for function, args, kwargs in iterable:
--> 517                 self.dispatch(function, args, kwargs)
    518
    519             self.retrieve()

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
    310         """
    311         if self._pool is None:
--> 312             job = ImmediateApply(func, args, kwargs)
    313             index = len(self._jobs)
    314             if not _verbosity_filter(index, self.verbose):

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
    134         # Don't delay the application, to avoid keeping the input
    135         # arguments in memory
--> 136         self.results = func(*args, **kwargs)
    137
    138     def get(self):

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit_grid_point(X, y, base_estimator, parameters, train, test, scorer, verbose, loss_func, **fit_params)
    309             this_score = scorer(clf, X_test, y_test)
    310         else:
--> 311             this_score = clf.score(X_test, y_test)
    312     else:
    313         clf.fit(X_train, **fit_params)

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    320
    321         from .metrics import r2_score
--> 322         return r2_score(y, self.predict(X))
    323
    324

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in r2_score(y_true, y_pred)
   2181
   2182     """
-> 2183     y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
   2184
   2185     if len(y_true) == 1:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.pyc in _check_reg_targets(y_true, y_pred)
     59         Estimated target values.
     60     """
---> 61     y_true, y_pred = check_arrays(y_true, y_pred)
     62
     63     if y_true.ndim == 1:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
    231         else:
    232             array = np.asarray(array, dtype=dtype)
--> 233         _assert_all_finite(array)
    234
    235     if copy and array is array_orig:

/Users/zikesjan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
     25     if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
     26             and not np.isfinite(X).all()):
---> 27         raise ValueError("Array contains NaN or infinity.")
     28
     29

ValueError: Array contains NaN or infinity.
```

Based on this post, I have already tried calling `fit` with the following line instead of the one above:

```python
clf.fit(np.asarray(features).astype(float), np.asarray(rewards).astype(float))
```

Then, based on this post, I even tried this:

```python
scaler = preprocessing.StandardScaler().fit(np.asarray(features).astype(float))
transformed_features = scaler.transform(np.asarray(features).astype(float))
clf.fit(transformed_features, rewards)
```

But unfortunately, without any success. So I would like to ask if anybody has an idea where the problem might be and how I can make my code work.

Thank you very much in advance.

**EDIT:**

I have found out that I do not get this error when I use only the following parameters:

```python
parameters = [{'weights': ['uniform'],
               'n_neighbors': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}]
```

So it seems the problem occurs only when `weights='distance'`. Does anybody have an idea why?

One more related problem has appeared, which I am asking about here.

**EDIT 2:**

If I run my code with logging set to debug, I get the following warning:

```
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/neighbors/regression.py:160: RuntimeWarning: invalid value encountered in divide
  y_pred[:, j] = num / denom
```

So there is clearly a problem with division by zero. My question is: why does scikit-learn divide by 0 on line 160 of regression.py?
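For reference, here is a small NumPy sketch (my own reconstruction of the weighted mean in `regression.py`, not sklearn's actual source) showing how an inverse-distance weighted prediction turns into NaN when one neighbor sits at distance zero:

```python
import numpy as np

# Toy neighbor distances and targets; the first neighbor is an exact
# duplicate of the query point (distance 0).
dists = np.array([0.0, 1.0, 2.0])
ys = np.array([1.0, 2.0, 3.0])

with np.errstate(divide='ignore', invalid='ignore'):
    num_weights = 1.0 / dists                            # inf at distance 0
    pred = (num_weights * ys).sum() / num_weights.sum()  # inf / inf -> nan

print(num_weights[0], pred)  # inf nan
```

This is exactly the `num / denom` pattern from the warning: both numerator and denominator become infinite, and their ratio is NaN.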

## Answers

In addition to what you have tried, you can also see whether this helps:

```python
import numpy as np

features = np.nan_to_num(features)
rewards = np.nan_to_num(rewards)
```

This sets all NaN values in your arrays to 0 (and clips infinities to large finite values), and should at least make your algorithm run, unless the error occurs somewhere internal to the algorithm. Make sure there aren't too many non-numeric entries in your data, as setting them all to 0 may introduce strange biases in your estimates.
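A quick sketch, on a toy array standing in for your `features`, of how to locate the bad entries before deciding whether `np.nan_to_num` is safe to apply:

```python
import numpy as np

# Illustrative data only: one NaN and one infinity.
X = np.array([[1.0, np.nan],
              [np.inf, 2.0]])

print(np.isnan(X).sum(), np.isinf(X).sum())  # count the bad entries first
X_clean = np.nan_to_num(X)                   # nan -> 0, +/-inf -> large finite
print(np.isfinite(X_clean).all())            # all entries are now finite
```

If the counts are a large fraction of your data, imputation or dropping rows is probably a better idea than zeroing everything.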

If this is not the case, and you are using `weights='distance'`, then check whether any of the training samples are identical: this causes a division by zero in the inverse distance.
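One way to check for identical samples is to count duplicate rows in the feature matrix, e.g. with `np.unique(..., axis=0)` (available in NumPy 1.13+; the data below is illustrative):

```python
import numpy as np

# Toy feature matrix in which the first and third rows are identical.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [1.0, 2.0]])

n_duplicates = len(X) - len(np.unique(X, axis=0))
print(n_duplicates)  # 1
```

Any nonzero count means some pairs of samples are at distance zero from each other, which is exactly the situation that breaks inverse-distance weighting.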

If inverse distances are the cause of division by zero, you can circumvent this by using your own distance function, e.g.

```python
def better_inv_dist(dist):
    c = 1.
    return 1. / (c + dist)
```

and then use `'weights': better_inv_dist`. You may need to adapt the constant `c` to the scale of your distances. In any case, it avoids division by zero as long as `c > 0`.
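A minimal sketch of plugging the custom weight function into the regressor, on toy data that contains a duplicated sample (I use the modern `sklearn.neighbors` import path; in the original grid search you would list the callable in the parameter grid as `'weights': ['uniform', better_inv_dist]`):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def better_inv_dist(dist):
    c = 1.
    return 1. / (c + dist)  # finite even when dist == 0

# Toy data: the first two samples are identical, which would break
# plain inverse-distance weighting.
X = np.array([[0.0], [0.0], [1.0], [2.0]])
y = np.array([1.0, 1.0, 2.0, 3.0])

knn = KNeighborsRegressor(n_neighbors=2, weights=better_inv_dist)
knn.fit(X, y)
pred = knn.predict(np.array([[0.0]]))
print(np.isfinite(pred).all())  # True: no NaN despite the duplicate samples
```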

I ran into the same problem with KNN regression on scikit-learn. I was using `weights='distance'`, which led to infinite values while computing the predictions (but not while fitting the KNN model, i.e. while building the KD-tree or ball tree). I switched to `weights='uniform'` and the program ran to completion, confirming that the supplied weight function was the problem. If you want to use distance-based weights, supply a custom weight function that doesn't explode to infinity at zero distance, as shown in eickenberg's answer.