I am training an RBM from sklearn 0.14 on a text corpus stored in scipy sparse format. Fitting runs for a while (a few minutes), then aborts with this error:
IndexError: index out of bounds: 0 <= 199740 <= 199745, 0 <= 199750 <= 199745, 199740 <= 199750
(Edit 1:) The code is:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM

# binary has to be true for the RBM to work
cv = CountVectorizer(stop_words='english', min_df=MIN_DF, lowercase=True, ngram_range=NGRAM_RANGE, binary=True)
print("Training...")
documents_transformed = cv.fit_transform(documents)
rbm = BernoulliRBM(n_components=N_COMPONENTS, learning_rate=LEARNING_RATE)
rbm.fit(documents_transformed)
documents_rbm_transformed = rbm.transform(documents_transformed)
The full traceback is:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
202 else:
203 filename = fname
--> 204 __builtin__.execfile(filename, *where)
/clustering/cluster6.py in <module>()
30 documents_transformed = cv.fit_transform(documents)
31 rbm = BernoulliRBM(n_components=N_COMPONENTS, learning_rate=LEARNING_RATE)
---> 32 rbm.fit(documents_transformed)
33
34 documents_rbm_transformed = rbm.transform(documents_transformed)
/anaconda/lib/python2.7/site-packages/sklearn/neural_network/rbm.pyc in fit(self, X, y)
304
305 for batch_slice in batch_slices:
--> 306 pl_batch = self._fit(X[batch_slice], rng)
307
308 if verbose:
/anaconda/lib/python2.7/site-packages/scipy/sparse/csc.pyc in __getitem__(self, key)
148 if (isinstance(row, slice) or isinstance(col, slice) or
149 isintlike(row) or isintlike(col)):
--> 150 return self.T[col, row].T
151 # Things that return a sequence of values.
152 else:
/anaconda/lib/python2.7/site-packages/scipy/sparse/csr.pyc in __getitem__(self, key)
246 row.step in (1, None))):
247 # col is int or slice with step 1, row is slice with step 1.
--> 248 return self._get_submatrix(row, col)
249 elif issequence(col):
250 P = extractor(col,self.shape[1]).T # [1:2,[1,2]]
/anaconda/lib/python2.7/site-packages/scipy/sparse/csr.pyc in _get_submatrix(self, row_slice, col_slice)
399 j0, j1 = process_slice(col_slice, N)
400 check_bounds(i0, i1, M)
--> 401 check_bounds(j0, j1, N)
402
403 indptr, indices, data = get_csr_submatrix(M, N,
/anaconda/lib/python2.7/site-packages/scipy/sparse/csr.pyc in check_bounds(i0, i1, num)
394 "index out of bounds: 0 <= %d <= %d, 0 <= %d <= %d,"
395 " %d <= %d" %
--> 396 (i0, num, i1, num, i0, i1))
397
398 i0, i1 = process_slice(row_slice, M)
IndexError: index out of bounds: 0 <= 199740 <= 199745, 0 <= 199750 <= 199745, 199740 <= 199750
There are 199745 training examples, and I am not sure why the index would go beyond that range. Why does this happen, and how should I fix it?
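One way to read the numbers in the error, offered as a guess rather than a confirmed diagnosis: the failing slice 199740:199750 is ten rows wide, and 199745 is not a multiple of ten, so a final fixed-size batch would end five rows past the data. The sketch below assumes mini-batches are cut as consecutive fixed-size slices. `batch_slices` and its `clip` flag are hypothetical names used for illustration, not sklearn functions.

```python
import math


def batch_slices(n_samples, batch_size, clip=False):
    """Cut range(n_samples) into consecutive slices of batch_size rows.

    Hypothetical helper for illustration only. With clip=False the last
    slice may extend past n_samples; with clip=True it is cut at n_samples.
    """
    n_batches = math.ceil(n_samples / batch_size)
    slices = []
    for i in range(n_batches):
        start = i * batch_size
        stop = start + batch_size
        if clip:
            stop = min(stop, n_samples)
        slices.append(slice(start, stop))
    return slices


# Numbers taken from the error message; a batch size of 10 is an assumption
# inferred from the width of the failing slice.
unclipped_last = batch_slices(199745, 10)[-1]
clipped_last = batch_slices(199745, 10, clip=True)[-1]
print(unclipped_last)  # slice(199740, 199750, None)
print(clipped_last)    # slice(199740, 199745, None)
```

If that reading is right, a batch size that divides the row count evenly (199745 = 5 × 39949, set through BernoulliRBM's `batch_size` parameter), or dropping the trailing rows before fitting, should keep every slice in range. This has not been tested against sklearn 0.14.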