
I am training an RBM from sklearn 0.14 on a text corpus stored in scipy sparse format. Fitting runs for a while (a few minutes), then aborts with this error:

IndexError: index out of bounds: 0 <= 199740 <= 199745, 0 <= 199750 <= 199745, 199740 <= 199750

(Edit 1:) The code is:

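from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM

# binary=True because BernoulliRBM expects binary (0/1) features
cv = CountVectorizer(stop_words='english', min_df=MIN_DF, lowercase=True, ngram_range=NGRAM_RANGE, binary=True)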

print("Training...")
documents_transformed = cv.fit_transform(documents)
rbm = BernoulliRBM(n_components=N_COMPONENTS, learning_rate=LEARNING_RATE)
rbm.fit(documents_transformed)

documents_rbm_transformed = rbm.transform(documents_transformed)

The full traceback is:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
202             else:
203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/clustering/cluster6.py in <module>()
 30 documents_transformed = cv.fit_transform(documents)
 31 rbm = BernoulliRBM(n_components=N_COMPONENTS, learning_rate=LEARNING_RATE)
---> 32 rbm.fit(documents_transformed)
 33 
 34 documents_rbm_transformed = rbm.transform(documents_transformed)

/anaconda/lib/python2.7/site-packages/sklearn/neural_network/rbm.pyc in fit(self, X, y)
304 
305             for batch_slice in batch_slices:
--> 306                 pl_batch = self._fit(X[batch_slice], rng)
307 
308                 if verbose:

/anaconda/lib/python2.7/site-packages/scipy/sparse/csc.pyc in __getitem__(self, key)
148         if (isinstance(row, slice) or isinstance(col, slice) or
149             isintlike(row) or isintlike(col)):
--> 150             return self.T[col, row].T
151         # Things that return a sequence of values.
152         else:

/anaconda/lib/python2.7/site-packages/scipy/sparse/csr.pyc in __getitem__(self, key)
246                      row.step in (1, None))):
247                 # col is int or slice with step 1, row is slice with step 1.
--> 248                 return self._get_submatrix(row, col)
249             elif issequence(col):
250                 P = extractor(col,self.shape[1]).T        # [1:2,[1,2]]

/anaconda/lib/python2.7/site-packages/scipy/sparse/csr.pyc in _get_submatrix(self, row_slice, col_slice)
399         j0, j1 = process_slice(col_slice, N)
400         check_bounds(i0, i1, M)
--> 401         check_bounds(j0, j1, N)
402 
403         indptr, indices, data = get_csr_submatrix(M, N,

/anaconda/lib/python2.7/site-packages/scipy/sparse/csr.pyc in check_bounds(i0, i1, num)
394                       "index out of bounds: 0 <= %d <= %d, 0 <= %d <= %d,"
395                        " %d <= %d" %
--> 396                       (i0, num, i1, num, i0, i1))
397 
398         i0, i1 = process_slice(row_slice, M)

IndexError: index out of bounds: 0 <= 199740 <= 199745, 0 <= 199750 <= 199745, 199740 <= 199750

There are 199745 training examples, and I'm not sure why the slice goes past that bound. Why does this happen, and how can I fix it?
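To make the numbers in the error message concrete, here is a minimal sketch of the arithmetic, not the actual sklearn code. It assumes the fit loop cuts the rows into equal-width batches without clamping the last one, and that batch_size is 10 (which I believe is the BernoulliRBM default, since I did not set it). The helper even_batch_slices is hypothetical and exists only for illustration.

```python
import math


def even_batch_slices(n_samples, batch_size):
    """Hypothetical helper: cut range(n_samples) into ceil(n_samples / batch_size)
    equal-width slices, WITHOUT clamping the last slice to n_samples."""
    n_batches = int(math.ceil(n_samples / float(batch_size)))
    return [slice(i * batch_size, (i + 1) * batch_size) for i in range(n_batches)]


n_samples = 199745   # number of rows, taken from the error message
batch_size = 10      # assumption: default batch size, not set in my code

slices = even_batch_slices(n_samples, batch_size)
last = slices[-1]

# How far the final slice runs past the end of the matrix
overshoot = last.stop - n_samples

print("last batch slice: %d:%d" % (last.start, last.stop))
print("rows available:   %d" % n_samples)
print("overshoot:        %d" % overshoot)
```

Under those assumptions the final batch is 199740:199750, which matches the indices in the IndexError. On a plain Python list or a NumPy array such a slice would be silently truncated, so the failure appears to come from the bounds check in the scipy sparse slicing code shown in the traceback.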
