python - Return the indices of rows that have non-zero entries for a specific feature in scikit-learn's CountVectorizer

Question

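I have been searching the documentation for Python's sklearn package.

I created a CountVectorizer object and fit and transformed my corpus with it.

I am looking for a function that, for a given column, returns the indices of all rows that have a non-zero entry in that column.

So, if the rows of my CountVectorizer matrix are music reviews and the columns are features (for example, one column counts the word "lyrics"), is there a function in scikit-learn that returns the indices of the music reviews containing that word?

I looked at the inverse_transform(X) function, but it does not do this.

I suspect I am not the first person to be interested in this.

Does such a function exist in sklearn and, if not, has anyone else interested in something similar come up with a good way to implement it?

Thanks in advance.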

Update:

My best solution so far involves iterating over the columns (in my case, I have 100 features):

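# X is the sparse matrix returned by fit_transform
for i in range(100):
    # print the indices stored for the i-th indptr segment
    print(X.indices[X.indptr[i]:X.indptr[i+1]])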

But this seems wasteful: it is iterative, the range has to be hard-coded, and it returns empty lists for sparse columns.

score 2 · Accepted Answer

I don't see a function in the documentation that does this either, but this should work for you:

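def lookUpWord(vec, dtm, word):
    # vocabulary_ maps each term to its column index in the document-term matrix
    # (get_feature_names() was removed in recent scikit-learn releases)
    i = vec.vocabulary_[word]
    # nonzero() returns (row_indices, col_indices); keep only the rows
    return dtm[:, i].nonzero()[0]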

Here is a simple example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> 
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?'
...     ]
>>> 
>>> X = CountVectorizer()
>>> Y = X.fit_transform(corpus)
>>> lookUpWord(X,Y,'first')
array([0, 3], dtype=int32)
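
If you need the row indices for every feature rather than for one word at a time, one possible approach is to convert the matrix to CSC format, where the indices/indptr slicing from the question's update really does walk over columns. The following is only a sketch under that assumption, not something scikit-learn provides: rows_by_word is a hypothetical helper name, and the loop bound comes from vocabulary_ instead of being hard-coded.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
dtm = vec.fit_transform(corpus)  # sparse CSR matrix, shape (n_documents, n_features)

# In CSC format, indices[indptr[j]:indptr[j + 1]] holds the row indices
# of the non-zero entries in column j.
dtm_csc = dtm.tocsc()
dtm_csc.sort_indices()  # make sure row indices within each column are ascending


def rows_by_word(vec, dtm_csc):
    """Hypothetical helper: map each vocabulary term to the rows containing it."""
    return {
        word: dtm_csc.indices[dtm_csc.indptr[j]:dtm_csc.indptr[j + 1]]
        for word, j in vec.vocabulary_.items()
    }


rows = rows_by_word(vec, dtm_csc)
print(rows['first'])   # rows 0 and 3 contain "first"
print(rows['second'])  # only row 1 contains "second"
```

The conversion is done once up front, so each lookup afterwards is just a slice. A word that is not in the vocabulary raises a KeyError here; use rows.get(word) if you would rather get None back.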
