python - Counting with scipy.sparse

Question

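I am using the Python sklearn library. I have more than 150,000 sentences.

I need an array-like object in which each row represents a sentence, each column corresponds to a word, and each element is the number of times that word occurs in that sentence.

For example: if the two sentences are "The dog ran" and "The boy ran", I need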

[ [1, 1, 1, 0]
, [0, 1, 1, 1] ]

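(The order of the columns does not matter; it depends on which column is assigned to which word.)

My array will be sparse (each sentence contains only a small fraction of the possible words), so I am using scipy.sparse.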

def word_counts(texts, word_map):
    w_counts = sp.???_matrix((len(texts),len(word_map)))

    for n in range(0,len(texts)-1):
        for word in re.findall(r"[\w']+", texts[n]):
            index = word_map.get(word)
            if index != None:
                w_counts[n,index] += 1
    return w_counts

...
nb = MultinomialNB() #from sklearn
words = features.word_list(texts)
nb.fit(features.word_counts(texts,words), classes)

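I would like to know which sparse matrix format is best here.

I tried using coo_matrix but got an error:

TypeError: 'coo_matrix' object has no attribute '__getitem__'

I looked at the documentation for COO but was very confused by the following:

Sparse matrices can be used in arithmetic operations ...
Disadvantages of the COO format ... does not directly support: arithmetic operations

I used dok_matrix and it worked well, but I don't know whether it performs best in this case.

Thanks in advance.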

score 6 · Accepted Answer

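Try either lil_matrix or dok_matrix; these are easy to construct and inspect (but lil_matrix can be very slow, because each insertion takes linear time). Scikit-learn estimators that accept sparse matrices will accept any format and convert it internally to an efficient one (usually csr_matrix). You can also do the conversion yourself using the methods tocoo, todok, tocsr, etc. on scipy.sparse matrices.

Or, just use the CountVectorizer or DictVectorizer classes that scikit-learn provides for exactly this purpose. CountVectorizer takes entire documents as input: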
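
As a rough illustration of the dok_matrix route, here is a minimal sketch of the question's word_counts function that fills a dok_matrix and converts it to CSR at the end. The word_map dictionary below is a made-up vocabulary for the two example sentences, not something from the original post, and the explicit integer dtype is my own choice.

```python
import re

import numpy as np
import scipy.sparse as sp


def word_counts(texts, word_map):
    """Return a CSR matrix: one row per text, one column per word in word_map."""
    # dok_matrix supports cheap element-wise reads and writes while building.
    counts = sp.dok_matrix((len(texts), len(word_map)), dtype=np.int64)
    for row, text in enumerate(texts):  # enumerate visits every text
        for word in re.findall(r"[\w']+", text):
            col = word_map.get(word)
            if col is not None:  # skip words outside the vocabulary
                counts[row, col] += 1
    # Convert once, after construction, to a format suited to arithmetic.
    return counts.tocsr()


# Hypothetical vocabulary, chosen to reproduce the layout in the question.
word_map = {"dog": 0, "The": 1, "ran": 2, "boy": 3}
X = word_counts(["The dog ran", "The boy ran"], word_map)
print(X.toarray())
```

Note that range(0, len(texts)-1) in the question's loop skips the last sentence; iterating with enumerate avoids that. The final tocsr() call is optional if the matrix goes straight into a scikit-learn estimator.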

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> documents = ["The dog ran", "The boy ran"]
>>> vectorizer = CountVectorizer(min_df=0)
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[0, 1, 1, 1],
       [1, 0, 1, 1]])

... while DictVectorizer assumes you have already done the tokenization and counting, with the result stored in one dict per sample:

>>> from sklearn.feature_extraction import DictVectorizer
>>> documents = [{"the":1, "boy":1, "ran":1}, {"the":1, "dog":1, "ran":1}]
>>> vectorizer = DictVectorizer()
>>> X = vectorizer.fit_transform(documents)
>>> X.toarray()
array([[ 1.,  0.,  1.,  1.],
       [ 0.,  1.,  1.,  1.]])
>>> vectorizer.inverse_transform(X[0])
[{'ran': 1.0, 'boy': 1.0, 'the': 1.0}]
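
To show where the per-sample dicts might come from, here is a small sketch that tokenizes each sentence with the regex from the question, counts tokens with collections.Counter, and passes the result through DictVectorizer into MultinomialNB. The class labels, the lowercasing step, and the helper name tokenize_and_count are assumptions made for the example.

```python
import re
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["The dog ran", "The boy ran"]
classes = ["animal", "person"]  # made-up labels, for illustration only


def tokenize_and_count(text):
    """Hypothetical helper: lowercase, split with the question's regex, count."""
    return Counter(re.findall(r"[\w']+", text.lower()))


dict_vectorizer = DictVectorizer()
# Counter is a dict subclass, so DictVectorizer accepts it directly.
X_counts = dict_vectorizer.fit_transform([tokenize_and_count(t) for t in texts])

nb = MultinomialNB()
nb.fit(X_counts, classes)

# Words never seen during fit (here "a") are ignored by transform.
new_sample = dict_vectorizer.transform([tokenize_and_count("A dog ran")])
prediction = nb.predict(new_sample)
print(prediction)
```

This keeps tokenization under your control while leaving the sparse matrix construction to scikit-learn.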

(The min_df parameter to CountVectorizer was added a few releases ago. If you are using an older version, omit it, or rather, upgrade.)
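
CountVectorizer can also be set up to mimic the tokenization and column layout from the question. The sketch below assumes you want case-sensitive tokens split by the question's regex and a fixed word-to-column mapping; the word_map contents are hypothetical. It relies on the token_pattern, lowercase and vocabulary parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed mapping from word to column index (indices 0..n-1, no gaps).
word_map = {"dog": 0, "The": 1, "ran": 2, "boy": 3}

vectorizer = CountVectorizer(
    token_pattern=r"[\w']+",  # same regex as in the question
    lowercase=False,          # keep "The" distinct from "the"
    vocabulary=word_map,      # fix which column belongs to which word
)
X = vectorizer.fit_transform(["The dog ran", "The boy ran"])
print(X.toarray())
```

With a fixed vocabulary, words that are not in word_map are simply dropped, which matches the `if index != None` check in the question's loop.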

Edit: according to the FAQ, I must disclose my affiliation, so here goes: I'm the author of DictVectorizer and I also wrote parts of CountVectorizer.
