python - How to get a bag of words from text data?

Question

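I am working on a prediction problem using a large textual data set. I am implementing the bag-of-words model.

What is the best way to get the bag of words? Right now I have the tf-idf of the various words, but the number of words is too large to use in further steps. If I use the tf-idf criterion, what should the tf-idf threshold be for getting the bag of words? Or should I use some other algorithm? I am using Python.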

score 30 · Accepted Answer

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
             'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
                   for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>>
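Since the question is about a vocabulary that is too large, the per-document counters can also be used to prune it. The following is only a sketch under assumptions: the `min_docs` cut-off is a made-up value for this two-document toy corpus, not a recommended threshold, and document frequency stands in here for the tf-idf criterion mentioned in the question.

```python
import collections
import re

texts = ['John likes to watch movies. Mary likes too.',
         'John also likes to watch football games.']

# One Counter per document, built the same way as above.
bags = [collections.Counter(re.findall(r'\w+', txt)) for txt in texts]

# Document frequency: in how many documents does each word occur?
doc_freq = collections.Counter(word for bag in bags for word in bag)

# Hypothetical cut-off: keep only words that occur in at least 2 documents.
min_docs = 2
vocabulary = sorted(w for w, n in doc_freq.items() if n >= min_docs)

# Each document becomes a fixed-length count vector over the pruned vocabulary.
vectors = [[bag[w] for w in vocabulary] for bag in bags]

print(vocabulary)
print(vectors)
```

On a real corpus the cut-off would have to be tuned; an upper bound on document frequency is a common complement for dropping very frequent words.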

score 18 · Accepted Answer

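A bag of words can be defined as a matrix in which each row represents a document and each column represents an individual token. Note also that the order of the words in the text is not preserved. Building a "bag of words" involves 3 steps:

Tokenizing
Counting
Normalizing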
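The three steps can be sketched in plain Python. This is an illustration under assumptions, not how any particular library does it: the tokenizer is a simple lowercase `\w+` split, and "normalizing" is taken here to mean dividing each row by its total, which is only one possible choice (tf-idf weighting is another).

```python
import collections
import re

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games."]

# 1. Tokenizing: lowercase and split into word tokens (an assumed, simple rule).
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# 2. Counting: one row per document, one column per vocabulary token.
vocabulary = sorted({tok for toks in tokenized for tok in toks})
counters = [collections.Counter(toks) for toks in tokenized]
counts = [[c[tok] for tok in vocabulary] for c in counters]

# 3. Normalizing: divide each row by its total so that every row sums to 1.
normalized = [[n / sum(row) for n in row] for row in counts]

print(vocabulary)
print(counts)
```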

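Limitations to keep in mind:
1. It cannot capture phrases or multi-word expressions.
2. It is sensitive to misspellings; a spell corrector or a character-level representation can work around this.

For example: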

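from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]
# Learn the vocabulary and build the (sparse) document-term count matrix.
X = vectorizer.fit_transform(data_corpus)
print(X.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out())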

score 5 · Accepted Answer

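The bag-of-words model is a good text representation that can be applied to various machine learning tasks. As a first step, though, you need to clean the data of unneeded content such as punctuation, HTML tags, and stop words. For these tasks you can easily use Beautiful Soup (to remove HTML markup) or NLTK (to remove stop words) in Python. After cleaning the data, you need to create feature vectors (a numerical representation of the data for machine learning), and this is where bag of words comes in. scikit-learn has a module (the feature_extraction module) that can help you create bag-of-words features.

You may find everything you need, in detail, in this tutorial; this one can also be very helpful. I found both of them very useful.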
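The cleaning step described above can be roughed out with the standard library alone. This is only a sketch: the regex tag removal is a crude stand-in for Beautiful Soup, and `STOP_WORDS` is a tiny hypothetical list used for illustration in place of NLTK's stop-word corpus.

```python
import html
import re

# Hypothetical, tiny stop-word list for illustration; NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "the", "to", "of"}

def clean(raw):
    """Strip HTML tags and punctuation, lowercase, and drop stop words."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)    # crude tag removal, not a real HTML parser
    text = html.unescape(no_tags).lower()      # turn entities such as &amp; into characters
    words = re.findall(r"[a-z0-9']+", text)    # keep letters, digits and apostrophes
    return [w for w in words if w not in STOP_WORDS]

print(clean("<p>The movie is <b>great</b> &amp; fun to watch!</p>"))
# ['movie', 'great', 'fun', 'watch']
```

The resulting token lists can then be fed to whatever counting step is used to build the bag-of-words features.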

score 2 · Accepted Answer

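As others have already mentioned, using nltk would be your best option if you want something stable and scalable. It is highly configurable.

However, it has the downside of a quite steep learning curve if you want to tweak the defaults.

I once ran into a situation where I wanted a bag of words. The problem was that it concerned articles about technologies with exotic names full of -, _, and so on, such as vue-router or _.js.

For instance, the default configuration of nltk's word_tokenize splits vue-router into two separate words, vue and router. I am not even talking about _.js.

So, for what it's worth, I ended up writing this little routine to tokenize all the words into a list, based on my own punctuation criteria.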

import re

# Raw string, so the regex escapes are not treated as (invalid) string escapes.
punctuation_pattern = r' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'
text = "This article is talking about vue-router. And also _.js."
ltext = text.lower()
wtext = [w for w in re.split(punctuation_pattern, ltext) if w]

print(wtext)
# ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']

This routine can easily be combined with Patty3118's answer about collections.Counter, which could, for instance, tell you how many times _.js was mentioned in the article.
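As a sketch of that combination, the custom tokenizer can feed a `collections.Counter` directly. The sample sentence below is made up for illustration; the pattern is the same one as in the routine, written as a raw string.

```python
import collections
import re

punctuation_pattern = r' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'
# Made-up sample text for illustration.
text = "We use vue-router with _.js. Is _.js still maintained? _.js (and vue-router) say yes."

# Tokenize with the custom punctuation rules, dropping empty strings.
words = [w for w in re.split(punctuation_pattern, text.lower()) if w]

# Count the tokens; names such as vue-router and _.js survive intact.
bag = collections.Counter(words)

print(bag['_.js'])        # 3
print(bag['vue-router'])  # 2
```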

score 0 · Accepted Answer

From the book "Machine learning python":

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['blablablatext'])
bag = count.fit_transform(docs)
