python - Python - Cleaning data to run the Apriori algorithm

Question

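I have a master list of all the words used across a set of articles, and I am trying to count how many times each word in the master list occurs in each article. After that I want to build some association rules on the data. For example, my data might look like this: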

master_wordlist = ['dog', 'cat', 'hat', 'bat', 'big']
article_a = ['dog', 'cat', 'dog','big']
article_b = ['dog', 'hat', 'big', 'big', 'big']

I need to convert my data into this format:

Article        dog    cat    hat    bat    big
article_a      2      1      0      0      1
article_b      1      0      1      0      3

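I am struggling with this conversion. I have been playing with nltk, but I can't figure out how to get counts that also include the master-list words that do not appear in an article. Any help would be greatly appreciated!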

score 1 · Accepted Answer

You can use collections.Counter here:

from collections import Counter
master_wordlist = ['dog', 'cat', 'hat', 'bat', 'big']
article_a = ['dog', 'cat', 'dog','big']
article_b = ['dog', 'hat', 'big', 'big', 'big']

c_a = Counter(article_a)
c_b = Counter(article_b)

# A Counter returns 0 for a missing key, so absent words need no special handling
print([c_a[x] for x in master_wordlist])
print([c_b[x] for x in master_wordlist])

Output:

[2, 1, 0, 0, 1]
[1, 0, 1, 0, 3]
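
To get from these two lists to the full table in the question, the same idea can be wrapped in a loop over all articles. The following is only a minimal sketch: the `articles` dict and the helper names `count_table` and `presence_table` are illustrative assumptions, not part of the original answer. The 0/1 variant is included because Apriori works on itemsets, that is, on whether a word is present rather than on how often it occurs.

```python
from collections import Counter

master_wordlist = ['dog', 'cat', 'hat', 'bat', 'big']

# Hypothetical container: article name -> list of words in that article
articles = {
    'article_a': ['dog', 'cat', 'dog', 'big'],
    'article_b': ['dog', 'hat', 'big', 'big', 'big'],
}

def count_table(articles, vocabulary):
    """Return {article_name: [count of each vocabulary word, in vocabulary order]}."""
    table = {}
    for name, words in articles.items():
        counts = Counter(words)
        # Counter yields 0 for words it never saw, which fills the "missing" columns
        table[name] = [counts[word] for word in vocabulary]
    return table

def presence_table(table):
    """Collapse counts to 0/1 flags (word present or absent in the article)."""
    return {name: [int(c > 0) for c in row] for name, row in table.items()}

table = count_table(articles, master_wordlist)

# Print in the layout asked for in the question
print('{:<12}'.format('Article') + ''.join('{:>5}'.format(w) for w in master_wordlist))
for name, row in table.items():
    print('{:<12}'.format(name) + ''.join('{:>5}'.format(c) for c in row))
```

Each row is built in the order of `master_wordlist`, so the columns line up across articles even when a word never occurs in one of them.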
