
This is surely a simple question, but I can't seem to crack it. I have a string formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts, and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top

However, this gives me the top words for each post in the string separately.

I need an overall top-words list across all posts.
But if I take print top out of the for loop, it only gives me the results for the last post.
Does anyone have an idea?


4 Answers


This is a scoping problem. Also, you don't need to initialize the elements of a defaultdict, which simplifies your code.

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

As expected, this outputs

['data1', 'data3', 'data5', 'data2']

aggregated across both posts.

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as input, you don't need wordpunct_tokenize(), because the input data is already tokenized. In that case, the following will work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

It also outputs the expected result:

['data1', 'data3', 'data5', 'data2']
Answered 2013-05-04T14:38:18.177

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
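
Counter can also aggregate over the full posts structure directly: its update() method adds counts in place. A minimal sketch (Python 3 syntax), assuming the already-tokenized posts from the question:

```python
from collections import Counter

# The already-tokenized posts structure from the question.
posts = [["category1", ("data1", "data2", "data3")],
         ["category2", ("data1", "data3", "data5")]]

freq = Counter()
for cat, tokens in posts:
    freq.update(tokens)  # add this post's tokens to the running totals

print(freq.most_common(50))  # top words across all posts
```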
Answered 2013-05-04T14:53:26.620
from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize
texts=["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)

Output:

[('a', 4), ('yellow', 2), ('quick', 2)]

As you can see in the documentation for Counter.most_common, the returned list is sorted.

To use this with your code, you can do

texts = (x[1] for x in posts)

or you can do

... wordpunct_tokenize(x[1]) for x in texts ...
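
Combined, for posts that are (category, text) pairs, that might look like the sketch below (Python 3; str.split() stands in for wordpunct_tokenize() so the example has no NLTK dependency, and the posts data is made up for illustration):

```python
from itertools import chain
from collections import Counter

# Hypothetical (category, text) pairs.
posts = [("category1", "a quick brown car"),
         ("category2", "a quick yellow rose")]

# Tokenize each post's text and count everything in one pass.
top = Counter(chain.from_iterable(x[1].split() for x in posts)).most_common(3)
print(top)  # [('a', 2), ('quick', 2), ('brown', 1)]
```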

If your posts actually look like this:

posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]

you can strip out the categories like this:

texts = list(chain.from_iterable(x[1] for x in posts))

texts
['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer']

Then you can use that in the snippet at the top of this answer.
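
End to end, that looks like the following sketch (Python 3; again using str.split() in place of wordpunct_tokenize() to stay self-contained):

```python
from itertools import chain
from collections import Counter

posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# Flatten away the categories...
texts = list(chain.from_iterable(x[1] for x in posts))
# ...then tokenize and count every text in one pass.
counts = Counter(chain.from_iterable(t.split() for t in texts))
print(counts.most_common(3))  # [('a', 4), ('quick', 2), ('yellow', 2)]
```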

Answered 2013-05-04T14:52:23.923

Just change your code to process all the posts first, and only then take the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1
# get top after all posts have been processed.
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top
Answered 2013-05-04T14:38:09.253