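In addition to counting how often each word occurs in a set of documents, I would also like to count the number of distinct IDs associated with each word. It is easier to explain with an example: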
from collections import defaultdict
from pandas import DataFrame, Series
# one row per document; several rows can share the same ID
d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)
>>> df
  ID  words
0  a  apple banana apple strawberry banana lemon
1  a  apple
2  b  banana
3  c  banana lemon
4  c  kiwi
5  c  kiwi lemon
# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})
# turn into a DataFrame (list() keeps this working with Python 3 dict views)
dwc = {"word": Series(list(wc.keys())),
       "count": Series(list(wc.values()))}
dfwc = DataFrame(dwc)
>>> dfwc
   count        word
0      2        kiwi
1      1  strawberry
2      3       lemon
3      3       apple
4      4      banana
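Counting the word frequencies is the easy part, as shown above. What I would like is output like the following, which also gives the number of distinct IDs associated with each word: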
   count        word  ids
0      2        kiwi    1
1      1  strawberry    1
2      3       lemon    2
3      3       apple    1
4      4      banana    3
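Ideally, I would like this to happen in the same pass as the word-frequency count, but I'm not sure how to integrate it.
Any pointers would be much appreciated!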
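For what it's worth, here is a minimal sketch of the kind of single-pass approach I have in mind, not a definitive solution. It assumes a second defaultdict that maps each word to the set of IDs it has been seen with. The function name count_words_and_ids and the variable word_ids are made up for illustration, and the data is given as plain lists so the snippet runs without pandas (df.ID and df.words should work in their place).

```python
from collections import defaultdict


def count_words_and_ids(ids, lines):
    """Return (count, word, n_ids) rows from parallel iterables of IDs and text lines."""
    wc = defaultdict(int)        # word -> total number of occurrences
    word_ids = defaultdict(set)  # word -> distinct IDs the word appears under

    # Walk IDs and lines together so each word can be tied to its row's ID.
    for doc_id, line in zip(ids, lines):
        for word in line.split():
            wc[word] += 1
            word_ids[word].add(doc_id)  # a set ignores repeated IDs

    # One row per word: total count, the word, number of distinct IDs.
    return [(wc[word], word, len(word_ids[word])) for word in wc]


# Same data as the example above, as plain lists.
ids = ["a", "a", "b", "c", "c", "c"]
lines = ["apple banana apple strawberry banana lemon",
         "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"]

rows = count_words_and_ids(ids, lines)
for row in rows:
    print(row)
```

Passing the result to DataFrame(rows, columns=["count", "word", "ids"]) should then give a table with the three columns shown above, although the rows come out in order of first appearance rather than in the order I listed.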