Getting the words that occur in both sources is easy:
(set(dict1) & set(dict2))
Passing a dict to set creates a set of the dict's keys, and & is the set intersection operator.
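As a quick illustration of the intersection step, here is a minimal sketch. The names counts_a and counts_b and their contents are made up for the example; they are not the dicts used in the rest of this answer.

```python
# Hypothetical word counts from two sources (illustrative data only).
counts_a = {'cat': 20, 'dog': 40, 'fish': 3}
counts_b = {'cat': 22, 'dog': 38, 'bird': 5}

# set(d) iterates over the dict's keys, so this is the set of shared words.
common = set(counts_a) & set(counts_b)

# In Python 3 the key views support set operations directly as well.
common_via_views = counts_a.keys() & counts_b.keys()

print(sorted(common))
```

Words that appear in only one source ('fish' and 'bird' here) drop out of the intersection, so they never reach the significance test.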
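The simplest test of statistical significance we can do here is a chi-squared test, using a dummy variable to compare the "one vs. all" counts for each shared word. You can use scipy for this. Putting it all together, you can do something like this: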
from scipy.stats import chisquare
import numpy as np

dict1 = {'cat': 20, 'dog': 40}
dict2 = {'cat': 22, 'dog': 38}

def get_freqs_for_chisq(dict1, dict2):
    # For each shared word, yield the word itself, then one
    # [word count, count of all other words] array per dict.
    for key in set(dict1) & set(dict2):
        yield key
        for d in [dict1, dict2]:
            other_freq = sum(v for k, v in d.items() if k != key)
            freq = d[key]
            yield np.array([freq, other_freq])

freqs = get_freqs_for_chisq(dict1, dict2)
results = {}
while True:
    try:
        word = next(freqs)
        # Arguments are evaluated left to right: dict1's counts are the
        # observed values, dict2's counts are the expected values.
        results[word] = dict(zip(('chisq', 'P'),
                                 chisquare(next(freqs), f_exp=next(freqs))))
    except StopIteration:
        break
This gives you output like this:
{'cat': {'P': 0.59209697588539778, 'chisq': 0.28708133971291866},
'dog': {'P': 0.59209697588539778, 'chisq': 0.28708133971291866}}
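To see where these numbers come from, the 'cat' row can be reproduced by hand. This is a minimal sketch using only the standard library. It assumes two categories (the word vs. everything else), so there is one degree of freedom, in which case the chi-squared tail probability reduces to erfc(sqrt(x / 2)). The variable names are illustrative.

```python
import math

# 'cat' vs. all other words: dict1 is observed, dict2 is expected.
observed = [20, 40]
expected = [22, 38]

# Pearson's statistic: sum over categories of (O - E)^2 / E.
chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chisq / 2))

print(round(chisq, 4), round(p_value, 4))
```

Both rows come out identical because, with only two words, the 'dog' table is the 'cat' table with its entries swapped. A P value of about 0.59 gives no evidence that the word's relative frequency differs between the two sources.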