python - Information-theoretic measure: entropy calculation

Question

I have a corpus consisting of several thousand lines. For simplicity, let's treat the corpus as:

Today is a good day
I hope the day is good today
It's going to rain today
Today I have to study

How can I calculate the entropy of the corpus above? The formula for entropy is:

H = -\sum_{i} P_i \log P_i

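So far, this is my understanding: Pi is the probability of an individual symbol, computed as frequency(P) / (total num of characters). What I don't understand is the summation. How does the sum work in this particular formula?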

I am using Python 3.5.2 for statistical data analysis. It would be great if someone could help with a code snippet for calculating entropy.

score 2 · Accepted Answer

You can use SciPy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html) to calculate the entropy.
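As a minimal sketch of the SciPy route, assuming SciPy is installed and that you want character-level entropy: count each distinct character and pass the raw counts to scipy.stats.entropy, which normalizes them to probabilities when they do not sum to 1. The sample string and the variable names here are only illustrative.

```python
from collections import Counter

from scipy.stats import entropy

# Illustrative input; substitute your own text.
text = "Python is not so easy"

# Raw occurrence count of each distinct character.
counts = list(Counter(text).values())

# scipy.stats.entropy normalizes the counts to probabilities;
# base=2 reports the result in bits.
H = entropy(counts, base=2)
print(H)
```

Without the base argument, the result is in nats (natural logarithm).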

Or write something like this:

import math

def Entropy(string, base=2.0):
    # Collect the distinct symbols that occur in the string
    symbols = set(string)

    # Relative frequency (estimated probability) of each symbol
    pkvec = [string.count(c) / len(string) for c in symbols]

    # Shannon entropy: H = -sum(pk * log(pk)), converted to the requested base
    H = -sum(pk * math.log(pk) / math.log(base) for pk in pkvec)
    return H


print(Entropy("Python is not so easy"))

It returns 3.27280432733 (in bits, since the default base is 2).
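As for how the summation works: each distinct symbol contributes one term, -P_i * log(P_i), and the entropy is the total of those terms. Here is a rough sketch applied to the four sample lines from the question. The helper name entropy_from_counts is hypothetical, and the choices of what counts as a symbol (characters with line breaks ignored, or lowercased whitespace-separated words) are assumptions you may want to change.

```python
import math
from collections import Counter

# The four sample lines from the question.
corpus = [
    "Today is a good day",
    "I hope the day is good today",
    "It's going to rain today",
    "Today I have to study",
]

def entropy_from_counts(counts, base=2.0):
    """Shannon entropy of a frequency table (hypothetical helper)."""
    total = sum(counts.values())
    # One term per distinct symbol: -P_i * log(P_i), with P_i = count_i / total.
    terms = {
        sym: -(n / total) * math.log(n / total, base)
        for sym, n in counts.items()
    }
    # The summation in the formula is the sum of these per-symbol terms.
    return sum(terms.values()), terms

# Character level: every character is a symbol (line breaks are ignored).
char_counts = Counter("".join(corpus))
char_H, char_terms = entropy_from_counts(char_counts)

# Word level: lowercased whitespace-separated tokens are the symbols.
word_counts = Counter(word.lower() for line in corpus for word in line.split())
word_H, word_terms = entropy_from_counts(word_counts)

print("character entropy (bits):", char_H)
print("word entropy (bits):", word_H)
```

Counter makes a single pass over the text, which matters for a corpus of thousands of lines, whereas calling string.count once per distinct symbol rescans the whole string each time.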
