Let std::vector<int> counts be a vector of positive integers and let N := counts[0] + ... + counts[counts.size()-1] be the sum of its components. Setting p_i := counts[i]/N, I compute the entropy using the classic formula H = -(p_0*log2(p_0) + ... + p_n*log2(p_n)).

The counts vector changes over time (counts are incremented), and every 200 changes I recompute the entropy. After a quick Google and Stack Overflow search I couldn't find any method for incremental entropy computation. So the question: is there an incremental method for entropy computation, like the ones that exist for variance?

EDIT: The motivation for this question was to use such formulas for incremental information gain estimation in VFDT-like learners.

Resolved: See this mathoverflow post.


2 Answers


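I derived update formulas and algorithms for entropy and the Gini index and published a note on arXiv. (A working version of the note is available here.) See also this mathoverflow answer.

For convenience, I include simple Python code that demonstrates the derived formulas: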

from math import log
from random import randint

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# update entropy H of a sample with sum S when a new element x comes in
def update(H, S, x):
    new_S = S+x
    return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)

# entropy of the union of two samples with entropies H1, H2 and sums S1, S2
def merge(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy(L) using only the `update' function
def test(L):
    S = 0.0 # sum of the sample elements
    H = 0.0 # sample entropy
    for x in L:
        H = update(H, S, x)
        S = S+x
    return H

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])

# entry point
if __name__ == "__main__":
    L = [randint(1,100) for k in range(100)]
    M = [randint(100,1000) for k in range(100)]

    L_ent = entropy(L)
    L_sum = sum(L)

    M_ent = entropy(M)
    M_sum = sum(M)

    T = L+M

    # all three values should agree up to floating-point rounding
    print("Full = ", entropy(T))
    print("Merged = ", merge(L_ent, L_sum, M_ent, M_sum))
    print("Streamed = ", test(T))
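
The formula for the union of two samples (the four-argument function in the code above) is stated without derivation, so here is a short sketch of why it holds. It assumes the two samples are disjoint lists of components, as in T = L+M; the symbols a_i, b_j, S_1, S_2 are introduced only for this derivation and do not appear in the original answer.

```latex
% Assumptions: sample 1 has components a_i with sum S_1 and entropy H_1,
% sample 2 has components b_j with sum S_2 and entropy H_2,
% S = S_1 + S_2, and h(x) = -x \log_2 x.
\begin{align*}
H &= -\sum_i \frac{a_i}{S}\log_2\frac{a_i}{S} \;-\; \sum_j \frac{b_j}{S}\log_2\frac{b_j}{S} \\[4pt]
-\sum_i \frac{a_i}{S}\log_2\frac{a_i}{S}
  &= -\frac{S_1}{S}\sum_i \frac{a_i}{S_1}
     \left(\log_2\frac{a_i}{S_1} + \log_2\frac{S_1}{S}\right) \\
  &= \frac{S_1}{S}\,H_1 + h\!\left(\frac{S_1}{S}\right)
     \qquad \text{since } \sum_i \frac{a_i}{S_1} = 1 \\[4pt]
H &= \frac{S_1}{S}\,H_1 + h\!\left(\frac{S_1}{S}\right)
   + \frac{S_2}{S}\,H_2 + h\!\left(\frac{S_2}{S}\right)
\end{align*}
```

The single-element update is the special case where the second sample is one new component x, so that S_2 = x and H_2 = 0.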

answered 2013-06-21T11:44:10.620


K = count.size();
N = count[0] + ... + count[K - 1];
H = -(count[0]/N * log2(count[0]/N) + ... + count[K - 1]/N * log2(count[K - 1]/N))
  = log2(N) - h / N
h = count[0] * log2(count[0]) + ... + count[K - 1] * log2(count[K - 1])

which holds because log2(a / b) == log2(a) - log2(b).
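
Expanding each log2(count[i]/N) with this identity and using count[0] + ... + count[K - 1] = N gives H = log2(N) - h/N, where h is the sum of count[i] * log2(count[i]). A quick numeric check of that rearrangement, as a minimal sketch; the function names and the sample counts are illustrative and not part of the original answer:

```python
from math import log2

def entropy_classic(counts):
    """H = -sum(p_i * log2(p_i)) with p_i = c_i / N; zero counts contribute 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def entropy_from_h(counts):
    """Same quantity via H = log2(N) - h / N, with h = sum(c_i * log2(c_i))."""
    n = sum(counts)
    h = sum(c * log2(c) for c in counts if c > 0)
    return log2(n) - h / n

counts = [3, 1, 4, 1, 5]  # illustrative counts, not from the original post
# The two printed values should agree up to floating-point rounding.
print(entropy_classic(counts), entropy_from_h(counts))
```

The second form is the useful one for updates: only N and h need to be maintained, and H can be read off from them at any time.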

Now, given a vector count of the old observations so far and another vector of 200 new observations called batch, you can do the following in C++11:

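#include <cmath>
#include <numeric>
#include <vector>

void update_H(double& H, std::vector<int>& count, int& N, std::vector<int> const& batch)
{
    N += static_cast<int>(batch.size());
    for (auto b : batch) ++count[b];  // assumes each observation b is a valid index into count
    double h = std::accumulate(count.begin(), count.end(), 0.0, [](double acc, int elem) {
        return elem > 0 ? acc + elem * std::log2(elem) : acc;  // zero counts contribute nothing
    });
    H = std::log2(N) - h / N;
}
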
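Here I assume that you have encoded your observations as int. If you have some kind of symbols instead, you would need a symbol table std::map<Symbol, int> and do a lookup for each symbol in batch before updating count.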

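This seems to be the quickest way to write some code for a general update. If you know that only a few counts actually change in each batch, you can do as @migdal does: keep track of the counts that change, subtract their old contributions to the entropy, and add the new ones.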
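
The bookkeeping described in the last sentence can be sketched as follows. This is a minimal illustration, not code from either answer: EntropyTracker and its methods are hypothetical names, and the sketch assumes the entropy is written as H = log2(N) - h/N with h = sum of count[i] * log2(count[i]), so that incrementing one count only changes one term of h.

```python
from math import log2

class EntropyTracker:
    """Maintains N = sum(counts) and h = sum(c * log2(c)) so that
    H = log2(N) - h / N is available in O(1) after each increment.
    Hypothetical helper for illustration only."""

    def __init__(self, counts):
        self.counts = list(counts)
        self.n = sum(self.counts)
        self.h = sum(self._term(c) for c in self.counts)

    @staticmethod
    def _term(c):
        # Contribution of a single count to h; a zero count contributes 0.
        return c * log2(c) if c > 0 else 0.0

    def increment(self, i, by=1):
        # Subtract the old contribution of counts[i] and add the new one.
        old = self.counts[i]
        new = old + by
        self.h += self._term(new) - self._term(old)
        self.counts[i] = new
        self.n += by

    def entropy(self):
        return log2(self.n) - self.h / self.n if self.n > 0 else 0.0

tracker = EntropyTracker([3, 1, 4])   # illustrative starting counts
for i in [0, 1, 1, 2, 0]:             # five increments, O(1) work each
    tracker.increment(i)
print(tracker.counts, tracker.entropy())
```

Because h is accumulated in floating point, rounding error can build up over a very long stream of updates; recomputing h from the counts every so often is a cheap safeguard.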

answered 2013-06-14T09:08:00.867