Let std::vector<int> counts be a vector of positive integers and let N := counts[0] + ... + counts[counts.size()-1] be the sum of its components. Setting pi := counts[i]/N, I compute the entropy using the classic formula H = -(p0*log2(p0) + ... + pn*log2(pn)).

The counts vector keeps changing (counts are incremented), and every 200 changes I recompute the entropy. After a quick Google and Stack Overflow search I couldn't find any method for incremental entropy computation. So the question: is there an incremental method for computing entropy, like the ones for variance?

EDIT: The motivation for this question was the use of such formulas for incremental information gain estimation in VFDT-like learners.

Resolved: See this mathoverflow post.

2 Answers

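I derived update formulas and algorithms for entropy and the Gini index and posted a note on arXiv. (A working version of the note is available here.) See also this mathoverflow answer.

For convenience, I am including simple Python code that demonstrates the derived formulas: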

from math import log
from random import randint

# maps x to -x*log2(x) for x>0, and to 0 otherwise 
h = lambda p: -p*log(p, 2) if p > 0 else 0

# update entropy H of a sample with sum S when a new element x is appended
def update_single(H, S, x):
    new_S = S+x
    return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)

# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy(L) using only the `update_single' function
def test(L):
    S = 0.0 # sum of the sample elements
    H = 0.0 # sample entropy 
    for x in L:
        H = update_single(H, S, x)
        S = S+x
    return H

# compute entropy using the classic equation 
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])

# entry point 
if __name__ == "__main__":
    L = [randint(1,100) for k in range(100)]
    M = [randint(100,1000) for k in range(100)]

    L_ent = entropy(L)
    L_sum = sum(L)

    M_ent = entropy(M)
    M_sum = sum(M)

    T = L+M

    print("Full = ", entropy(T))
    print("Update = ", update(L_ent, L_sum, M_ent, M_sum))

answered 2013-06-21T11:44:10.620

You can recompute the entropy by updating the counts and using some simple mathematical identities to simplify the entropy formula:

K = count.size();
N = count[0] + ... + count[K - 1];
H = -(count[0]/N * log2(count[0]/N) + ... + count[K - 1]/N * log2(count[K - 1]/N))
  = log2(N) - h / N
h = count[0] * log2(count[0]) + ... + count[K - 1] * log2(count[K - 1])

This holds because log2(a / b) == log2(a) - log2(b) and count[0] + ... + count[K - 1] == N.

Now, given a vector count of the observations so far and another vector called batch holding the 200 new observations, you can do the following in C++11:

#include <cmath>
#include <numeric>
#include <vector>

void update_H(double& H, std::vector<int>& count, int& N, std::vector<int> const& batch)
{
    N += batch.size();
    for (auto b: batch)
       ++count[b];
    // h = sum of c * log2(c); zero counts contribute nothing
    auto h = std::accumulate(count.begin(), count.end(), 0.0, [](double acc, int elem) {
        return elem > 0 ? acc + elem * std::log2(elem) : acc;
    });
    H = std::log2(N) - h / N;
}

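Here I assume that you have encoded your observations as int. If you have some kind of symbol instead, you would need a symbol table std::map<Symbol, int> and a lookup for each symbol in batch before updating count.

This seems to be the quickest way to write code for a general update. If you know that only a few counts actually change in each batch, you can do as @migdal does: keep track of the counts that change, subtract their old contributions to the entropy, and add the new ones.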
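
The subtract-old, add-new idea can be sketched in Python as follows. This is only an illustration, not @migdal's code: the class name IncrementalEntropy and the helper entropy_from_counts are made up for this example. It keeps a running value h = sum of c * log2(c) over all counts and relies on the identity H = log2(N) - h / N, so each new observation costs O(1). A dict stands in for the symbol table.

```python
from math import log2


class IncrementalEntropy:
    """Hypothetical helper: entropy of a count table under single increments."""

    def __init__(self):
        self.counts = {}  # symbol -> count (plays the role of the symbol table)
        self.n = 0        # total number of observations, N
        self.h = 0.0      # running sum of c * log2(c) over all counts

    def add(self, symbol):
        c = self.counts.get(symbol, 0)
        if c > 0:
            self.h -= c * log2(c)  # subtract the old contribution
        c += 1
        self.h += c * log2(c)      # add the new contribution (0 when c == 1)
        self.counts[symbol] = c
        self.n += 1

    def entropy(self):
        if self.n == 0:
            return 0.0
        # H = log2(N) - (1/N) * sum_i c_i * log2(c_i)
        return log2(self.n) - self.h / self.n


def entropy_from_counts(counts):
    """Classic formula, used here only to cross-check the incremental value."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


tracker = IncrementalEntropy()
for s in "abacabad":
    tracker.add(s)
print(tracker.entropy(), entropy_from_counts(tracker.counts.values()))
```

Because h is updated by repeated subtraction and addition, rounding error may build up over very long streams; recomputing h from the counts every so often is one way to keep it in check.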

answered 2013-06-14T09:08:00.867