python - Entropy estimator based on the Lempel-Ziv algorithm in Python

Question

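This function estimates the entropy of a time series. It is based on the Lempel-Ziv compression algorithm. For a time series of length n, the entropy is estimated as: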

E = (1/n * SUM_i L_i)^-1 * ln(n)

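where L_i is the length of the shortest substring starting at position i that has not previously appeared in positions 1 to i-1. As n approaches infinity, the estimated entropy converges to the true entropy of the time series.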
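To make the definition concrete, here is a minimal brute-force sketch that computes each L_i for a short string and then applies the formula. The function names are hypothetical, and two choices are assumptions rather than part of the definition above: the first position is counted (its one-symbol substring is always new), and a position whose substrings have all been seen before contributes 0.

```python
import math

def shortest_new_substring_lengths(seq):
    """Return L_i for every start position i (0-based) of the string seq."""
    n = len(seq)
    lengths = []
    for i in range(n):
        history = seq[:i]  # everything strictly before position i
        length = 0         # assumption: stays 0 if no new substring exists
        for j in range(i + 1, n + 1):
            if seq[i:j] not in history:  # substring test on str
                length = j - i
                break
        lengths.append(length)
    return lengths

def lz_entropy(seq):
    """Apply E = (1/n * sum(L_i))^-1 * ln(n) to the string seq."""
    n = len(seq)
    return n / sum(shortest_new_substring_lengths(seq)) * math.log(n)

if __name__ == "__main__":
    print(shortest_new_substring_lengths("aaab"))
    print(lz_entropy("aaab"))
```

For "aaab" the lengths are 1, 2, 2, 1: "a" is new at the start, "aa" is the shortest new substring from the second position, "ab" from the third, and "b" from the fourth.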

There is already an implementation as a MATLAB function: https://cn.mathworks.com/matlabcentral/fileexchange/51042-entropy-estimator-based-on-the-lempel-ziv-algorithm?s_tid=prof_contriblnk

I want to implement it in Python, and this is what I did:

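import math

def contains(small, big):
    # Return True if the list `small` occurs as a contiguous run inside `big`.
    for i in range(len(big)-len(small)+1):
        if big[i:i+len(small)] == small:
            return True
    return False

def actual_entropy(l):
    n = len(l)
    sequence = [l[0]]  # history; grows by one element each time a new substring is found
    sum_gamma = 0

    for i in range(1, n):
        # Extend l[i:j] until it no longer occurs in the history.
        for j in range(i+1, n+1):
            s = l[i:j]
            if not contains(s, sequence):
                sum_gamma += len(s)
                sequence.append(l[i])
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae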

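However, I found that it becomes far too slow as the amount of data grows. For example, with a list of 23832 elements as input, the elapsed time looks like this (the data can be found here):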

0-1000: 1.7068431377410889 s
1000-2000: 18.561192989349365 s
2000-3000: 84.82257103919983 s
3000-4000: 243.5819959640503 s
...

I have thousands of such lists to process, and run times this long are unacceptable. How can I optimize this function to make it run faster?

score 3 · Accepted Answer

I played around with this a bit and tried a few different approaches from another thread on StackOverflow. This is the code I came up with:

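import math

def contains(small, big):
    # Both arguments are NumPy arrays of the same dtype; compare their raw bytes.
    # tobytes() replaces the deprecated tostring().
    haystack, needle = big.tobytes(), small.tobytes()
    pos = haystack.find(needle)
    # Skip byte matches that do not start on an element boundary.
    while pos != -1 and pos % big.itemsize != 0:
        pos = haystack.find(needle, pos + 1)
    return pos != -1

def actual_entropy(l):
    # l is expected to be a NumPy array.
    n = len(l)
    sum_gamma = 0

    for i in range(1, n):
        sequence = l[:i]  # history: everything before position i

        for j in range(i+1, n+1):
            s = l[i:j]
            if not contains(s, sequence):
                sum_gamma += len(s)
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae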

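Interestingly, converting the NumPy arrays to byte strings is faster than working with strings directly. A very rough benchmark of my code against yours, on my machine with the data you provided: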

   N:  my code - your code
1000:   0.039s -    1.039s
2000:   0.266s -   18.490s
3000:   0.979s -   74.761s
4000:   2.891s -  285.488s
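A rough timing like the one above could be collected with a small helper along these lines. This is only a sketch: `time_prefixes` is a hypothetical name, it assumes each row times the estimator on the first N elements, and the demo uses `sorted` as a stand-in workload rather than the real estimator and data.

```python
import time

def time_prefixes(func, data, sizes):
    """Time func(data[:n]) once for each n in sizes; return {n: seconds}."""
    timings = {}
    for n in sizes:
        start = time.perf_counter()
        func(data[:n])
        timings[n] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # Stand-in workload; swap in actual_entropy and the real input here.
    demo = list(range(4000))
    results = time_prefixes(sorted, demo, [1000, 2000, 3000, 4000])
    for n, seconds in results.items():
        print(f"{n}: {seconds:.3f}s")
```

A single run per size is noisy, so repeating each measurement and keeping the minimum would give steadier numbers.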

You may be able to make this even faster by parallelizing the outer loop.
