
I have some written English text and computed its entropy. But I noticed that LZ-based compression algorithms compress the text well beyond the limit that this entropy would suggest.

This is because the source modelling English text has memory, so the bound on compression is given by the entropy rate of the source, not by its (single-symbol) entropy.

I have seen the definition of the entropy rate for a source with memory, but I would like to know how to actually compute it: is there an algorithm, or pseudocode, for estimating the entropy rate of a text written in English?

Any ideas?

Thanks for the help.


1 Answer


In general, estimating the entropy rate of a source with memory is a hard problem, and when there are lots of long-distance dependencies (as there are in natural language) it's especially difficult. Essentially what you need to do is to construct a grammar of the language based on the sample, and then calculate the entropy rate of the grammar. The particular assumptions that you make when extracting that grammar are going to make a big difference in the entropy rate you end up with, so the best you'll be able to do is to estimate some fairly weak bounds on the entropy rate of the actual source.
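To make the "weak bounds" idea concrete, here is a minimal Python sketch of the classic block-entropy approach: estimate H(next character | previous n characters) from n-gram counts, and watch the estimate fall toward the entropy rate as n grows. The file name `sample.txt` is just a placeholder; for any realistic n you need a very large sample, otherwise undersampling biases the estimate downward.

```python
from collections import Counter
from math import log2

def conditional_entropy_estimate(text, order):
    """Estimate H(X_k | previous `order` symbols) in bits/char as
    H(blocks of order+1) - H(blocks of order). For a stationary source
    this is a decreasing upper bound on the entropy rate -- until the
    sample is too small to populate the n-gram table reliably."""
    def block_entropy(k):
        counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())
    return block_entropy(order + 1) - block_entropy(order)

sample = open("sample.txt").read()  # hypothetical input file
for n in range(6):
    print(n, conditional_entropy_estimate(sample, n))
```

For English text the order-0 estimate is roughly the familiar ~4 bits/char single-letter figure, and the higher-order estimates drop well below it, which is exactly the gap the question is asking about.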

A common practice is to estimate the entropy rate simply by compressing a large sample using a standard compression program like gzip, though as Cosma Shalizi points out that's a terrible idea. If you're going to use a general data compression algorithm, LZ76 is a better choice; Fermín Moscoso del Prado has a paper coming out looking at some of the alternatives.
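For illustration, here is a rough Python sketch of the LZ76 exhaustive-history parse (note: this is not the LZ77/LZ78 variant inside gzip), together with the common normalization c(n)·log2(n)/n, which converges to the entropy rate only asymptotically for stationary ergodic sources. The naive substring search makes it quadratic, so treat it as a reference implementation, not production code.

```python
from math import log2

def lz76_phrase_count(s):
    """Count phrases in the LZ76 exhaustive-history parse: each phrase is
    the shortest prefix of the remaining text that cannot be copied from
    an earlier position (overlapping copies are allowed)."""
    n, i, phrases = len(s), 0, 0
    while i < n:
        k = 1
        # extend the phrase while s[i:i+k] occurs starting before position i
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        phrases += 1
        i += k
    return phrases

def lz76_entropy_rate(s):
    """Common estimator: h ~ c(n) * log2(n) / n bits per symbol."""
    n = len(s)
    return lz76_phrase_count(s) * log2(n) / n
```

Even this converges slowly, which is part of why the choice of method matters so much in practice.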

While compression algorithms do implicitly give you a kind of grammar, it would be even better to use a grammar that accurately captures the long-distance dependencies in the language. Unfortunately, learning realistic natural-language grammars from samples of raw text is very much an open research problem. But there is some promising work on learning finite-state approximations to natural language, which could be used to produce good entropy rate estimates. Check out CSSR for one way of approaching the problem along those lines.
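Once you have a finite-state approximation (from CSSR or anything else), the entropy rate of the model itself is cheap to compute: weight each state's transition entropy by the stationary distribution. A minimal numpy sketch, assuming the model is given as a plain Markov transition matrix `P`:

```python
import numpy as np

def markov_entropy_rate(P):
    """Entropy rate (bits/symbol) of a stationary Markov chain:
    h = -sum_i pi_i sum_j P[i,j] log2 P[i,j], where pi is the
    stationary distribution (left eigenvector of P for eigenvalue 1)."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    logP = np.zeros_like(P)
    mask = P > 0
    logP[mask] = np.log2(P[mask])  # convention: 0 * log 0 = 0
    return float(-np.sum(pi[:, None] * P * logP))

# Toy two-state "grammar": sticky states give a low entropy rate.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(markov_entropy_rate(P))  # ~0.55 bits/symbol
```

The same weighted-transition-entropy formula applies to the ε-machines CSSR produces, since they are themselves finite-state Markov models over causal states.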

Answered 2011-03-11T17:04:47.690