text - Compressing text using recursive N-Grams

Question

I was just kicking around the idea of breaking up a large group of text into a single integer by using recursive 2-Gram storage until there is only one value left.

table pair
{
    id
    first_parent_id (points to -> this.id)
    second_parent_id (points to -> this.id)
}

For example, in the following code I have a 11 word sentence (twelve with the period). I could store each word pair in a database ("this" + "is" = ID #1) and then store each set of two wordpairs in the database (1 + 2 = ID #7), and repeat until I get down to only one word set left - which would be ID 12.

This is my group of words which I plan to compress.
---1---|--2-----|--3-----|-----4-|----5--|-------6-
-------7--------|--------8-------|-------9---------
----------------10---------------11----------------
------------------------12-------------------------

Then using the number "12" we can work backwards (if we have the same dataset)

------------------------12-------------------------
----------------10---------------11----------------
-------7--------|--------8-------|-------9---------
---1---|--2-----|--3-----|-----4-|----5--|-------6-
This is my group of words which I plan to compress.

While this would take a tremendous amount of work to compress/uncompress each string - it seems like it might have a use in some kind of archive work where the contents need to be stored - but are never read except in rare cases where the uncompression process isn't a problem.

Am I thinking about this correctly? Would the possible number of word sequences just be too great to store like this? (Imagine a 500 word document).

score 2 · Accepted Answer

为什么需要“二字词”来实现压缩？如果这不是严格要求，则有多种方法可以压缩具有不同场景的文本数据。这些大多被称为字典预处理。以下是可以应用于您的案例的列表：

计算单词出现次数并按频率降序对它们进行排序。您可以使用自定义编码方法使用前 N 个单词，其中 N 可由用户配置。您甚至可以使用动态编程等来优化 N。在实际编码中，编码一个标志以指示下一个符号是字典词还是直接编码的词。
构建二字或三字字符组合（包括空格、标点符号等）的直方图。然后使用未使用的字节值来编码那些经常看到的二元组或三元组。您甚至可以使用递归方法一遍又一遍地扫描以减少源文件。

就您而言，如果您考虑上述方法，则效率低下。因为，您似乎并不认为您需要一个非常大的数据来解码您的编码数据。要了解大多数压缩思想，最好编写一个非常简单的测试程序来分析它的输出。你最终会得到一个更强大、更稳定的算法。

这里有一些字典预处理器，我想到这些预处理器只是为了给你一个参考：

XWRT：最先进的字典预处理器之一。
DICT：高性能FreeArc归档器的预处理器（它是开源的）。有一篇关于它的文章。不幸的是，它是俄语的。
KWC：一个简单的测试字典预处理器，它用字典代码替换 6 克代码。看这里的讨论。
bpe2 V3：它基于 n-gram 替换。其他版本：V1，V2。此外，还有关于它的讨论。

score 1 · Accepted Answer

简而言之，是的，可能的序列数量可能太多而无法有效地做到这一点。更大的问题是这些单词映射以及每个映射之后的 n-gram 需要存储在某个地方，这将大大超过实际“压缩”的任何节省。

text - Compressing text using recursive N-Grams

2 回答 2

Related

Reference