
I'm looking at a Python script that maintains a huge number of strings. The strings are actually the full paths of log file names recorded during regression test runs. The number of files has grown so large that the script is starting to hit the limits of its virtual address space.

Strings like these have very little entropy, so I figure compressing them should work well. The problem is that the compression libraries I've looked at (such as those discussed in Python - Compress Ascii String) seem to compress each string individually, storing with it everything needed for decompression. Common sense suggests it would be far more space-efficient to compress the whole set of strings into a single global blob/dictionary and refer to the individual strings via some kind of short reference.

Here is some (pseudo)code to clarify the idea:

class TestResult:
    def __init__(self):
        self._Id          = None
        ....
        self.__appLogList = []
....
    def setAppLogList(self, currentAppLogList):
        self.__appLogList = currentAppLogList
        # what I want to do here instead is
        # for each s in currentAppLogList
        #   add s to a global compressed blob and
        #   store the reference in __appLogList

    def getAppLogList(self):
        return self.__appLogList
        # what I want to do here instead is
        # currentAppLogList = []
        # for each ref in __appLogList
        #   decompress ref into string s and append s to currentAppLogList
        # return currentAppLogList

# for each test
# add new TestResult to result map
# setAppLogList

# for selected tests
# getAppLogList and access logs
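
To make the idea more concrete, here is a rough sketch of the scheme I have in mind (StringBlob, its batch size, and the choice of zlib are all just illustrative; zlib stands in for whatever compressor ends up being used):

import zlib

class StringBlob(object):
    """ Store strings in zlib-compressed batches; hand out (batch, index) refs. """
    BATCH_SIZE = 1000

    def __init__(self):
        self._batches = []   # compressed, newline-joined batches of strings
        self._pending = []   # strings waiting until a full batch accumulates

    def add(self, s):
        """ Add one string (assumed newline-free) and return its reference. """
        ref = (len(self._batches), len(self._pending))
        self._pending.append(s)
        if len(self._pending) == self.BATCH_SIZE:
            self._batches.append(zlib.compress('\n'.join(self._pending), 9))
            self._pending = []
        return ref

    def get(self, ref):
        """ Return the original string for ref, decompressing its batch. """
        batch_no, index = ref
        if batch_no == len(self._batches):   # batch not yet compressed
            return self._pending[index]
        return zlib.decompress(self._batches[batch_no]).split('\n')[index]

setAppLogList would then store the refs returned by add(), and getAppLogList would map get() over them.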

Can this be done with existing, publicly available Python libraries?


4 Answers


Here is the simplest approach I can think of. It "compresses" the data by storing each unique directory path only once, in a dictionary; all the files in a given directory are stored in a list. For testing purposes, the sample code below just reads an input file consisting of one full file path per line.

from collections import defaultdict
import os

directories = defaultdict(list)  # directory path -> names of files in it

with open('log_file_paths.txt') as inf:
    for path in (line.strip() for line in inf):
        dir_path, file_name = os.path.split(path)
        directories[dir_path].append(file_name)

# show what was stored
for path in directories:
    print path
    for file_name in directories[path]:
        print '   ', file_name
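
To use this inside your TestResult class, the directory strings need to be shared rather than copied once per path. A sketch of what that could look like (make_ref/make_path are illustrative names, and intern() is the Python 2 built-in that keeps a single shared copy of equal strings):

def make_ref(full_path):
    """ Split a path into a (shared dir string, file name) reference. """
    dir_path, file_name = os.path.split(full_path)
    return intern(dir_path), file_name   # intern() shares one copy per directory

def make_path(ref):
    """ Rebuild the original full path from a reference. """
    return os.path.join(ref[0], ref[1])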
answered 2013-10-18T17:05:40.317

This is a variation, or extension, of my previous answer. The primary difference is that it's able to "compress" the strings potentially much more by taking advantage of the fact that they're actually file paths with possibly one or more common sub-components. This redundancy is eliminated by building a tree data structure from the information, meaning that at any given level each unique component value is only stored once. The tree data structure itself is implemented using what is essentially a dictionary of dictionaries, and its creation ought to be relatively fast due to its use of the built-in reduce() function.

The updated version below now has the property that, unlike the previous one, what's created can be pickled, which means it can be saved to a file and read back intact later.

Like my other answer, this isn't doing what you call "real" compression; however, I think doing that might be overkill for what you're trying to accomplish, since this approach would solve your problem, is relatively quick, and would be much simpler to maintain, enhance, and extend later.

import collections
import cPickle as pickle  # use C version for performance
import operator
import os

class DefaultDict(collections.defaultdict):
    """  pickle-able defaultdict """
    def __reduce__(self):
        args = (self.default_factory,) if self.default_factory else tuple()
        return type(self), args, None, None, self.iteritems()

def Tree(): return DefaultDict(Tree)
class Leaf(object): pass

def print_tree(d, level=0, indent=' '*4):
    """ Tree structure pretty printer """
    if not d:
        print indent * level + '<empty>'
    else:
        for key, value in sorted(d.iteritems()):
            print indent * level + str(key)
            if isinstance(value, dict):
                print_tree(value, level+1, indent)
            elif not isinstance(value, Leaf):
                print indent * (level+1) + str(value)

# create Tree structure from paths
tree = Tree()
LEAF = Leaf()
with open('log_file_paths.txt') as inf:
    for line in inf:
        path = line.strip().split(os.sep)
        reduce(operator.getitem, path[:-1], tree)[path[-1]] = LEAF
print_tree(tree)

# save tree to a file
with open('file_tree.pk', 'wb') as outf:
    pickle.dump(tree, outf, pickle.HIGHEST_PROTOCOL)

# try reading it back in
del tree
with open('file_tree.pk', 'rb') as inf:
    tree = pickle.load(inf)
print_tree(tree)  # display reconstituted tree

So if the input file consisted of these file paths:

tests/first/set1.log
tests/first/set2.log
tests/first/subfirst/set3.log
tests/first/subfirst/set4.log
tests/second/set5.log
tests/second/subsecond/set6.log
tests/second/subsecond/subsubsecond/set7.log
tests/third/set8.log

The following tree structure is what gets created in memory and displayed (and later written, read back, and redisplayed) to hold them:

tests
    first
        set1.log
        set2.log
        subfirst
            set3.log
            set4.log
    second
        set5.log
        subsecond
            set6.log
            subsubsecond
                set7.log
    third
        set8.log
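
Recovering the full path strings from the tree isn't shown above, but it's a short recursive walk. A sketch assuming the Tree/Leaf definitions from the code (iter_paths is an illustrative name, not part of the original answer):

def iter_paths(d, prefix=()):
    """ Yield the original full paths stored in the tree. """
    for key, value in sorted(d.iteritems()):
        if isinstance(value, Leaf):
            yield os.sep.join(prefix + (key,))
        else:
            for path in iter_paths(value, prefix + (key,)):
                yield path

for path in iter_paths(tree):
    print path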
answered 2013-10-22T12:59:11.823

At the moment I only have a Java solution. You can use it as a standalone script; just change the value in the code to the path of your file and run it.

You can modify the search method as needed, and the project also has an ant script that you can customize.

For example, if you're searching for the file name:

sipproxy0-dc0-auto-node-aware-10.20131021_213407_288.snapshot.log

you have to provide the full path, for example:

C:\ccview\aglagole_2_cloudute\tserver_ddpsf\siptserver\CloudUte\unittest\worker_log\agragole\logs\sipproxy0-dc0-auto-node-aware-10.20131021_213407_288.snapshot.log

To lower memory usage, I suggest generating a hash for the common part of each line (Google's Guava library has a very good implementation of this):

C:\ccview\aglagole_2_cloudute\tserver_ddpsf\siptserver\CloudUte\unittest\worker_log\aglagole\logs
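
The same idea sketched in Python, since the rest of this page is Python (hashlib stands in for Guava's hashing, and the helper names and the 8-character digest length are illustrative assumptions, not part of the original suggestion):

import hashlib
import os

prefix_table = {}   # short digest -> full directory prefix, stored only once

def store_path(full_path):
    """ Hash the directory part, store it once, return a compact reference. """
    dir_path, file_name = os.path.split(full_path)
    key = hashlib.md5(dir_path).hexdigest()[:8]   # truncation risks collisions
    prefix_table[key] = dir_path
    return key, file_name

def load_path(ref):
    """ Rebuild the full path from a (digest, file name) reference. """
    key, file_name = ref
    return os.path.join(prefix_table[key], file_name)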

Let me know if you run into any out-of-memory problems.

answered 2013-11-08T05:26:59.667

I found no off-the-shelf implementation that I could apply to the problem within my time constraints; at this point one does not seem to exist. Below are some of the directions I considered.

Suffix arrays are the contemporary alternative to the suggested tries. Suffix arrays can be used to implement substring search very efficiently; however, it's not clear how to apply them to my problem. For an overview of suffix arrays, see: Puglisi, S. J., Smyth, W. F., & Turpin, A. H. (2007). A taxonomy of suffix array construction algorithms. ACM Computing Surveys (CSUR), 39(2), 4.

This research applies directly to my problem: Brisaboa, N. R., Cánovas, R., Claude, F., Martínez-Prieto, M. A., & Navarro, G. (2011). Compressed string dictionaries. In Experimental Algorithms (pp. 136-147). Springer Berlin Heidelberg. Unfortunately, there is no implementation available to download and reuse.

Grammar-based parsers: Larsson, N. J., & Moffat, A. (2000). Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 1722-1732. An implementation is available. To compute the grammar, all the strings have to be in memory: I could probably collect the first 1000 or so strings and compute a grammar from those, or I could do a full run (until memory fails), record all the strings, and compute the grammar afterwards. Possible problems: the strings evolve over time (the names include dates/times); the compression ratio is not optimal; and it's unclear what happens when a string does not fit the precomputed grammar (perhaps such strings would simply be left uncompressed). This solution is less efficient than the research above and would still require considerable coding and testing effort.

answered 2013-10-21T16:55:59.147