This is a variation or extension of my previous answer. The primary difference is that it's able to "compress" the strings potentially much more by taking advantage of the fact that they're actually file paths with possibly one or more common sub-components. This redundancy is eliminated by building a tree data structure from this information meaning that for any given level, each unique component value is only stored once. The tree data structure itself is implemented using what is essentially a dictionary of dictionaries. Its creation ought to be relatively fast due to it use of the built-in reduce()
function.
The updated version below now has the property that -- unlike with previous one -- what's created can be pickled, which means it can be saved and read back intact later from a file.
Like my other answer, this isn't doing what you call "real" compression, however I think doing something like might be overkill for what you're trying to accomplish since this would solve your problem, is relatively quick, plus would be much simpler to maintain, enhance, and extend later.
import collections
import cPickle as pickle # use C version for performance
import operator
import os
class DefaultDict(collections.defaultdict):
""" pickle-able defaultdict """
def __reduce__(self):
args = (self.default_factory,) if self.default_factory else tuple()
return type(self), args, None, None, self.iteritems()
def Tree(): return DefaultDict(Tree)
class Leaf(object): pass
def print_tree(d, level=0, indent=' '*4):
""" Tree structure pretty printer """
if not d:
print indent * level + '<empty>'
else:
for key, value in sorted(d.iteritems()):
print indent * level + str(key)
if isinstance(value, dict):
print_tree(value, level+1, indent)
elif not isinstance(value, Leaf):
print indent * (level+1) + str(value)
# create Tree structure from paths
tree = Tree()
LEAF = Leaf()
with open('log_file_paths.txt') as inf:
for line in inf:
path = line.strip().split(os.sep)
reduce(operator.getitem, path[:-1], tree)[path[-1]] = LEAF
print_tree(tree)
# save tree to a file
with open('file_tree.pk', 'wb') as outf:
pickle.dump(tree, outf, pickle.HIGHEST_PROTOCOL)
# try reading it back in
del tree
with open('file_tree.pk', 'rb') as inf:
tree = pickle.load(inf)
print_tree(tree) # display reconstituted tree
So if the input file consisted of these file paths:
tests/first/set1.log
tests/first/set2.log
tests/first/subfirst/set3.log
tests/first/subfirst/set4.log
tests/second/set5.log
tests/second/subsecond/set6.log
tests/second/subsecond/subsubsecond/set7.log
tests/third/set8.log
The following tree structure is what gets created in memory and displayed (and later written, read back, and redisplayed) to hold them:
tests
first
set1.log
set2.log
subfirst
set3.log
set4.log
second
set5.log
subsecond
set6.log
subsubsecond
set7.log
third
set8.log