3

I have about 200,000 text files packed into a bz2 file. The problem is that scanning the bz2 file to extract the data I need is extremely slow: it has to read through the entire bz2 file to find the single file I am looking for. Is there any way to speed this up?

I also thought about organizing the files in the tar.bz2 so that the reader knows where to look. Is there any way to organize the files that go into a bz2?

More Info/Edit: I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files and compresses just as thoroughly?

4

2 Answers

6

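Do you have to use bzip2? Reading its documentation, it is quite clear that it was not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, though of course it may compress worse.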
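To illustrate the random-access point, here is a minimal sketch using Python's standard `zipfile` module. The helper names `build_zip` and `read_member` and the sample file names are made up for the example, and the archive is kept in memory only to keep the sketch self-contained; a real archive would live on disk.

```python
import io
import zipfile


def build_zip(files):
    """Pack a {name: text} mapping into a Zip archive held in memory."""
    buf = io.BytesIO()
    # ZIP_DEFLATED is the classic choice; Python 3.3+ also accepts
    # zipfile.ZIP_BZIP2, which compresses each member with bzip2.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_member(archive, name):
    """Read one member by name without decompressing the others."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # ZipFile looks the name up in the archive's central directory,
        # seeks to that member, and decompresses only its data.
        return zf.read(name).decode("utf-8")


if __name__ == "__main__":
    archive = build_zip({"a.txt": "alpha", "b.txt": "beta", "c.txt": "gamma"})
    print(read_member(archive, "b.txt"))
```

Because every member is compressed on its own, similar files cannot share redundancy, which is why the total size is usually larger than a single tar.bz2 of the same data.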

answered 2010-08-16T14:30:03.003
0

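Bzip2 compresses in large blocks (900 kB by default, I believe). One method that would speed up scanning the tar file dramatically, at the cost of a worse compression ratio, is to compress each file individually and then tar the results together. This is essentially what a Zip-format file is (though Zip uses zlib compression rather than bzip2). You could then easily read the tar's file listing and decompress only the specific file(s) you are looking for.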
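The compress-each-file-then-tar idea can be sketched with Python's standard `bz2` and `tarfile` modules. This is only an illustration: the helper names `build_tar_of_bz2` and `read_one`, the `.bz2` member suffix, and the in-memory archive are assumptions made for the example.

```python
import bz2
import io
import tarfile


def build_tar_of_bz2(files):
    """Compress each text individually with bzip2, then store the results
    in a plain (uncompressed) tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in files.items():
            payload = bz2.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name + ".bz2")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_one(archive, name):
    """Locate one member in the tar and decompress only that member."""
    # mode "r:" means "plain tar, no outer compression", so tarfile can
    # seek from header to header without decompressing any payload.
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        member = tar.extractfile(name + ".bz2")
        return bz2.decompress(member.read()).decode("utf-8")


if __name__ == "__main__":
    archive = build_tar_of_bz2({"a.txt": "alpha", "b.txt": "beta"})
    print(read_one(archive, "b.txt"))
```

Note that tar has no central index, so the lookup still walks the member headers one by one; that walk is cheap because it skips over the compressed payloads instead of decompressing them.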

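I don't think most tar programs offer much ability to organize the files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries, though I have only used them once or twice). However, you would still have the problem of having to decompress most of the data before you found what you were looking for.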
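As a rough sketch of such a special-case program, the following uses Python's standard `tarfile` module to write members in sorted order and then streams through the archive until the wanted name is reached or passed. Sorting by name is just one possible organization, and the helper names `build_sorted_tar_bz2` and `find_streaming` are invented for the example.

```python
import io
import tarfile


def build_sorted_tar_bz2(files):
    """Write a {name: text} mapping into a tar.bz2, members sorted by name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name in sorted(files):
            data = files[name].encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def find_streaming(archive, wanted):
    """Stream through the archive; return (text or None, members scanned)."""
    scanned = 0
    # "r|bz2" reads the archive strictly front to back, which matches how
    # the bzip2 stream has to be decompressed anyway.
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r|bz2") as tar:
        for member in tar:
            scanned += 1
            if member.name == wanted:
                text = tar.extractfile(member).read().decode("utf-8")
                return text, scanned
            if member.name > wanted:
                break  # sorted order: the wanted name cannot appear later
    return None, scanned


if __name__ == "__main__":
    archive = build_sorted_tar_bz2({"c.txt": "gamma", "a.txt": "alpha", "b.txt": "beta"})
    print(find_streaming(archive, "b.txt"))
```

The sorted order only lets the scan stop early or give up early. Everything stored before the wanted member still has to be decompressed, so a file near the end of the archive costs almost as much as a full scan.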

answered 2010-08-16T14:31:28.677