I need to store a trillion lists of URLs, where each list will contain roughly 50 URLs. What is the most space-efficient way to compress them for on-disk storage?
I was thinking of first stripping useless information such as "http://", then building a minimal finite state automaton and saving that.
Another option is to build one comma-separated string of URLs and compress that string with a general-purpose compressor such as GZIP or BZ2.
If I don't care about speed, which solution would yield the best compression?
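To make the second option concrete, here is a minimal sketch (the sample URLs are made up, and a real list would hold ~50 real entries) that joins one list with commas and compresses it with gzip and bz2 at their highest compression levels:

    import bz2
    import gzip

    # Hypothetical sample list: in reality each of the ~1 trillion lists holds ~50 URLs.
    urls = [
        "http://www.example.com/about-us",
        "http://www.example.com/careers",
        "http://blog.example.com/2014/01/hello-world",
    ] * 17  # pad to roughly 50 entries

    # Option 2: join the list with commas and run a general-purpose compressor over it.
    raw = ",".join(urls).encode("utf-8")
    print("raw: ", len(raw), "bytes")
    print("gzip:", len(gzip.compress(raw, compresslevel=9)), "bytes")
    print("bz2: ", len(bz2.compress(raw, compresslevel=9)), "bytes")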
Given the number of URLs and the fact that most of them use more or less the same structures and naming patterns, I would go with an index and a tokenizer. First use a tokenizer to gather as many words as possible and save them in an index. You can then replace each token by its index in the list:
http://www.google.com/search?q=hello+world (42 bytes) would give you:
http:// => 1
www. => 2
google.com => 3
search => 4
hello => 5
world => 6
and the URL becomes: 1,2,3,'/',4,'?','q','=',5,'+',6
Given that many URLs will be subdomains of a few large domains and that most of them will use the same common English words (think of all the "about us" or "careers" pages...), you will probably end up with a fairly small index (there are about 50,000 common words in English and 70,000 in French).
You can then compress the index and the tokenized URLs to gain even more space.
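A minimal sketch of that pipeline in Python (the token regex, the sample lists, and the JSON container are assumptions made for illustration, not a reference implementation):

    import gzip
    import json
    import re

    # Rough token pattern: scheme, "www.", domain names, words, and single punctuation marks.
    TOKEN_RE = re.compile(
        r"https?://|www\.|[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+|[A-Za-z0-9]+|[^A-Za-z0-9]"
    )

    def tokenize(url, index):
        """Split a URL into tokens and map each token to a small integer id."""
        ids = []
        for token in TOKEN_RE.findall(url):
            if token not in index:
                index[token] = len(index)  # assign the next free id
            ids.append(index[token])
        return ids

    index = {}  # token -> integer id, shared across all lists
    url_lists = [
        ["http://www.google.com/search?q=hello+world",
         "http://www.example.com/about-us"],
    ]
    tokenized = [[tokenize(url, index) for url in urls] for urls in url_lists]

    # Compress both the index (tokens in id order) and the tokenized lists.
    payload = json.dumps({"tokens": list(index), "lists": tokenized}).encode("utf-8")
    print("compressed size:", len(gzip.compress(payload, compresslevel=9)), "bytes")

In practice you would store the id sequences with a variable-length integer encoding rather than JSON, but even this crude version makes it easy to compare the index-plus-gzip size against gzipping the raw comma-separated strings.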
There are O(n) and O(n log n) algorithms for parsing the URLs and building the index.
After investigating, it seems that using GZIP compression alone works better than using only a compact directed acyclic word graph!