
Very often I have to deal with very large binary files (from 50 to 500 GB), in different formats, which basically contain mixed data including strings.

I need to index the strings inside the file, creating a database or an index, so I can do quick searches (basic search, or complex searches with regex). The output of a search should of course be the offset of the found string in the binary file.

Does anyone know a tool, framework or library which can help me on this task?


1 Answer


You can run `strings -t d` on the file (Linux / OS X) to extract the strings along with their corresponding byte offsets, and then put those into Solr or Elastic. If you want more than just ASCII, it gets more complicated.
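As a rough illustration of what `strings -t d` produces, here is a minimal Python sketch that scans a byte buffer for runs of printable ASCII and reports each run's decimal offset. The regex, the minimum run length of 4 (the `strings` default), and the function name are assumptions for illustration, not part of any tool's API:

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Yield (offset, text) pairs for runs of printable ASCII,
    roughly mimicking `strings -t d` output. min_len=4 mirrors
    the strings(1) default; both are assumptions for this sketch."""
    # Printable ASCII (0x20-0x7e) plus tab, repeated at least min_len times.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    for m in pattern.finditer(data):
        yield m.start(), m.group().decode("ascii")

# Hypothetical sample buffer mixing binary bytes and text:
blob = b"\x00\x01hello world\x02\xffOFFSET\x00"
print(list(extract_strings(blob)))
# -> [(2, 'hello world'), (15, 'OFFSET')]
```

Each `(offset, text)` pair could then be sent to Solr or Elasticsearch as a document, with the offset stored as a field so search results map straight back to positions in the binary file. For files in the 50-500 GB range you would read the file in chunks (e.g. via `mmap`) rather than loading it whole.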

Autopsy has its own string-extraction code (for UTF-8 and UTF-16) and puts the results into Solr (using Tika when the file format is supported), but it does not record offsets into the binary file, so it probably won't meet your needs.

Answered 2016-10-28T18:03:41.227