
I have a large number of large log files (each log file is around 200 MB, and I have 200 GB of data in total).

Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 of them into a new file.

First I used grep with 1 parameter; then LC_ALL=C made it a little faster, and fgrep was slightly faster still. Then I used parallel:

parallel -j 2 --pipe --block 20M

and finally, I was able to extract 1 parameter from each 200 MB in 5 seconds.

BUT... when I pass multiple patterns to one grep call,

parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt

then the time for the grep operation increased linearly (it now takes quite a few minutes to grep 1 file). (Note that I had to use egrep for the alternation; somehow plain grep didn't accept it.)

Is there a faster/better way to solve this problem?

Note that I don't need regex, because the patterns I am looking for are fixed strings. I just want to extract the lines that include a particular string.
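A minimal sketch of the extraction I mean, with made-up parameter names and a tiny stand-in log file (`grep -F` treats each pattern as a fixed string, which is what I want):

```shell
# Hypothetical miniature log standing in for one of the 200 MB files.
cat > log.txt <<'EOF'
2013-07-04T10:00:00 param1=42
2013-07-04T10:00:00 param2=7
2013-07-04T10:00:00 param3=99
EOF

# The wanted parameter names, one fixed string per line.
cat > patterns.txt <<'EOF'
param1
param3
EOF

# -F skips the regex engine entirely; -f reads all patterns from a file;
# LC_ALL=C avoids slow multibyte locale handling.
LC_ALL=C grep -F -f patterns.txt log.txt
```

This prints the param1 and param3 lines. One caveat: `-F` matches substrings, so the pattern `param1` would also match a line containing `param10`.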


2 Answers


To back up the comments above, I ran another test. I took a file produced by the command md5deep -rZ (size: 319 MB) and randomly selected 100 md5 checksums from it (each 32 characters long).

time egrep '100|fixed|strings' md5 >/dev/null

The time:

real    0m16.888s
user    0m16.714s
sys     0m0.172s

And for

time fgrep -f 100_lines_patt_file md5 >/dev/null

the time is now

real    0m1.379s
user    0m1.220s
sys     0m0.158s

Nearly 15 times faster than egrep.

So, when you see only a 0.3-second improvement between egrep and fgrep, IMHO that means:

  • your IO is slow

The egrep computation time isn't limited by CPU or memory but by IO, and (IMHO) therefore you won't gain much by switching to fgrep.

Answered 2013-07-04T18:16:43.627

Interestingly, compressing the log files to .gz and using zgrep -E reduced the time dramatically. Also, it didn't matter whether I searched for 1 pattern or many patterns in a single zgrep command; it took only about 1 second per 200 MB file.
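A small sketch of this approach (file and parameter names are made up; zgrep decompresses on the fly and forwards its options to grep, so fixed-string matching works here too):

```shell
# Hypothetical miniature log standing in for a 200 MB file.
printf '%s\n' 'ts param1=1' 'ts param2=2' 'ts param3=3' > log.txt
gzip -f log.txt                       # produces log.txt.gz

# Search the compressed file directly; -F makes each -e pattern
# a fixed string, avoiding the regex engine.
zgrep -F -e param1 -e param3 log.txt.gz
```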

Answered 2013-07-05T17:39:33.960