
I have a large number of large log files (each log file is around 200 MB, and I have 200 GB of data in total).

Every 10 minutes, the server writes about 10K parameters (with a timestamp) to the log file. Out of each 10K parameters, I want to extract 100 of them into a new file.

First I used grep with 1 parameter; then LC_ALL=C made it a little faster, and fgrep was slightly faster still. Then I used parallel:

parallel -j 2 --pipe --block 20M

and finally, I was able to extract 1 parameter from each 200 MB in 5 seconds.

BUT... when I pass multiple patterns to one grep call,

parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt

then the time for the grep operation increased linearly (it now takes quite a few minutes to grep 1 file). (Note that I had to use egrep for the alternation; somehow plain grep didn't accept it.)

Is there a faster/better way to solve this problem?

Note that I don't need regex, because the patterns I am looking for are fixed strings. I just want to extract the lines that include a particular string.
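A minimal sketch of the extraction I mean, with made-up parameter names and a tiny stand-in log file (`grep -F` treats each pattern as a fixed string, which is what I want):

```shell
# Hypothetical miniature log standing in for one of the 200 MB files.
cat > log.txt <<'EOF'
2013-07-04T10:00:00 param1=42
2013-07-04T10:00:00 param2=7
2013-07-04T10:00:00 param3=99
EOF

# The wanted parameter names, one fixed string per line.
cat > patterns.txt <<'EOF'
param1
param3
EOF

# -F skips the regex engine entirely; -f reads all patterns from a file;
# LC_ALL=C avoids slow multibyte locale handling.
LC_ALL=C grep -F -f patterns.txt log.txt
```

This prints the param1 and param3 lines. One caveat: `-F` matches substrings, so the pattern `param1` would also match a line containing `param10`.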


2 Answers


To back up the comments above, I ran another test. I took a file produced by the command md5deep -rZ (size: 319 MB) and randomly selected 100 md5 checksums from it (each 32 characters long).

time egrep '100|fixed|strings' md5 >/dev/null

The time:

real    0m16.888s
user    0m16.714s
sys     0m0.172s

And for

time fgrep -f 100_lines_patt_file md5 >/dev/null

the time is now

real    0m1.379s
user    0m1.220s
sys     0m0.158s

Nearly 15 times faster than egrep.

So, when you see only a 0.3-second improvement between egrep and fgrep, IMHO that means:

  • your IO is slow

The egrep computation time isn't limited by CPU or memory but by IO, and (IMHO) therefore you won't gain much by switching to fgrep.

Answered 2013-07-04T18:16:43.627

Interestingly, compressing the log files to .gz and using zgrep -E reduced the time dramatically. Also, it didn't matter whether I searched for 1 pattern or many patterns in a single zgrep command; it took only about 1 second per 200 MB file.
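A small sketch of this approach (file and parameter names are made up; zgrep decompresses on the fly and forwards its options to grep, so fixed-string matching works here too):

```shell
# Hypothetical miniature log standing in for a 200 MB file.
printf '%s\n' 'ts param1=1' 'ts param2=2' 'ts param3=3' > log.txt
gzip -f log.txt                       # produces log.txt.gz

# Search the compressed file directly; -F makes each -e pattern
# a fixed string, avoiding the regex engine.
zgrep -F -e param1 -e param3 log.txt.gz
```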

Answered 2013-07-05T17:39:33.960