I have a large number of large log files (each around 200MB, about 200GB of data in total).
Every 10 minutes, the server writes about 10K parameters (each with a timestamp) to the log file. Out of those 10K parameters, I want to extract 100 into a new file.
First I used grep with a single parameter; setting LC_ALL=C made it a little faster, and fgrep was slightly faster still. Then I used parallel
parallel -j 2 --pipe --block 20M
and finally I was able to extract 1 parameter from every 200MB in about 5 seconds.
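To make the single-parameter pass concrete, here is a minimal, self-contained sketch of that step (the file name and parameter names are made up for illustration; the real pipeline wraps this in parallel):

```shell
# Create a tiny stand-in for one log file (illustrative data).
printf '2024-01-01T00:00 param1=3\n2024-01-01T00:00 param2=7\n2024-01-01T00:10 param1=4\n' > sample.log

# -F treats the pattern as a fixed string (same as fgrep);
# LC_ALL=C skips locale-aware matching, which speeds grep up.
LC_ALL=C grep -F 'param1' sample.log > extracted.txt
```

extracted.txt then holds only the lines mentioning param1.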
BUT.. when I combine multiple patterns in one grep
parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt
then the grep time increased linearly with the number of patterns (it now takes several minutes per file). (Note that I had to use egrep for the alternation; somehow plain grep didn't accept it.)
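The reason plain grep "didn't like" the patterns is that in basic regular expressions the unescaped | is a literal character; egrep (i.e. grep -E) enables alternation. A small sketch with made-up data:

```shell
# Illustrative input file and parameter names.
printf 'a param1=1\nb param2=2\nc param3=3\n' > log.txt

# Extended regex: | means "or", so two lines match.
grep -E 'param1|param2' log.txt > out.txt

# Basic regex: | is literal, so nothing matches
# (grep exits nonzero on no match, hence || true).
grep 'param1|param2' log.txt > miss.txt || true
```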
Is there a faster/better way to solve this problem?
Note that I don't need regex, because the patterns I am looking for are fixed strings. I just want to extract the lines that include a particular string.
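Since the patterns are fixed strings, I assume something like grep's -F (fixed strings) combined with -f (read patterns from a file) is the kind of approach I'm after; a minimal sketch with illustrative names:

```shell
# One fixed string per line; these names are made up for illustration.
printf 'param1\nparam2\n' > patterns.txt
printf 'ts1 param1=3\nts1 param9=0\nts2 param2=7\n' > log.txt

# -F: no regex engine, plain substring matching;
# -f: take the list of patterns from patterns.txt.
LC_ALL=C grep -F -f patterns.txt log.txt > out.txt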