linux - How do I run grep in parallel?

Question

I usually search with grep -rIn pattern_str big_source_code_dir. But grep isn't parallel; how can I make it run in parallel? My system has 4 cores, so it would be faster if grep could use all of them.

score 12 · Accepted Answer

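If the directory you are searching is stored on an HDD, you won't see any speedup. A hard disk drive is pretty much a single-threaded access unit.

But if you really want to run grep in parallel, this site gives two hints on how to do it with find and xargs. For example: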

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
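To line this up with the grep -rIn call from the question, here is a minimal sketch. The function name par_grep and the demo file names are made up for illustration, and the batch size of 40 is simply carried over from the command above.

```shell
# Hypothetical helper: par_grep PATTERN DIR [JOBS]
# Roughly mirrors `grep -rIn PATTERN DIR`, but spreads the files over
# JOBS grep processes.
par_grep() {
  # -print0 / -0 keep file names with spaces or quotes intact.
  # -H forces the file name prefix even if a batch holds a single file.
  # -I skips binary files, -n prints line numbers,
  # -e protects patterns that start with "-".
  find "$2" -type f -print0 |
    xargs -0 -P "${3:-4}" -n 40 grep -I -n -H -e "$1"
}

# Demo on a throwaway directory.
demo_dir=$(mktemp -d)
printf 'int foobar;\n' > "$demo_dir/a.c"
printf 'nothing here\nfoobar again\n' > "$demo_dir/b with space.c"
par_grep foobar "$demo_dir" 2 | sort
rm -rf "$demo_dir"
```

The output order depends on which grep finishes first, hence the sort in the demo.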

score 5 · Accepted Answer

The GNU parallel command is very useful for this.

sudo apt-get install parallel # if not available on debian based systems

Then the parallel man page provides an example:

EXAMPLE: Parallel grep
       grep -r greps recursively through directories. 
       On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

       This will run 1.5 job per core, and give 1000 arguments to grep.

In your case, it would be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}

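Finally, the GNU parallel man page also has a section describing the differences between xargs and parallel, which should help explain why parallel looks like the better fit in your case: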

DIFFERENCES BETWEEN xargs AND GNU Parallel
       xargs offer some of the same possibilities as GNU parallel.

       xargs deals badly with special characters (such as space, ' and "). To see the problem try this:

         touch important_file
         touch 'not important_file'
         ls not* | xargs rm
         mkdir -p "My brother's 12\" records"
         ls | xargs rmdir

       You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
       locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).

       So GNU parallel's newline separation can be emulated with:

       cat | xargs -d "\n" -n1 command

       xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.

       xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
       done reliably with xargs because of this.
       ...
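If only xargs is available, one possible workaround for the output-mixing problem described above is to let each batch write to its own file and concatenate the files afterwards. This is a sketch under that assumption, not something the man page suggests; grouped_grep and the GG_* variable names are invented here.

```shell
# Hypothetical helper: grouped_grep PATTERN DIR [JOBS]
grouped_grep() {
  gg_out=$(mktemp -d) || return 1
  # Each batch appends to a file named after its own shell PID, so two
  # greps running at the same time never write to the same stream.
  find "$2" -type f -print0 |
    GG_PATTERN="$1" GG_OUT="$gg_out" xargs -0 -P "${3:-4}" -n 40 \
      sh -c 'grep -I -n -H -e "$GG_PATTERN" "$@" >> "$GG_OUT/$$.out"' sh
  # Nothing is printed until all batches are done; a missing *.out
  # (no files found at all) is silently ignored.
  cat "$gg_out"/*.out 2>/dev/null
  rm -rf "$gg_out"
}
```

Unlike parallel -k, this keeps each batch's lines together but does not preserve the input order.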

score 3 · Accepted Answer

Note that you need to escape special characters in the search term when running grep through parallel, for example:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml

With standalone grep, grep -F 'PostTypeId="1"' works without escaping the double quotes. It took me a while to figure that out!
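The likely reason is that the command is parsed by a shell a second time before grep sees it. The sketch below imitates that with plain sh -c; run_via_shell is a made-up stand-in, and that parallel behaves the same way is an assumption here.

```shell
# Stand-in for any tool that joins its arguments into one string and hands
# that string to a shell (an assumption about how parallel runs commands).
run_via_shell() {
  sh -c "$*"
}

# Unescaped: the inner shell treats the double quotes as quoting and drops them.
run_via_shell echo 'PostTypeId="1"'      # prints PostTypeId=1

# Escaped: the backslashes are consumed by the inner shell, the quotes survive.
run_via_shell echo 'PostTypeId=\"1\"'    # prints PostTypeId="1"
```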

Also note the use of LC_ALL=C and the -F flag (if you are only searching for a fixed string) for an extra speedup.
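For reference, a small sketch of what the two settings do, without parallel. The helper name count_literal and the sample rows are made up for illustration.

```shell
# Hypothetical helper: count the lines of file $2 that contain the
# literal string $1.
count_literal() {
  # -F: treat the pattern as a fixed string, not a regular expression.
  # LC_ALL=C: match bytes instead of locale-aware characters.
  LC_ALL=C grep -c -F -e "$1" "$2"
}

sample=$(mktemp)
printf '%s\n' '<row Id="1" PostTypeId="1" />' \
              '<row Id="2" PostTypeId="2" />' \
              '<row Id="3" PostTypeId="11" />' > "$sample"
count_literal 'PostTypeId="1"' "$sample"   # prints 1
rm -f "$sample"
```

How much time this saves depends on the grep version and the data, so it is worth measuring on your own files.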

score 0 · Accepted Answer

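Here are 3 ways to do it, but you can't get line numbers with two of them.

(1) Run grep in parallel on multiple files, in this case all files in a directory and its subdirectories. Adding /dev/null forces grep to prefix each matching line with the file name, since you will want to know which file matched. Adjust -P to the number of processes for your machine.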

find . -type f | xargs -n 1 -P 4 grep -n <grep-args> /dev/null
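A small sketch of the /dev/null trick from (1). The helper name per_file_grep is made up, and -print0 / -0 are added here (not part of the command above) so that file names with spaces survive.

```shell
# Hypothetical helper: per_file_grep PATTERN DIR [JOBS]
per_file_grep() {
  # -n 1 starts one grep per file; the extra /dev/null gives every grep
  # a second (empty) input, so it prints the file name before each match.
  find "$2" -type f -print0 |
    xargs -0 -n 1 -P "${3:-4}" grep -n -e "$1" /dev/null
}

demo_dir=$(mktemp -d)
printf 'first\nneedle\n' > "$demo_dir/one.txt"
grep -n needle "$demo_dir/one.txt"   # single file on its own: prints 2:needle
per_file_grep needle "$demo_dir"     # prints <demo_dir>/one.txt:2:needle
rm -rf "$demo_dir"
```

GNU and BSD grep also accept -H for the same effect.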

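(2) Run grep serially over multiple files, but process each file in parallel in 10M blocks. Adjust the block size for your machine and files. Here are two ways to do it.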

# for-loop
for filename in `find . -type f`
do 
  parallel --pipepart --block 10M -a $filename -k "grep <grep-args> | awk -v OFS=: '{print \"$filename\",\$0}'"
done

# using xargs
find . -type f | xargs -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

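(3) Combine (1) and (2): run grep in parallel on multiple files and also process their contents in parallel in blocks. Adjust the block size and xargs parallelism for your machine.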

find . -type f | xargs -n 1 -P 4 -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

Note that (3) may not be the best use of resources.

I have a longer write-up, but this is the basic idea.
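On the missing line numbers, presumably for (2) and (3): a grep that only sees one block of a file can only count lines from the start of that block. The sketch below imitates two blocks with head and tail instead of parallel; chunked_grep_demo and the sample file are made up for illustration.

```shell
# Hypothetical demo: grep the two halves of a 4-line file separately, the
# way a chunked run would, and compare with grepping the whole file.
chunked_grep_demo() {
  head -n 2 "$1" | grep -n needle    # chunk 1: lines 1-2 of the file
  tail -n +3 "$1" | grep -n needle   # chunk 2: lines 3-4 of the file
}

sample=$(mktemp)
printf '%s\n' 'a' 'needle one' 'b' 'needle two' > "$sample"
grep -n needle "$sample"      # whole file: 2:needle one and 4:needle two
chunked_grep_demo "$sample"   # per chunk:  2:needle one and 2:needle two
rm -f "$sample"
```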
