linux - grep a large list against a large file

Question

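I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3,000,000 lines).

I want all the csv lines that contain an id from the id file.

My naive approach was: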

cat the_ids.txt | while read line
do
  cat huge.csv | grep $line >> output_file
done

But this takes forever!

Is there a more efficient way to solve this problem?

score 50 · Accepted Answer

Try

grep -f the_ids.txt huge.csv

Also, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
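
A minimal sketch combining both options. The sample file contents and the scratch directory are made up for illustration; only the file names come from the question.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data standing in for the real files.
printf '%s\n' 1001 1002 > the_ids.txt
printf '%s\n' '1001,alice' '2000,bob' '1002,carol' > huge.csv

# -F treats every line of the_ids.txt as a literal string, and
# -f reads all patterns at once, so huge.csv is scanned a single time
# instead of once per id as in the loop from the question.
grep -F -f the_ids.txt huge.csv > output_file

cat output_file
```

Note that this is still a substring match: an id such as 1001 would also select a row containing 10010.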

score 23 · Accepted Answer

Use grep -f for this:

grep -f the_ids.txt huge.csv > output_file

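From man grep:

-f FILE, --file=FILE

Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)

If you provide some sample input, maybe we can improve the grep condition even further.

Test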

$ cat ids
11
23
55
$ cat huge.csv 
hello this is 11 but
nothing else here
and here 23
bye

$ grep -f ids huge.csv 
hello this is 11 but
and here 23

score 11 · Accepted Answer

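grep -f filter.txt data.txt gets unruly when filter.txt is larger than a couple of thousand lines, so it isn't the best choice for such a situation. Even when using grep -f, we need to keep a few things in mind:

use the -x option if the entire line in the second file needs to match
use -F if the first file contains strings, not patterns
use -w to prevent partial matches when not using the -x option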
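
To illustrate the last point, here is a small sketch of how -w changes the result. The file contents are invented for the example, and the variable names loose and strict are just labels.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: one id, and rows where it appears whole or embedded.
printf '%s\n' 11 > filter.txt
printf '%s\n' '11,first' '110,second' 'x11,third' > data.txt

# Without -w, the literal 11 also matches inside 110 and x11.
loose=$(grep -F -f filter.txt data.txt)

# With -w, 11 must be bounded by non-word characters
# (here the start of the line and the comma).
strict=$(grep -F -w -f filter.txt data.txt)

printf 'without -w:\n%s\n' "$loose"
printf 'with -w:\n%s\n' "$strict"
```

Only the 11,first row survives with -w, while all three rows match without it.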

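This post has a good discussion of this topic (grep -f on large files):

Fastest way to find lines of a file from another larger file in Bash

And this post talks about grep -vf:

grep -vf too slow with large files

In summary, the best way to handle grep -f on large files is:

Matching the entire line: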

awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt

Matching a particular field in the second file (using the ',' delimiter and field 2 in this example):

awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt

And for grep -vf:

Matching the entire line:

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

Matching a particular field in the second file (using the ',' delimiter and field 2 in this example):

awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
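
A worked sketch of the field-based variants on a tiny data set. The file contents are invented for illustration; the file names follow the commands above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: ids to look for, and rows with the id in field 2.
printf '%s\n' 11 23 > filter.txt
printf '%s\n' 'a,11,x' 'b,110,y' 'c,23,z' 'd,55,w' > data.txt

# FNR==NR holds only while awk reads the first file, so each id is stored
# as an array key. For the second file, a row is printed when its field 2
# is one of the stored keys.
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt

# The same lookup negated, the counterpart of grep -vf.
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt

cat matching.txt
cat not_matching.txt
```

Because the lookup compares the whole field exactly, the row with 110 lands in not_matching.txt, whereas a plain substring grep for 11 would have matched it. One caveat: if filter.txt is empty, FNR==NR stays true while reading data.txt, so the commands should be guarded in that case.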

score 0 · Accepted Answer

Using ugrep can significantly speed up the search for the strings in the_ids.txt within the large huge.csv file:

ugrep -F -f the_ids.txt huge.csv

This works with GNU grep too, but I expect ugrep to run several times faster.
