regex - How do I find lines containing integers given in a file?

Question

I have a file, dict, containing one integer per line:

123
456

I want to find the lines in another file, file, that contain exactly one of the integers in dict.

If I use

$ grep -w -f dict file

I get false matches, such as

12345  foo
23456  bar

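These are wrong because 12345 != 123 and 23456 != 456. The problem is that the -w option also treats digits as word characters. The -x option does not work either, because the lines in file can contain other text. What is the best way to do this? It would be great if the solution offered progress monitoring and good performance for a large dict and a large file.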

score 2 · Accepted Answer

Add word boundaries to the entries in dict, like this:

\<123\>
\<456\>

The -w option is then unnecessary. All you need is:

grep -f dict file
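For comparison, here is a rough Python sketch of the same word-boundary idea. It is only an illustration and not part of the answer above: build_pattern and the sample data are made up, and Python's \b stands in for grep's \< and \>.

```python
import re

def build_pattern(numbers):
    """Compile one regex that matches any of the numbers as a whole word."""
    # re.escape is a precaution; \b plays the role of grep's \< and \>.
    alternatives = "|".join(re.escape(n) for n in numbers)
    return re.compile(r"\b(?:" + alternatives + r")\b")

# Hypothetical sample data mirroring the question.
dict_numbers = ["123", "456"]
lines = ["12345  foo", "23456  bar", "123  baz", "qux 456"]

pattern = build_pattern(dict_numbers)
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Note that \b treats letters, digits and underscores as word characters, so a token like abc123 would not match 123 either. With a very large dict, one huge alternation may be slow, in which case a set lookup per field is probably the safer choice.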

score 1 · Accepted Answer

You can do this quite easily with a Python script, for example:

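import sys

# Load the integers from the dict file (first argument) into a set.
with open(sys.argv[1]) as dict_file:
    numbers = set(dict_file.read().split())

# Print every line of the data file (second argument) whose first field is in the set.
with open(sys.argv[2]) as inf:
    for s in inf:
        fields = s.split()
        if fields and fields[0] in numbers:  # skip blank lines
            sys.stdout.write(s)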

Error checking and recovery are left as an exercise for the reader.
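Since the question also asks about progress monitoring, here is one possible way to add it to a script like this. This is a sketch under assumptions: filter_lines, report_every and the in-memory sample data are hypothetical names chosen for illustration, and progress goes to stderr so it does not mix with the matches on stdout.

```python
import io
import sys

def filter_lines(numbers, lines, report_every=100000, progress=sys.stderr):
    """Yield lines whose first field is in numbers, reporting progress as it goes."""
    count = 0
    for count, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and fields[0] in numbers:
            yield line
        if count % report_every == 0:
            progress.write(f"{count} lines processed\n")
    progress.write(f"done: {count} lines processed\n")

# Hypothetical in-memory stand-ins for dict and file.
numbers = {"123", "456"}
data = io.StringIO("12345  foo\n123  baz\n\n456  qux\n")
log = io.StringIO()
result = list(filter_lines(numbers, data, report_every=2, progress=log))
print(result)
print(log.getvalue(), end="")
```

Because membership tests on a set take constant time on average, the size of dict should have little effect on the per-line cost; the file is read one line at a time, so it never has to fit in memory.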

score 1 · Accepted Answer

A fairly general approach, using awk:

awk 'FNR==NR { array[$1]++; next } { for (i=1; i<=NF; i++) if ($i in array) print $0 }' dict file

Explanation:

FNR==NR { }  ## FNR is number of records relative to the current input file. 
             ## NR is the total number of records.
             ## So this statement simply means `while we're reading the 1st file
             ## called dict; do ...`

array[$1]++; ## Add the first column ($1) to an array called `array`.
             ## I could use $0 (the whole line) here, but since you have said
             ## that there will only be one integer per line, I decided to use
             ## $1 (it strips leading and trailing whitespace, if any)

next         ## process the next line in `dict`

for (i=1; i<=NF; i++)  ## loop through each column in `file`

if ($i in array)       ## if one of these columns can be found in the array

print $0               ## print the whole line out
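For readers less familiar with awk, the two phases explained above can be sketched in Python roughly as follows. The names load_numbers and matching_lines and the sample data are hypothetical. One difference to be aware of: the awk one-liner prints a line once per matching field, while this sketch returns each matching line only once.

```python
def load_numbers(dict_lines):
    """Collect the first field of every non-empty dict line (awk: array[$1]++)."""
    numbers = set()
    for line in dict_lines:
        fields = line.split()
        if fields:
            numbers.add(fields[0])
    return numbers

def matching_lines(numbers, lines):
    """Return lines in which at least one whitespace-separated field is in numbers."""
    # any() stops at the first hit, so a line is kept once even if several fields match.
    return [line for line in lines if any(field in numbers for field in line.split())]

# Hypothetical sample data.
numbers = load_numbers(["123", " 456 ", ""])
sample = ["12345  foo", "bar 123", "456 123 baz", "23456  bar"]
print(matching_lines(numbers, sample))
```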

Processing multiple files with a bash loop:

## This will process files; like file, file1, file2, file3 ...
## And create output files like, file.out, file1.out, file2.out, file3.out ...

for j in file*; do awk -v FILE="$j.out" 'FNR==NR { array[$1]++; next } { for (i=1; i<=NF; i++) if ($i in array) print $0 > FILE }' dict "$j"; done

If you are interested in using tee across multiple files, you might want to try something like this:

for j in file*; do awk -v FILE="$j.out" 'FNR==NR { array[$1]++; next } { for (i=1; i<=NF; i++) if ($i in array) { print $0 > FILE; print FILENAME, $0 } }' dict "$j"; done 2>&1 | tee output

This will show you the name of the file being processed and the matching records found, and write a "log" to a file called output.
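If you would rather keep the whole thing in one script, the loop above could be approximated in Python along these lines. Treat it as a sketch: process_files and the temporary demo files are hypothetical, and skipping names that end in .out is an added assumption so that a rerun does not pick up earlier results.

```python
import glob
import os
import tempfile

def process_files(numbers, pattern, log_path):
    """Write matches of each input file to <name>.out and log them with the file name."""
    # Skip earlier .out results so a rerun does not treat them as input.
    paths = sorted(p for p in glob.glob(pattern) if not p.endswith(".out"))
    with open(log_path, "w") as log:
        for path in paths:
            with open(path) as src, open(path + ".out", "w") as out:
                for line in src:
                    if any(field in numbers for field in line.split()):
                        out.write(line)
                        log.write(f"{path} {line}")
    return paths

# Hypothetical demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("file1", "123 foo\n12345 bar\n"), ("file2", "baz 456\n")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    processed = process_files({"123", "456"}, os.path.join(tmp, "file*"),
                              os.path.join(tmp, "output"))
    with open(os.path.join(tmp, "file1.out")) as f:
        file1_out = f.read()
    with open(os.path.join(tmp, "output")) as f:
        log_lines = f.read().splitlines()

print(file1_out, end="")
print(len(log_lines))
```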
