search - Recursive search for multiple strings with multiple occurrences

Question

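This question is a follow-up to my previous one. My search requirements are as follows.

The strings to search for are stored in a file, values.txt (the input file), which contains, for example: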

string1  1
string2  3
string3  5

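Here the first column (string1, string2, string3) is the string to search for, and the second column is the number of occurrences to report.
In addition, the search must recurse through files with a specific extension (for example .out, .txt, etc.).
The search output should be directed to a file, where the results are printed together with the name and path of each searched file.

For example, typical output should look like this (for a recursive search of files with the .out extension):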

<path_of_searched_file1/fileName1.out>
The full line containing the <first> instance of <string1>
The full line containing the <first> instance of <string2>
The full line containing the <second> instance of <string2>
The full line containing the <third> instance of <string2>
The full line containing the <first> instance of <string3>
The full line containing the <second> instance of <string3>
The full line containing the <third> instance of <string3>
The full line containing the <fourth> instance of <string3>
The full line containing the <fifth> instance of <string3>


<path_of_searched_file2/fileName2.out>
The full line containing the <first> instance of <string1>
The full line containing the <first> instance of <string2>
The full line containing the <second> instance of <string2>
The full line containing the <third> instance of <string2>
The full line containing the <first> instance of <string3>
The full line containing the <second> instance of <string3>
The full line containing the <third> instance of <string3>
The full line containing the <fourth> instance of <string3>
The full line containing the <fifth> instance of <string3>


and so on

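Is awk the best way to solve this search problem? If so, could someone help me modify the awk code provided in the previous question to meet my current search requirements?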

score 1 · Accepted Answer

Here's one way using awk; YMMV. Run it like this:

awk -f ./script.awk values.file $(find . -type f -regex ".*\.\(txt\|doc\|etc\)$")

Contents of script.awk:

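# First file (values.file): map each search string to its requested count.
FNR==NR {
    a[$1]=$2;
    next
}

# On the first line of each searched file, reset the per-file counters.
FNR==1 {
    for (i in a) {
        b[i]=a[i]
    }
}

# Print the line while the matched string still has occurrences left.
# Note: j is applied as a regex, and output goes to "<input file>.out".
{
    for (j in b) {
        if ($0 ~ j && b[j]-- > 0) {
            print > (FILENAME ".out")
        }
    }
}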

Alternatively, here's the one-liner:

awk 'FNR==NR { a[$1]=$2; next } FNR==1 { for (i in a) b[i]=a[i] } { for (j in b) if ($0 ~ j && b[j]-- > 0) print > (FILENAME ".out") }' values.file $(find . -type f -regex ".*\.\(txt\|doc\)$")
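One caveat with the commands above: the unquoted $(find ...) is split on whitespace, so a path containing a space reaches awk as two arguments, and the \| alternation in -regex is GNU find syntax. A minimal sketch of an alternative invocation, assuming a find that supports -exec ... {} + (the demo directory and file names below are made up for illustration):

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory.
demo=$(mktemp -d)
printf 'foo  1\n' > "$demo/values.file"
printf 'foo here\nfoo again\n' > "$demo/my notes.txt"
printf 'foo ignored\n' > "$demo/skip.log"

# -name/-o replaces the GNU-only regex; -exec ... {} + hands each path
# to awk as one intact argument, even when it contains spaces.
find "$demo" -type f \( -name '*.txt' -o -name '*.doc' \) -exec awk '
    FNR == NR { a[$1] = $2; next }
    FNR == 1  { for (i in a) b[i] = a[i] }
    { for (j in b) if ($0 ~ j && b[j]-- > 0) print > (FILENAME ".out") }
' "$demo/values.file" {} +

cat "$demo/my notes.txt.out"   # prints: foo here
```

If find has to split a very long file list across several awk runs, each run still receives values.file as its first argument, so the counters are built correctly every time.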

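Explanation:

The first block builds an associative array with the first column of values.file as the key and the second column as the value. The second and third blocks process the files found by the find command. At the start of each found file, the array built in the first block is copied (awk has no simple way to do this, so perhaps Perl with the File::Find::Rule module would be a better choice?). The third block loops over every key, tests the current line against it, decrements that key's remaining count on a match, and prints the line to a file named after the input file with a ".out" extension appended.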
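The script above writes one ".out" file per input file, whereas the question asks for a single output file with each group of matches headed by the path of the searched file. A rough sketch of that variant follows. It is not the answerer's code: the names emit, want, order, hits and seen, the report.txt file and the sample data are all assumptions made for illustration, and index() is used so the search strings are matched literally rather than as regexes.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf 'foo  1\nbar  2\n' > "$demo/values.txt"
printf 'foo one\nbar one\nfoo two\nbar two\nbar three\n' > "$demo/a.out"
printf 'nothing\nbar only\n' > "$demo/sub/b.out"

find "$demo" -type f -name '*.out' -exec awk '
    # Print the buffered matches for the file just finished, then reset.
    function emit(   k, s) {
        if (file == "") return
        print file
        for (k = 1; k <= n; k++) {
            s = order[k]
            printf "%s", hits[s]
            delete hits[s]
            delete seen[s]
        }
        print ""
    }
    # values.txt: remember each string, its count, and its position.
    FNR == NR { want[$1] = $2; order[++n] = $1; next }
    # First line of each searched file: flush the previous one.
    FNR == 1  { emit(); file = FILENAME }
    {
        for (k = 1; k <= n; k++) {
            s = order[k]
            if (seen[s] + 0 < want[s] + 0 && index($0, s)) {
                hits[s] = hits[s] $0 "\n"
                seen[s]++
            }
        }
    }
    END { emit() }
' "$demo/values.txt" {} + > "$demo/report.txt"

cat "$demo/report.txt"
```

Matches are buffered per string and flushed when the next file starts, so each group follows the order of values.txt rather than the order of lines in the searched file. The order in which files appear in the report is whatever order find visits them.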
