I am trying to understand the following code, which uses awk from Bash to extract the lines that several files have in common.
awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1)
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
  }
  # loop over the dup array
  # and report the number and the names of the files
  # containing the record
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
  }
}
{
  # build an array named rec (short for record), indexed by
  # the content of the current record ($0), concatenating
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}' file[a-d]
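Once I understand what each sub-block of the code is doing, I would like to extend it to find overlaps on specific fields rather than on entire lines. For example, I tried changing the line: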
n = split(rec[R], t, "/")
to
n = split(rec[R$1], t, "/")
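to find the lines whose first field is the same across all files, but this does not work. Eventually I would like to extend it to check whether lines share the same fields 1, 2, and 4, and then print the line.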
Specifically, for the files mentioned in the example in the link: if file 1 is:
chr1 31237964 NP_055491.1 PUM1 M340L
chr1 33251518 NP_037543.1 AK2 H191D
and file 2 is:
chr1 116944164 NP_001533.2 IGSF3 R671W
chr1 33251518 NP_001616.1 AK2 H191D
chr1 57027345 NP_001004303.2 C1orf168 P270S
I would like to get the output:
file1/file2 --> chr1 33251518 AK2 H191D
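I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent in terms of the files themselves. This is not clear from the comments provided, and the printf statements I added inside the sub-loops fail.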
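One way to see what these names hold is to print them while looping, as in the tracing sketch below. It reuses the `rec` assignment and the `split` call from the code above unchanged. The function name `trace_rec`, the sample files `fileA` and `fileB`, and their contents are assumptions made up for the illustration. Reading the original code, `R` is each distinct line of input (an index of `rec`), `rec[R]` is the slash-separated list of files that contained that line, and `n` is the number of names in that list. `dup` is then indexed by `n`, and `D` walks over those counts.

```shell
# Hypothetical helper: show R, rec[R] and n for every distinct input line.
trace_rec() {
  awk '
    # Same bookkeeping as the original main block.
    { rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME }
    END {
      for (R in rec) {
        # n is how many file names are stored for this line.
        n = split(rec[R], t, "/")
        printf "R=[%s] rec[R]=[%s] n=%d\n", R, rec[R], n
      }
    }
  ' "$@"
}

# Two tiny made-up files: "pear" appears in both, "apple" only in fileA.
tracedir=$(mktemp -d)
printf '%s\n' apple pear > "$tracedir/fileA"
printf '%s\n' pear > "$tracedir/fileB"

# Sort the trace because awk's for-in order is unspecified.
trace=$(cd "$tracedir" && trace_rec fileA fileB | sort)
printf '%s\n' "$trace"
```

For these two files the trace shows `R=[apple] rec[R]=[fileA] n=1` and `R=[pear] rec[R]=[fileA/fileB] n=2`, so only the second line would pass the `n > 1` test and end up in `dup[2]`.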
Thank you very much for any insight on this!