bash - Get common lines from multiple files, matching only on specific fields

Question

I am trying to understand the following code, which uses BASH to extract the lines that overlap across multiple files.

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) 
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }' file[a-d]

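Once I understand what each sub-block of the code does, I would like to extend it to find overlaps on specific fields rather than on the whole line. For example, I tried changing the line: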

n = split(rec[R], t, "/")

to

n = split(rec[R$1], t, "/")

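to find the lines whose first field is the same across all files, but this does not work. Eventually I would like to extend this to check whether a line has the same fields 1, 2, and 4, and then print that line.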

Specifically, for the files mentioned in the example from the link: if file 1 is:

chr1    31237964    NP_055491.1    PUM1    M340L
chr1    33251518    NP_037543.1    AK2    H191D

and file 2 is:

chr1    116944164    NP_001533.2    IGSF3    R671W
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S

I would like to get this output:

file1/file2 --> chr1    33251518    AK2    H191D

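I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent in terms of the files themselves. This is not clear from the comments provided, and the printf statements I added inside the sub-loops fail.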

Thank you very much for any insight on this!

score 2 · Accepted Answer

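The script works by building an auxiliary array whose indices are the lines of the input files (denoted by $0 in rec[$0]), and whose values are of the form filename1/filename3/..., listing the filenames in which the given line $0 is present. You can hack it to work with only $1, $2 and $4 like so: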

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) {
        split(R,R1R2R4,SUBSEP)
        dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
          sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
      }
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the partial content of the current record
  # (special concatenation of $1, $2 and $4)
  # concatenating the filenames separated by / as values
  rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
  }' file[a-d]

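This solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special awk syntax concatenates the indices with the SUBSEP character, which is non-printable by default ("\034" to be precise), so it is unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script and finish with the END block.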
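The equivalence between the comma syntax and an explicit SUBSEP join can be checked with a small sketch. The input line is taken from the question; the messages and the variable name key are illustrative only.

```shell
# Sketch: a comma-separated subscript and an explicit SUBSEP join
# refer to the same array element.
out=$(printf 'chr1 33251518 NP_037543.1 AK2 H191D\n' | awk '{
  rec[$1, $2, $4] = "seen"

  # Spell the key out by hand (key is an illustrative name)
  key = $1 SUBSEP $2 SUBSEP $4
  if (key in rec) print "same key"

  # (i, j, k) in arr is the membership test for such subscripts
  if (($1, $2, $4) in rec) print "found via tuple"
}')
printf '%s\n' "$out"
```

Both tests succeed, so either spelling can be used to look up an entry that was stored with the other.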

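The first part of the code has to be changed as well: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is fine for indexing, but you need to split R at the SUBSEP characters to recover the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we are doing sprintf ...%s,...,$1,$2,$4 with the previously saved fields $1, $2, $4. For your input example this will print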
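As a minimal sketch of that splitting step in isolation, the following builds a key by hand and splits it back into its parts. The array name parts and the literal field values are chosen for illustration.

```shell
# Sketch: recover the individual fields from a SUBSEP-joined key.
out=$(awk 'BEGIN {
  # Build a key the same way rec[$1,$2,$4] would
  key = "chr1" SUBSEP "33251518" SUBSEP "AK2"

  # split() returns the number of pieces and fills parts[1..n]
  n = split(key, parts, SUBSEP)
  printf "%d fields: %s\t%s\t%s\n", n, parts[1], parts[2], parts[3]
}')
printf '%s\n' "$out"
```

The return value of split is 3 here, and parts[1], parts[2], parts[3] hold the original field values in order.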

records found in 2 files:

    foo11.inp1/foo11.inp2 -->   chr1    33251518    AK2

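Note that the output lacks H191D, and rightly so: it is not in field 1, 2 or 4 (it is in field 5), so there is no guarantee that it is the same in the files that are printed! You probably do not want to print it, or at any rate you have to specify how to treat the columns that are not checked between files (and so may differ).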
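For reference, here is a sketch of the same logic with the per-line block placed first and the END block last, run against the two sample files from the question. The temporary directory, the file names file1 and file2, and the helper names f and line are assumptions made for this illustration.

```shell
# Sketch: per-line block first, END block last, on the question's sample data.
tmp=$(mktemp -d)
printf '%s\n' \
  'chr1    31237964    NP_055491.1    PUM1    M340L' \
  'chr1    33251518    NP_037543.1    AK2    H191D' > "$tmp/file1"
printf '%s\n' \
  'chr1    116944164    NP_001533.2    IGSF3    R671W' \
  'chr1    33251518    NP_001616.1    AK2    H191D' \
  'chr1    57027345    NP_001004303.2    C1orf168    P270S' > "$tmp/file2"

out=$(cd "$tmp" && awk '
{
  # key on fields 1, 2 and 4; value is the slash-separated file list
  rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}
END {
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) {
      split(R, f, SUBSEP)   # f[1..3] are the saved $1, $2, $4
      line = sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], f[1], f[2], f[3])
      dup[n] = dup[n] ? dup[n] RS line : line
    }
  }
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
  }
}' file1 file2)
rm -rf "$tmp"
printf '%s\n' "$out"
```

Only the 33251518/AK2 record is reported, since it is the only combination of fields 1, 2 and 4 that appears in both files.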

A bit of explanation of the original code:

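rec is an array whose indices are full lines of input, and whose values are the slash-separated lists of files in which those lines appear. For instance, if file1 contains a line "foo bar", then rec["foo bar"]=="file1" initially. If file2 then also contains this line, then rec["foo bar"]=="file1/file2". Note that there is no check for multiplicity, so if file1 contains this line twice, you end up with rec["foo bar"]=="file1/file1/file2" and the number of files containing the line is reported as 3.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually take the value of each unique line of every input file, allowing us to look at rec[R], which contains the names of the files in which that specific line R was present.
n is the return value of split, which splits the value of rec[R] (that is, the list of filenames corresponding to line R) at each slash. The array t ends up filled with the list of files, but we do not make use of it; we only use the length of t, i.e. the number of files in which line R is present (saved in the variable n). If n==1 we do nothing; we only act when there are multiplicities.
The loop over R groups lines into classes according to their multiplicity n. n==2 applies to lines that are present in exactly 2 files, n==3 to those that appear three times, and so on. The loop builds an array dup which, for every multiplicity class (i.e. for every n), collects the output strings "filename1/filename2/... --> R", one for each R that appears n times in total in the files, separated by RS (the record separator). So in the end dup[n] for a given n contains a number of strings of the form "filename1/filename2/... --> R", concatenated with the RS character (a newline by default).
The loop over D in dup then goes through the multiplicity classes (i.e. the valid values of n that are larger than 1) and prints dup[D] for each D. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there are none, dup is empty and the loop over D does nothing).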
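The missing multiplicity check mentioned above can be added in one possible way, sketched below: a second array remembers which (line, file) pairs have already been recorded. The array name seen and the two tiny test files are assumptions for illustration and are not part of the original script.

```shell
# Sketch: count each file at most once per line.
tmp=$(mktemp -d)
printf 'foo bar\nfoo bar\n' > "$tmp/file1"   # the line occurs twice here
printf 'foo bar\n' > "$tmp/file2"

out=$(cd "$tmp" && awk '
  # seen[] is a hypothetical helper array: the pattern is true only
  # the first time a given line is met in a given file
  !seen[$0, FILENAME]++ {
    rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }
  END {
    for (R in rec) {
      n = split(rec[R], t, "/")
      printf "%d\t%s\t%s\n", n, rec[R], R
    }
  }' file1 file2)
rm -rf "$tmp"
printf '%s\n' "$out"
```

With this guard the line "foo bar" is reported for 2 files (file1/file2) rather than 3, even though it occurs twice in file1.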

score 1 · Accepted Answer

First, you need to understand the 3 kinds of block in an AWK script:

BEGIN{
# Code that is executed once, before data processing starts
}

{
# block without a name (default/main block)
# executed per line of input
# $0 contains the whole line (all columns)
# $1 first column
# $2 second column, and so on..
}

END{
# Code that is executed once, after all data processing has finished
}
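A minimal sketch of the three blocks at work on two lines of made-up input (the input and the variable name count are illustrative):

```shell
# Sketch: BEGIN runs once, the main block runs per line, END runs once.
out=$(printf 'a b\nc d\n' | awk '
BEGIN { print "start" }                               # before any input
      { count++; print NR ": first field is " $1 }    # once per input line
END   { print "lines read: " count }                  # after all input
')
printf '%s\n' "$out"
```

This prints "start", then one line per input record, then "lines read: 2".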

So you probably need to edit this part of the script:

  {  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }
