awk - matching entries from multiple input files

Question

My FileA contains:

 LetterA     LetterM  12
 LetterB     LetterC  45
 LetterB     LetterG  23

FileB contains:

 LetterA    23   43    LetterZ
 LetterB    21   71    LetterC

I want to write the original FileA entry followed by $2-$3 of the FileB entry whenever
FileA $1 == FileB $1 && FileA $2 == FileB $4.
For the data above, the output would be:

 LetterB     LetterC  45   -50

I can do this with a bash loop:

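 while read -r ENTRY
 do
    # Split the current FileA line on whitespace to get its first two fields
    COLUMN1=$(echo "$ENTRY" | awk '{print $1}')
    COLUMN2=$(echo "$ENTRY" | awk '{print $2}')
    awk -v COLUMN1="$COLUMN1" -v COLUMN2="$COLUMN2" -v ENTRY="$ENTRY" \
         '($1==COLUMN1 && $4==COLUMN2) {print ENTRY,$2-$3}' FileB
 done < FileA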

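However, this loop is too slow. Is there a way to do this with awk alone, without the shell loop?
That is: take multiple input files -> match their contents -> print the desired output.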

score 3 · Accepted Answer

This can be solved with an awk one-liner:

awk 'NR==FNR{a[$1":"$2]=$0; next}
     NR>FNR && $1":"$4 in a{print a[$1":"$4], $2-$3}' fileA fileB

Or, more concisely (thanks to @JS웃):

awk 'NR==FNR{a[$1$2]=$0;next}$1$4 in a{print a[$1$4],$2-$3}' file{A,B}
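
For readers new to the two-file idiom, here is a commented sketch of the same join, run against the sample data from the question. It differs from the one-liners above in one respect, which is my own choice and not part of the original answer: the key is written as `a[$1, $2]`, so awk joins the fields with its built-in SUBSEP and pairs such as "foo" "bar" and "fo" "obar" cannot collide. The scratch directory and the `result` variable are placeholders for illustration.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory (placeholder paths).
tmpdir=$(mktemp -d)
cat > "$tmpdir/fileA" <<'EOF'
LetterA     LetterM  12
LetterB     LetterC  45
LetterB     LetterG  23
EOF
cat > "$tmpdir/fileB" <<'EOF'
LetterA    23   43    LetterZ
LetterB    21   71    LetterC
EOF

result=$(awk '
    # NR == FNR is true only while the first file is being read:
    # remember each fileA line under the key ($1, $2), then skip to the next line.
    NR == FNR { a[$1, $2] = $0; next }
    # Second file: if ($1, $4) was seen in fileA, print the saved line plus $2 - $3.
    ($1, $4) in a { print a[$1, $4], $2 - $3 }
' "$tmpdir/fileA" "$tmpdir/fileB")

echo "$result"
rm -rf "$tmpdir"
```

Each file is read exactly once, which is why this should scale much better than starting a new awk process for every line of FileA. Note that the saved line keeps its original spacing, and the difference is appended after a single space (the default OFS).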

score 1 · Accepted Answer

I decided to try Python and NumPy for a slightly unorthodox but hopefully fast solution:

import numpy as np

# load the files into arrays with automatically determined types per column
# (encoding=None makes the text columns str rather than bytes on Python 3)
a = np.genfromtxt("fileA", dtype=None, encoding=None)
b = np.genfromtxt("fileB", dtype=None, encoding=None)

# concatenate the string columns (n.b. assumes no "foo" "bar" and "fo" "obar")
aText = np.char.add(a['f0'], a['f1'])
bText = np.char.add(b['f0'], b['f3'])

# find the locations where the strings from A match in B, and print the values
# (np.where returns a tuple, so take element 0 to loop over each matching index)
for index in np.where(np.isin(aText, bText))[0]:
    aRow = a[index]
    bRow = b[bText == aText[index]][0]
    print('{1} {2} {3} {0}'.format(bRow[1] - bRow[2], *aRow))

Edit: once it gets going it is fast, but unfortunately loading the files takes longer than @anubhava's excellent awk solution.
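
To check how the awk approach behaves on larger inputs, a rough timing harness along these lines could be used. Everything in it is an assumption made for illustration: the synthetic `Key`/`Tag` data, the size `n`, the scratch directory, and the whole-second resolution of `date +%s`. It reports no numbers of its own; results will depend on the machine and the awk implementation.

```shell
#!/bin/sh
# Hypothetical benchmark: synthetic data, names and sizes are illustrative only.
n=100000
tmpdir=$(mktemp -d)

# fileA has one line per i: "Key<i> Tag<i> <i>".
awk -v n="$n" 'BEGIN { for (i = 1; i <= n; i++) print "Key" i, "Tag" i, i }' > "$tmpdir/fileA"
# fileB has a matching line for every even i: "Key<i> <i> <2i> Tag<i>".
awk -v n="$n" 'BEGIN { for (i = 2; i <= n; i += 2) print "Key" i, i, 2 * i, "Tag" i }' > "$tmpdir/fileB"

start=$(date +%s)
# The one-liner from the accepted answer, unchanged.
awk 'NR==FNR{a[$1":"$2]=$0; next}
     NR>FNR && $1":"$4 in a{print a[$1":"$4], $2-$3}' "$tmpdir/fileA" "$tmpdir/fileB" > "$tmpdir/out"
end=$(date +%s)

# Count the joined lines; tr strips the padding some wc implementations add.
matches=$(wc -l < "$tmpdir/out" | tr -d ' ')
echo "matched lines: $matches"
echo "elapsed seconds (whole-second resolution): $((end - start))"
rm -rf "$tmpdir"
```

Swapping the awk command for the bash loop or the NumPy script, with the same generated files, gives a like-for-like comparison.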
