unix - Correctly joining two files based on 2 columns in common

Question

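I have two files that I am trying to join/merge based on columns 1 and 2. They look like this. file1 (58210 lines) is much shorter than file2 (815530 lines), and I would like to find the intersection of the two files, using fields 1 and 2 as the index:

file1: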

2L      25753   33158
2L      28813   33158
2L      31003   33158
2L      31077   33161
2L      31279   33161
3L      32124   45339
3L      33256   45339
...

file2:

2L      20242   0.5     0.307692307692308
2L      22141   0.32258064516129        0.692307692307692
2L      24439   0.413793103448276       0.625
2L      24710   0.371428571428571       0.631578947368421
2L      25753   0.967741935483871       0.869565217391304
2L      28813   0.181818181818182       0.692307692307692
2L      31003   0.36    0.666666666666667
2L      31077   0.611111111111111       0.931034482758621
2L      31279   0.75    1
3L      32124   0.558823529411765       0.857142857142857
3L      33256   0.769230769230769       0.90625
...

I have been using the following commands, but I end up with a different number of lines from each:

awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file1 file2 | wc -l
awk 'FNR==NR{a[$1$2]=$3;next} {if($1$2 in a) print}' file2 file1 | wc -l

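I'm not sure why this happens. I have tried sorting before the comparison, in case either file has duplicate lines (based on columns 1 and 2), but that doesn't seem to help. (Any insight into why this happens would also be appreciated.)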

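How can I merge the files so that only the lines of file2 whose columns 1 and 2 also appear in file1 are printed, with column 3 of file1 appended, to look something like this: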

2L      25753   0.967741935483871       0.869565217391304    33158
2L      28813   0.181818181818182       0.692307692307692    33158
2L      31003   0.36    0.666666666666667    33158
2L      31077   0.611111111111111       0.931034482758621    33161
2L      31279   0.75    1    33161
3L      32124   0.558823529411765       0.857142857142857    45339
3L      33256   0.769230769230769       0.90625    45339

score 31 · Accepted Answer

awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file1
2L      5753   33158
2L      8813   33158
2L      7885   33159
2L      1279   33159
2L      5095   33158
$
$ cat file2
2L      8813    0.6    1.2
2L      5762    0.4    0.5
2L      1279    0.5    0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L      8813    0.6    1.2 33158
2L      1279    0.5    0.9 33159
$

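If that's not what you want, please clarify and post some more representative sample input/output.

A commented version of the above code, to provide the requested explanation: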

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
#    to the number of records read so far in the current file, a
#    condition which can only be true for the first file read, THEN 
NR==FNR {

   # populate array "a" such that the value indexed by the first
   # 2 fields from this record in file1 is the value of the third
   # field from the first file.
   a[$1,$2]=$3

   # Move on to the next record so we don't do any processing intended
   # for records from the second file. This is like an "else" for the
   # NR==FNR condition.
   next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
#    record exist in array a, as would occur if these same values existed
#    in file1, THEN
($1,$2) in a {

   # print the current record from file2 followed by the value from file1
   # that occurred at field 3 of the record that had the same values for
   # field 1 and field 2 in file1 as the current record from file2.
   print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT
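One caveat worth noting: because `a[$1,$2]=$3` overwrites the stored value on every assignment, only the last third field survives when file1 contains the same pair of fields 1 and 2 more than once. The following is a minimal sketch of one way to keep every such value, assuming that is the behavior you want. The function name `merge_keep_dups` and the counter array `n` are made up for illustration and are not part of the script above.

```shell
# Hypothetical helper: like the script above, but prints one output line per
# matching file1 record when a (field 1, field 2) key repeats in file1.
# Usage: merge_keep_dups file1 file2
merge_keep_dups() {
    awk '
        # First file: count occurrences of each key and store every third
        # field under (key, occurrence number).
        NR==FNR { n[$1,$2]++; a[$1,$2,n[$1,$2]] = $3; next }

        # Second file: for a key seen in the first file, print the record
        # once for each stored value.
        ($1,$2) in n {
            for (i = 1; i <= n[$1,$2]; i++)
                print $0, a[$1,$2,i]
        }
    ' "$1" "$2"
}
```

With a key that occurs twice in file1, the matching file2 line is printed twice, once with each value from file1.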

Hope that helps.

score 6 · Accepted Answer

If you want to join the files line by line, use this command:

join -o 1.2,1.3,2.4,2.5,1.4 <(cat -n file1) <(cat -n file2)

As you updated the question:

join -o 1.1,2.2,2.3,1.2 <(sed 's/[[:space:]]\+/@/' file1|sort) \
    <(sed 's/[[:space:]]\+/@/' file2|sort)|sed 's/@/\t/'

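First, replace the first separator in each line with some non-whitespace character and sort both input files. Then use join to do the actual join. Finally, filter its output to turn the non-whitespace character back into whitespace.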
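The same three steps can also be sketched without bash process substitution or GNU sed extensions such as `\+`. This is only a sketch under stated assumptions, not the exact command from this answer: it assumes `@` never occurs in the data, the function name `join_on_two_cols` is made up for illustration, and `LC_ALL=C` is forced so that `sort` and `join` agree on the collation order.

```shell
# Hypothetical helper. $1 = the 3-column file, $2 = the 4-column file.
join_on_two_cols() {
    tmpdir=$(mktemp -d) || return 1

    # Step 1: fuse columns 1 and 2 into one key by replacing the first run
    # of whitespace with "@", then sort on that key field.
    sed 's/[[:space:]][[:space:]]*/@/' "$1" | LC_ALL=C sort -k1,1 > "$tmpdir/a"
    sed 's/[[:space:]][[:space:]]*/@/' "$2" | LC_ALL=C sort -k1,1 > "$tmpdir/b"

    # Step 2: join on the fused key (field 1 of both inputs).
    # Step 3: turn the "@" back into a tab.
    LC_ALL=C join -o 1.1,2.2,2.3,1.2 "$tmpdir/a" "$tmpdir/b" | tr '@' '\t'

    rm -rf "$tmpdir"
}
```

Called as `join_on_two_cols file1 file2`, it prints the fused key split back into two columns, followed by the two value columns of file2 and the third column of file1.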

This is the output for the files in question:

xyz]$ join -o 1.1,2.2,2.3,1.2 <(sed 's/[[:space:]]\+/@/' file1|sort) \
<(sed 's/[[:space:]]\+/@/' file2|sort)|sed 's/@/\t/'

2L  25753 0.967741935483871 0.869565217391304 33158
2L  28813 0.181818181818182 0.692307692307692 33158
2L  31003 0.36 0.666666666666667 33158
2L  31077 0.611111111111111 0.931034482758621 33161
2L  31279 0.75 1 33161
3L  32124 0.558823529411765 0.857142857142857 45339
3L  33256 0.769230769230769 0.90625 45339

score 1 · Accepted Answer

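You can use the join command, but you need to create a single join field in each data table. Assuming that you do have values other than 2L in column 1, this code should work regardless of whether the two input files are sorted: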

tmp=${TMPDIR:-/tmp}/tmp.$$
trap "rm -f $tmp.?; exit 1" 0 1 2 3 13 15

awk '{print $1 ":" $2, $0}' file1 | sort > $tmp.1
awk '{print $1 ":" $2, $0}' file2 | sort > $tmp.2

join -o 2.2,2.3,2.4,2.5,1.4 $tmp.1 $tmp.2

rm -f $tmp.?
trap 0

If you have bash 'process substitution', or if you know that the data is already sorted appropriately, you can simplify the processing.
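For instance, with bash the temporary files and the trap can be dropped. The following is a sketch under that assumption (bash only, not plain sh). It keeps the colon-separated key and the `-o` list from the script above, but sorts on the key field only, so that `sort` and `join` compare the same string. The wrapper name `join_ps` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; requires bash for <( ... ) process substitution.
# Usage: join_ps file1 file2
join_ps() {
    # Prefix each line with a "col1:col2" key, sort on that key, and join
    # the two streams on it without writing temporary files.
    join -o 2.2,2.3,2.4,2.5,1.4 \
        <(awk '{print $1 ":" $2, $0}' "$1" | sort -k1,1) \
        <(awk '{print $1 ":" $2, $0}' "$2" | sort -k1,1)
}
```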

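I'm not entirely sure why your code was not working, but I would probably use a[$1,$2] subscripts. That will give you less trouble if some of your column 1 values are purely numeric, because such values can become ambiguous when you concatenate columns 1 and 2. That is why the 'key creation' awk scripts use a colon between the fields.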
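To make the ambiguity concrete: with plain concatenation, the field pairs `1`/`23` and `12`/`3` both produce the key `123`, whereas `a[$1,$2]` joins the subscripts with awk's `SUBSEP` character and keeps them apart. Here is a small sketch using made-up two-line input; the function names are hypothetical.

```shell
# Count the distinct array keys produced by each subscript style.
count_concat_keys() {
    awk '{ a[$1$2] = 1 }  END { n = 0; for (k in a) n++; print n }' "$@"
}
count_subsep_keys() {
    awk '{ a[$1,$2] = 1 } END { n = 0; for (k in a) n++; print n }' "$@"
}

printf '1 23\n12 3\n' | count_concat_keys   # prints 1: both lines collapse into "123"
printf '1 23\n12 3\n' | count_subsep_keys   # prints 2: the keys stay distinct
```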

With the revised data files as shown:

file1

2L      5753   33158
2L      8813   33158
2L      7885   33158
2L      7885   33159
2L      1279   33158
2L      5095   33158
2L      3256   33158
2L      5372   33158
2L      7088   33161
2L      5762   33161

file2

2L      5095    0.666666666666667       1
2L      5372    0.5     0.925925925925926
2L      5762    0.434782608695652       0.580645161290323
2L      5904    0.571428571428571       0.869565217391304
2L      5974    0.434782608695652       0.694444444444444
2L      6353    0.785714285714286       0.84
2L      7088    0.590909090909091       0.733333333333333
2L      7885    0.714285714285714       0.864864864864865
2L      7902    0.642857142857143       0.810810810810811
2L      8263    0.833333333333333       0.787878787878788

(Same as in the question.)

Output

2L 5095 0.666666666666667 1 33158
2L 5372 0.5 0.925925925925926 33158
2L 5762 0.434782608695652 0.580645161290323 33161
2L 7088 0.590909090909091 0.733333333333333 33161
2L 7885 0.714285714285714 0.864864864864865 33158
2L 7885 0.714285714285714 0.864864864864865 33159
