perl - Joining two files based on two fields

Question

I posted a question a week ago, and the answer was simply to use join:

join <(sort file1) <(sort file2) >output

to join files that have something in common, usually the first field.

I have the following two files:

genes.txt

ENSG001 ENSG002
ENSG002 ENSG001
ENSG003 ENSG004

features.txt

ENSG001 400
ENSG002 350
ENSG003 210
ENSG004 100

I need to join these two files to be like this:

output.txt

ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100

I know the answer lies in the join command, but I can't figure out how to join based on two fields. I tried

join -j 1 <(sort genes.txt) <(sort features.txt) >attempt1.txt

but the result looks like this:

attempt1.txt

ENSG001 ENSG002 400
ENSG002 ENSG001 350
ENSG003 ENSG004 210

I then tried

join -j 2 <(sort -k 2 genes.txt) <(sort -k 2 features.txt) >attempt2.txt

but attempt2.txt is empty.

Can join combine two files based on two fields? If not, how can I do it?

score 3 · Accepted Answer

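use strict;
use warnings;

# Load features.txt into a hash: gene ID => value.
my %features;
open my $fd, '<', 'features.txt' or die $!;
while (<$fd>) {
    my ($k, $v) = split;
    $features{$k} = $v;
}
close $fd or die $!;

# Append the looked-up value after every ID in genes.txt.
open $fd, '<', 'genes.txt' or die $!;
while (<$fd>) {
    s/(\w+)/$1 $features{$1}/g;
    print;
}
close $fd or die $!;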

score 3 · Accepted Answer

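As far as I know, join does not support this. See the join manpage.

However, you can accomplish this in 2 ways:

Turn the first space/tab in each file into a # (or some other character that you would never see in your files), then use join as before. It will treat the first 2 fields as 1 field: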
```
perl -pi -e 's/^(\S+)\s+/$1#/' file1
perl -pi -e 's/^(\S+)\s+/$1#/' file2
join <(sort file1) <(sort file2) >output
tr "#" " " < output > output.final
```
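Do it in Perl. You can take
- the blunt approach (perreal's answer: slurp both files at once); this uses a lot of memory if both files are large
- a more memory-efficient approach (cdtits's answer: slurp the smaller file, store it in a hash, then look up each line of the second file as you read it line by line)
- for really huge files, the linear approach:
  
  Sort both files and read one line from each. If the keys match, print the match; if not, skip one line in the file with the smaller ID.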
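
The linear approach in the last bullet can be sketched for a single key field as follows. This is only an illustration, not a replacement for join, which does roughly the same thing with sorted input. It assumes the IDs in the right-hand file are unique and that both files were sorted under LC_ALL=C so the comparison order matches the sort order. The names work, advance, rk, rv and merged are made up for the example, and the sample data is recreated in a temporary directory so the script runs as-is.

```shell
#!/bin/sh
# Sketch of a single-key merge join over the sample data from the question.
export LC_ALL=C            # keep sort order and awk string comparison consistent
work=$(mktemp -d)          # scratch directory (hypothetical name)

printf '%s\n' 'ENSG001 ENSG002' 'ENSG002 ENSG001' 'ENSG003 ENSG004' |
    sort > "$work/genes.sorted"
printf '%s\n' 'ENSG001 400' 'ENSG002 350' 'ENSG003 210' 'ENSG004 100' |
    sort > "$work/features.sorted"

merged=$(awk -v right="$work/features.sorted" '
    # Read the next line of the right-hand file into rk (key) and rv (value).
    # Returns 0 once the right-hand file is exhausted.
    function advance(   line, parts) {
        if ((getline line < right) > 0) {
            split(line, parts, " ")
            rk = parts[1]; rv = parts[2]
            return 1
        }
        return 0
    }
    BEGIN { have = advance() }
    {
        # Skip right-hand lines whose key sorts before the current left-hand key.
        while (have && (rk "") < ($1 "")) have = advance()
        # Keys match: print the joined line. Otherwise this left-hand line is skipped.
        if (have && rk == $1) print $1, $2, rv
    }
' "$work/genes.sorted")

printf '%s\n' "$merged"
rm -r "$work"
```

Each file is read once, front to back, so memory use stays constant regardless of file size. For the two-field problem in the question, such a pass would have to be run once per field, re-sorting in between, as the join-based answers do.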

score 3 · Accepted Answer

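Thanks everyone, I managed to answer it by cheating on the problem.

First I joined the files normally, then I swapped the positions of the first and second fields, then I joined the modified output with the features file again, and finally I swapped the fields back.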

join <(sort genes.txt) <(sort features.txt) >tmp

cat tmp | awk '{ print $2, $1, $3 }' >tmp2

join <(sort tmp2) <(sort features.txt) >tmp3

cat tmp3 | awk '{ print $2, $3, $1, $4 }' >output.txt

score 1 · Accepted Answer

Your approach is generally correct. It should be achievable with something like

join -o '1.1 2.2 1.2 1.3' <(
    join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
    sort
) <(sort features.txt)

If I put ENSG004 rather than ENST004 in features.txt, I get what you are looking for:

$ join -o '1.1 2.2 1.2 1.3' <(
      join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
      sort
  ) <(sort features.txt)
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100

A less verbose version, but one where the fields are harder to keep track of:

join -o '1.2 2.2 1.1 1.3' -1 2 <(
    join -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
    sort -k 2
) <(sort features.txt)

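If you are going to process really big data, this should work quite efficiently for tens of gigabytes (and should also fare better than most RDBMSs if features.txt and genes.txt are of comparable size):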

TMP=`mktemp`
sort features.txt > "$TMP"
sort -k 2 genes.txt | join -o '1.1 1.2 2.2' -1 2 - "$TMP" | sort |
    join -o '1.1 2.2 1.2 1.3' - "$TMP"
rm "$TMP"

score 1 · Accepted Answer

If the "ENST" in features.txt is meant to be "ENSG", here is an awk solution that works for the given example:

awk 'BEGIN {while((getline <"features.txt") > 0) f[$1]=$2} {print $1,f[$1],$2,f[$2]}' < genes.txt

I can explain in detail if needed.
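
For readers who want the explanation spelled out, here is the same lookup as a commented script. It is a sketch that uses the common NR == FNR two-file idiom in place of the getline loop, and it recreates the question's sample files in a temporary directory so it can be run as-is. The names work, value and joined are made up for the illustration.

```shell
#!/bin/sh
# Sketch: look up both gene IDs of each pair in a table built from features.txt.
work=$(mktemp -d)   # scratch directory holding the sample data (hypothetical name)

printf '%s\n' 'ENSG001 ENSG002' 'ENSG002 ENSG001' 'ENSG003 ENSG004' > "$work/genes.txt"
printf '%s\n' 'ENSG001 400' 'ENSG002 350' 'ENSG003 210' 'ENSG004 100' > "$work/features.txt"

joined=$(awk '
    # First file (features.txt): NR == FNR holds only while reading it.
    # Remember each value under its gene ID, then move on to the next line.
    NR == FNR { value[$1] = $2; next }

    # Second file (genes.txt): print each ID followed by its stored value.
    { print $1, value[$1], $2, value[$2] }
' "$work/features.txt" "$work/genes.txt")

printf '%s\n' "$joined"
rm -r "$work"
```

The output keeps the line order of genes.txt, and only features.txt has to fit in memory. An ID that is missing from features.txt prints with an empty value rather than dropping the line.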

score 1 · Accepted Answer

Using perl:

use strict;
use warnings;
open my $gin, '<', 'genes.txt'    or die "genes: $!";
open my $fin, '<', 'features.txt' or die "features: $!";
my %relations;   # first gene ID => second gene ID
my %values;      # gene ID => value
while (<$gin>) {
  my ($r1, $r2) = split;
  $relations{$r1} = $r2;
}
while (<$fin>) {
  my ($k, $v) = split;
  $values{$k} = $v;
}
# Pairs are printed in sorted order of the first ID, not in file order.
for my $r1 (sort keys %relations) {
  my $r2 = $relations{$r1};
  print "$r1 $values{$r1} $r2 $values{$r2}\n";
}
close $fin; close $gin;
