
Basically, I need a script that solves this problem in a very short time. I have two files:

$ head -n 6 fcu.tsv

NM576455     0.324009324     0.578896174     2577
NM539570     0.204545455     0.607877092     2247
NM337132     0.288973384     0.673636364     792
NM374379     0.308300395     0.42            762
NM373443     0.263043478     0.547132867     1383
NM371839     0.298210736     0.492857143     1512

$ head -n 6 mart.tsv

NM539570 ILMN_2199362    15      58.52   protein_coding
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding molecular_function      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding molecular_function      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_type1

For each NM id, I need to append fields 2, 3 and 4 of fcu.tsv to the matching lines of mart.tsv, and to do it quickly.

$ head out.tsv

NM539570 ILMN_2199362    15      58.52   protein_coding  0.204545455     0.607877092     2247
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding molecular_function      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding molecular_function      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_type1 0.324009324   0.578896174     2577

This is what I did in MATLAB (I would prefer a solution that fixes the errors in this code and makes it faster, rather than writing new code):

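fr1 = fopen('fcu.tsv', 'r');
fr2 = fopen('mart.tsv', 'r');

fw = fopen('out.tsv', 'w');

while ~feof(fr1)
    line = fgetl(fr1);
    scan = textscan(line, '%s%f%f%d');   % id, two floats, one integer

    frewind(fr2);                        % rescan mart.tsv for every fcu.tsv row

    while ~feof(fr2)
        line2 = fgetl(fr2);
        id2 = strtok(line2);             % first whitespace-delimited field

        % strcmp compares whole strings; == compares element-wise and
        % errors when the two ids differ in length
        if strcmp(scan{1}{1}, id2)
            fprintf(fw, '%s\t%f\t%f\t%d\n', line2, scan{2}, scan{3}, scan{4});
        end
    end
end

fclose(fr1);
fclose(fr2);
fclose(fw);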

Any help is appreciated.


2 Answers


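One way is to use awk. While FNR == NR, it is reading the first input file given as an argument (fcu.tsv): it saves the first field as a key in a hash and stores the remaining fields, joined with \t, as the value. While FNR < NR, it is reading mart.tsv: if the first field matches a key in the hash, it appends the stored value to the end of the line; otherwise it prints the original line.

Contents of script.awk: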

BEGIN {
    OFS = "\t"
}

FNR == NR {
    for ( i = 2; i <= NF; i++ ) { 
        line = (line ? line OFS : "") $i
    }   
    fcu[ $1 ] = line 
    line = ""
    next
}

FNR < NR {
    if ( $1 in fcu ) { 
        print $0 OFS fcu[ $1 ]
    }   
    else {
        print $0
    }   
}

Run it like this:

awk -f script.awk fcu.tsv mart.tsv

It produces the following output:

NM539570 ILMN_2199362    15      58.52   protein_coding 0.204545455     0.607877092     2247
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding molecular_function      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding molecular_function      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_type1      0.324009324     0.578896174     2577
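
To see the FNR == NR idiom in isolation, here is a minimal sketch that runs a condensed variant of the script on tiny made-up files. The sample ids, the temporary directory and the variable names (demo_dir, awk_out) are assumptions for illustration and are not from the original post. Unlike the script above, this variant sets the field separator to a tab, which assumes the files really are tab-separated, so a value such as "protein binding" stays in one field.

```shell
# Hypothetical sample data in a throwaway directory (not from the original post)
demo_dir=$(mktemp -d)
printf 'NM1\t0.1\t0.2\t10\nNM2\t0.3\t0.4\t20\n' > "$demo_dir/fcu.tsv"
printf 'NM2\tILMN_A\tprotein binding\nNM1\tILMN_B\tnucleus\nNM3\tILMN_C\tnucleus\n' > "$demo_dir/mart.tsv"

# First file (FNR == NR): store everything after the first tab, keyed by the id.
# Second file: append the stored fields when the id is known, else print as-is.
awk_out=$(awk -F'\t' -v OFS='\t' '
    FNR == NR   { fcu[$1] = substr($0, index($0, "\t") + 1); next }
    ($1 in fcu) { print $0, fcu[$1]; next }
                { print }
' "$demo_dir/fcu.tsv" "$demo_dir/mart.tsv")

printf '%s\n' "$awk_out"
rm -rf "$demo_dir"
```

Each file is read exactly once and every lookup is a hash access, so the work grows roughly with the combined size of the two files rather than with their product. The order of the lines in mart.tsv is preserved. One caveat of the idiom: if the first file is empty, FNR == NR is also true for the second file, so its lines would be loaded into the hash instead of being printed.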
answered 2012-07-15T10:33:22.070

This is a command-line-centric solution that works on any system with coreutils; apologies if it is not applicable in your case.

If mart.tsv is properly padded, like this:

NM539570 ILMN_2199362    15      58.52   protein_coding  NA      NA                 NA                      NA
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component NA                      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding            molecular_function      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component NA                      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding            molecular_function      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component NA                      SAM_type1

then the solution could be a simple join (see info join):

$ join <(sort mart.tsv) <(sort fcu.tsv) | column -t
NM539570  ILMN_2199362  15  58.52  protein_coding  NA       NA                  NA                  NA         0.204545455  0.607877092  2247
NM576455  ILMN_1709067  1   65.74  protein_coding  nucleus  cellular_component  NA                  SAM_2      0.324009324  0.578896174  2577
NM576455  ILMN_1709067  1   65.74  protein_coding  protein  binding             molecular_function  SAM_2      0.324009324  0.578896174  2577
NM576455  ILMN_2195138  1   65.74  protein_coding  nucleus  cellular_component  NA                  SAM_2      0.324009324  0.578896174  2577
NM576455  ILMN_2195138  1   65.74  protein_coding  nucleus  cellular_component  NA                  SAM_type1  0.324009324  0.578896174  2577
NM576455  ILMN_2195138  1   65.74  protein_coding  protein  binding             molecular_function  SAM_2      0.324009324  0.578896174  2577

column comes from the bsdmainutils package.
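
If the files are truly tab-separated (an assumption, since the listings above do not show which whitespace is used), join can be told to split on tabs only, so that a value such as "protein binding" stays in one field. The sketch below uses small made-up files; the sample ids, the temporary directory and the variable names (demo_dir, tab, join_out) are hypothetical. It also passes -a 1 so that mart.tsv lines with no match in fcu.tsv are still printed.

```shell
# Hypothetical sample data in a throwaway directory (not from the original post)
demo_dir=$(mktemp -d)
tab=$(printf '\t')
printf 'NM2\t0.3\t0.4\t20\nNM1\t0.1\t0.2\t10\n' > "$demo_dir/fcu.tsv"
printf 'NM2\tILMN_A\tprotein binding\nNM1\tILMN_B\tnucleus\nNM3\tILMN_C\tnucleus\n' > "$demo_dir/mart.tsv"

# join expects both inputs sorted on the join field, using the same collation
LC_ALL=C sort -t "$tab" -k1,1 "$demo_dir/mart.tsv" > "$demo_dir/mart.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$demo_dir/fcu.tsv" > "$demo_dir/fcu.sorted"

# -t sets the field separator to a tab; -a 1 keeps unmatched lines of the first file
join_out=$(LC_ALL=C join -t "$tab" -a 1 "$demo_dir/mart.sorted" "$demo_dir/fcu.sorted")

printf '%s\n' "$join_out"
rm -rf "$demo_dir"
```

With an explicit tab separator, join appends the fcu.tsv fields after however many fields each mart.tsv line has. Note that the output follows the sorted order of the ids, not the original order of mart.tsv.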

answered 2012-07-15T10:12:08.803