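I have a list of queries and their hit GIs in one file (file1). I have another file with the full hit names (file2), and I want to replace each hit GI in file1 with the full hit name from file2. Each hit must stay next to its corresponding query and be replaced by the full name that has the same GI.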

File 1

 1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
 2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ 
 3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_  
 4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ 
 5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_ 

File 2

1  >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2  >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3  >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5  >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]

Desired output:

1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]

3 Answers

If I run:

file1=file1.txt; file2=$(cat file2.txt|sed -e "s/>gi/Query=gi/g"|sed -e "s/_ref_/ ref_/g");IFS='\n';echo $file2| awk  'NR==FNR { _[$2]=$2;  f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 } NR!=FNR { if(_[$2] != "") print $0" "f1_line[key]}'  - $file1 

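To explain what it does, here it is as a script, with the usage shown below. In the script I set the default files to file1.rasta and file2.rasta, so it needs my input: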

./run.sh 
-------------------------------------------------------------------------------
No variables defined settings files as:
fil1=file1.rasta
file2=file2.rasta
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
One of the following: 
 file1=file1.rasta
file2=file2.rasta
 does not exist!
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
usage:
./run.sh file1.fasta file2.fasta is the same as line below
./run.sh ./file1.fasta ./file2.fasta
-- This is if files are elsewhere
./run.sh /path/to/file1.fasta /path/to/file2.fasta
-------------------------------------------------------------------------------

Running it:

./run.sh ./file1.fasta ./file2.fasta 
1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_   hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 

The bash script run.sh is the one-liner above, broken down for explanation:

#!/bin/bash

 function line() { 
  echo  -e "-------------------------------------------------------------------------------"
 }

 function usage() { 
  line;
  echo "usage:"
  echo $0 file1.fasta file2.fasta is the same as  line below
  echo $0 ./file1.fasta ./file2.fasta
  echo -- This is if files are elsewhere
  echo $0 /path/to/file1.fasta /path/to/file2.fasta
  line;
 } 


 file1=$1;
 file2=$2;

 if [ $# -lt 2 ]; then 
    # Set file1 variable as filename file1.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file1=/path/to/file1.fasta
    file1=file1.rasta; 


    # Set file2 variable as filename file2.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file2=/path/to/file2.fasta

    file2=file2.rasta;
    line;
    echo -e "No variables defined settings files as:\nfil1=$file1\nfile2=$file2";
    line;
fi
 # Check we have both files whether its variables or if not variables
 # matches defined files
 if  [ ! -f  $file1 ]  || [ ! -f  $file2 ]; then
  line;
  echo -e "One of the following: \n file1=$file1\nfile2=$file2\n does not exist!"
  line
  usage
  exit 2;
 fi


 # Define file 2 variable which cats file2.fasta again like above ensure 
 # the file2.fasta can be catted from this path, it pipes it into sed and changes:
 # '>gi' to 'Query=gi' and also changes '_ref_'  to ' ref_'
 # this now matches the same pattern as file1

 cfile2=$(sed -e "s/>gi/Query=gi/g" -e "s/_ref_/ ref_/g" $file2);

 # Set the internal field separator to \n which is the output of variable file2 
 IFS='\n';

  # debug enable this if you now want to see manipulated file2
  # echo $cfile2

 # Echo out cfile2 which now with the above ifs makes it like the file 
 #  formatting making \n the separator - pipe into awk command which 
 # matches against both files
 # Set up a key whilst in one which contains pattern match after:
 # .{number}_{space}* where this is what separates file2's content where tag starts.
 # If the values from $2 match on both lines print out $0 which is everything from file1 
 # plus the key which contains the details
 # the echo $cfile2  is then represented as - before $file1  at the end in effect its the first file value which is the call to file1 

echo $cfile2| awk 'NR==FNR { 
  _[$2]=$2; 
  if( match($0, /\.[0-9]\_ /)) { 
    var1=substr($0, RSTART+3);  
   }
  } 
  NR!=FNR { 
     if(_[$2] != "") print $0" "var1
  }' - $file1

## Method used originally - updated to above which is much cleaner
## pattern matches and then from that point it captures entire string which would 
## ensure it captures the entire tag from file2
 ##echo $cfile2| awk 'NR==FNR { 
 ## _[$2]=$2; 
 ## f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 
 ## } 
 ## NR!=FNR { 
 ##    if(_[$2] != "") print $0" "f1_line[key]
 ## }' - $file1
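
The run shown above prints the same description on every row, which is consistent with var1 being a single variable that each line of file2 overwrites. In bash, IFS='\n' also sets the separator to the two characters backslash and n rather than a newline, which would explain "protei". A minimal sketch of the per-ID lookup the script seems to aim for, written in Python; the names HEADER, HIT, load_names and annotate are illustrative and not part of the script, and the sample lines are taken from the question:

```python
import re

# A file2 header, e.g. ">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [...]"
HEADER = re.compile(r">(gi_\d+)_(ref_\S+)\s+(.*)$")
# The hit part of a file1 line, e.g. "Hit=gi_15607202 ref_NP_214574.1_"
HIT = re.compile(r"Hit=(gi_\d+)\s+(ref_\S+)")


def load_names(file2_lines):
    """Map (gi, ref) -> (full id, description) for every header line in file2."""
    names = {}
    for line in file2_lines:
        m = HEADER.search(line.strip())
        if m:
            gi, ref, desc = m.groups()
            names[(gi, ref)] = (f"{gi}_{ref}", desc)
    return names


def annotate(file1_lines, names):
    """Yield file1 lines with the hit replaced by its full name when it is known."""
    for line in file1_lines:
        line = line.rstrip()
        m = HIT.search(line)
        if m and m.groups() in names:
            full_id, desc = names[m.groups()]
            line = f"{line[:m.start()]}Hit={full_id} {desc}{line[m.end():]}"
        yield line


# Sample data copied from the question (one line of each file).
file1 = ["Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_  "]
file2 = [">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein "
         "[Mycobacterium tuberculosis H37Rv]"]

result = list(annotate(file1, load_names(file2)))
print(result[0])
```

Keeping one dictionary entry per (gi, ref) pair is what lets each row receive its own description; lines whose hit is not found in file2 pass through unchanged.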
answered 2013-12-30T11:07:23.933

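One of the quickest (and most naive) solutions is to use Python: the search function in the re module matches a pattern in a string. I wrote an example to show how to do it (you will have to do some checks to see whether the results are correct...):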

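import re

# Placeholder paths (not from the question); point them at the real files.
f1path, f2path, f3path = "file1", "file2", "file3"

namesD = dict()

# file2: map each id (the text between '>' and the first space) to its name.
with open(f2path, "r") as file2:
    for line0 in file2:
        strH = re.search(" ", line0)
        if strH is None:
            continue  # no space on this line, so there is no name to store
        idN = line0[1:strH.start()]
        namesD[idN] = line0[strH.end():]

# file1: rebuild the id that follows "Hit=" and write it with its name when known.
with open(f1path, "r") as file1, open(f3path, "w") as file3:
    for line0 in file1:
        strH = re.search("Hit=", line0)
        if strH is None:
            continue
        idN = line0[strH.end():].strip().replace(' ', '_')
        if idN in namesD:
            file3.write("Hit=" + idN + " " + namesD[idN])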

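The idea is to first extract the ids and their names from file2 and add them to a dict (the id is the key, the name is the value). Then you read the first file line by line, extract the id from the hit, and try to match it in the dictionary. If they match, you can write the result to a third file, or do whatever you want with it.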
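
To make the matching step concrete, here is a small worked example of how the two keys line up, using one line of each file from the question. The variable names key1 and key2 are only illustrative:

```python
file2_line = (">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein "
              "[Mycobacterium tuberculosis H37Rv]\n")
file1_line = ("Query=gi_148659820 ref_YP_001281343.1_ "
              "Hit=gi_15607202 ref_NP_214574.1_  \n")

# Key from file2: everything between '>' and the first space.
key2 = file2_line[1:file2_line.index(" ")]

# Key from file1: the text after "Hit=", trimmed, with the inner space turned into '_'.
key1 = file1_line.split("Hit=", 1)[1].strip().replace(" ", "_")

print(key1 == key2)
```

The lookup only works because file1 separates the gi and ref parts with a space where file2 uses an underscore; if a file carried leading line numbers, the slice starting at index 1 would need adjusting.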

answered 2013-12-30T11:14:10.527

A step-by-step description of the solution:

  1. Extract only the Hit GIs from file1;

    cat file1 | awk '{print $3}' | sed 's/Hit=//g' > file1-gi
    
  2. Remove the leading '>' from file2 (the sed command strips the first four characters of each line);

    sed 's/^....//g' file2 > file2_1
    
  3. Remove redundancy in file2, if there is any;

    sort file2_1 | uniq > file2_2
    
  4. Use awk's system() to grep the name that corresponds to each GI;

    cat file1-gi | awk '{system ("grep "$1" file2_2")}' >> file1-gi-name
    
  5. Print the first 3 columns of file1;

    cut -d" " -f-3 file1 > file1_1
    
  6. Paste the two files together;

    paste file1_1 file1-gi-name > output
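
The six steps above can also be sketched as a single pass in Python, without intermediate files. This is only an illustration under stated assumptions: lines carry no leading line numbers, columns are separated by spaces, and the helper name paste_names is hypothetical. grep matches substrings, so a GI that is a prefix of another would return both lines; the sketch uses an exact lookup on the GI instead.

```python
def paste_names(file1_lines, file2_lines):
    """Mirror steps 1-6: look up each Hit GI and paste its entry after 3 columns."""
    # Steps 2-3: drop the leading '>' and keep one entry per GI.
    names = {}
    for line in file2_lines:
        entry = line.strip().lstrip(">")
        if entry.startswith("gi_"):
            gi = "_".join(entry.split("_")[:2])  # e.g. "gi_15607202"
            names.setdefault(gi, entry)

    rows = []
    for line in file1_lines:
        cols = line.split()
        if len(cols) < 3:
            continue  # not a "Query=... ref_... Hit=..." line
        # Step 1: the third column is "Hit=gi_..."; strip the "Hit=" prefix.
        gi = cols[2].replace("Hit=", "", 1)
        # Step 5: keep the first three columns.
        first3 = " ".join(cols[:3])
        # Steps 4 and 6: exact lookup, then join with a tab as paste does.
        rows.append(first3 + "\t" + names.get(gi, ""))
    return rows


# Sample data copied from the question (one line of each file).
file1 = ["Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_"]
file2 = [">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein "
         "[Mycobacterium tuberculosis H37Rv]"]

rows = paste_names(file1, file2)
print(rows[0])
```

A GI that is missing from file2 gets an empty name here, so rows stay aligned with file1; with grep and paste, a missing match would shift every later line up by one.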
    
answered 2013-12-31T12:33:56.997