linux - How to replace the IDs in file1 with the matching full names from file2

Question

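I have a list of queries and hit GIs in one file (file1). I have another file (file2) that contains the complete hit names. I want to replace each hit GI in file1 with the full hit name from file2, so that the hit next to every query is replaced by the file2 entry that has the same GI.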

file1

 1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
 2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ 
 3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_  
 4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ 
 5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_

file2

1  >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2  >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3  >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5  >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]

Desired output:

1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]

score 0 · Accepted Answer

If I run:

file1=file1.txt; file2=$(cat file2.txt|sed -e "s/>gi/Query=gi/g"|sed -e "s/_ref_/ ref_/g");IFS='\n';echo $file2| awk  'NR==FNR { _[$2]=$2;  f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 } NR!=FNR { if(_[$2] != "") print $0" "f1_line[key]}'  - $file1

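To explain what it does, here it is as a script, with its usage shown below. Inside the script I set the default file names to file1.rasta and file2.rasta, so without arguments it asks for my input: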

./run.sh 
-------------------------------------------------------------------------------
No variables defined settings files as:
fil1=file1.rasta
file2=file2.rasta
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
One of the following: 
 file1=file1.rasta
file2=file2.rasta
 does not exist!
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
usage:
./run.sh file1.fasta file2.fasta is the same as line below
./run.sh ./file1.fasta ./file2.fasta
-- This is if files are elsewhere
./run.sh /path/to/file1.fasta /path/to/file2.fasta
-------------------------------------------------------------------------------

Running it:

./run.sh ./file1.fasta ./file2.fasta 
1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_   hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11]

The bash script run.sh is the one-liner above, broken up for explanation:

#!/bin/bash

 function line() { 
  echo  -e "-------------------------------------------------------------------------------"
 }

 function usage() { 
  line;
  echo "usage:"
  echo $0 file1.fasta file2.fasta is the same as  line below
  echo $0 ./file1.fasta ./file2.fasta
  echo -- This is if files are elsewhere
  echo $0 /path/to/file1.fasta /path/to/file2.fasta
  line;
 } 


 file1=$1;
 file2=$2;

 if [ $# -lt 2 ]; then 
    # Set file1 variable as filename file1.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file1=/path/to/file1.fasta
    file1=file1.rasta; 


    # Set file2 variable as filename file2.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file2=/path/to/file2.fasta

    file2=file2.rasta;
    line;
    echo -e "No variables defined settings files as:\nfil1=$file1\nfile2=$file2";
    line;
fi
 # Check we have both files whether its variables or if not variables
 # matches defined files
 if  [ ! -f  $file1 ]  || [ ! -f  $file2 ]; then
  line;
  echo -e "One of the following: \n file1=$file1\nfile2=$file2\n does not exist!"
  line
  usage
  exit 2;
 fi


 # Define file 2 variable which cats file2.fasta again like above ensure 
 # the file2.fasta can be catted from this path, it pipes it into sed and changes:
 # '>gi' to 'Query=gi' and also changes '_ref_'  to ' ref_'
 # this now matches the same pattern as file1

 cfile2=$(sed -e "s/>gi/Query=gi/g" -e "s/_ref_/ ref_/g" $file2);

 # Set the internal field separator to \n which is the output of variable file2 
 IFS='\n';

  # debug enable this if you now want to see manipulated file2
  # echo $cfile2

 # Echo out cfile2 which now with the above ifs makes it like the file 
 #  formatting making \n the separator - pipe into awk command which 
 # matches against both files
 # Set up a key whilst in one which contains pattern match after:
 # .{number}_{space}* where this is what separates file2's content where tag starts.
 # If the values from $2 match on both lines print out $0 which is everything from file1 
 # plus the key which contains the details
 # the echo $cfile2  is then represented as - before $file1  at the end in effect its the first file value which is the call to file1 

echo $cfile2| awk 'NR==FNR { 
  _[$2]=$2; 
  if( match($0, /\.[0-9]\_ /)) { 
    var1=substr($0, RSTART+3);  
   }
  } 
  NR!=FNR { 
     if(_[$2] != "") print $0" "var1
  }' - $file1

## Method used originally - updated to above which is much cleaner
## pattern matches and then from that point it captures entire string which would 
## ensure it captures the entire tag from file2
 ##echo $cfile2| awk 'NR==FNR { 
 ## _[$2]=$2; 
 ## f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 
 ## } 
 ## NR!=FNR { 
 ##    if(_[$2] != "") print $0" "f1_line[key]
 ## }' - $file1
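
For comparison, here is a minimal Python sketch of the per-hit lookup the awk command is aiming for. In the script above `var1` is a single variable, so it only holds the description from the last file2 line that was read, which is why every line of the sample output ends with the same text. The sketch keeps one description per ID instead. The function name `append_descriptions` and the in-memory lists are illustrative assumptions, not part of run.sh; the key is built with the same normalisation as the sed step (`_ref_` becomes ` ref_`).

```python
import re


def append_descriptions(file1_lines, file2_lines):
    """Append the matching file2 description to each file1 line."""
    descriptions = {}
    for line in file2_lines:
        # Expect '>gi_<number>_ref_<accession>_ <description>'; anything
        # before '>' (such as a leading line number) is ignored.
        m = re.search(r">(gi_\d+)_(ref_\S+)\s+(.*)", line)
        if m:
            # Same normalisation as the sed step: '_ref_' -> ' ref_'.
            key = m.group(1) + " " + m.group(2)
            descriptions[key] = m.group(3).strip()

    result = []
    for line in file1_lines:
        # The hit ID is the 'gi_... ref_...' pair that follows 'Hit='.
        m = re.search(r"Hit=(gi_\d+ ref_\S+)", line)
        if m and m.group(1) in descriptions:
            result.append(line.rstrip() + " " + descriptions[m.group(1)])
    return result


if __name__ == "__main__":
    file2 = [
        ">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]",
        ">gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]",
    ]
    file1 = [
        "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_",
        "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_",
    ]
    for out in append_descriptions(file1, file2):
        print(out)
```

Because the dictionary is keyed on the hit ID, the order of the lines in file2 does not matter.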

score 0 · Accepted Answer

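One of the quickest (and most naive) solutions is to use the search method from Python's re module to match patterns in the strings. I wrote an example of how to do it (you will need to check for yourself that the results are correct):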

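import re

# Placeholder paths: adjust them to your own files.
f1path, f2path, f3path = "file1", "file2", "file3"

namesD = dict()

# Map each id in file2 (the text between '>' and the first space) to its name.
with open(f2path, "r") as file2:
    for line0 in file2:
        strH = re.search(r">(\S+) +(.*)", line0)
        if strH:
            namesD[strH.group(1)] = strH.group(2).strip()

# Rebuild the Hit= field of each file1 line from the dict and write it to file3.
with open(f1path, "r") as file1, open(f3path, "w") as file3:
    for line0 in file1:
        strH = re.search("Hit=", line0)
        if strH is None:
            continue
        # 'gi_X ref_Y_' in file1 corresponds to 'gi_X_ref_Y_' in file2.
        idN = line0[strH.end():].strip().replace(' ', '_')
        if idN in namesD:
            file3.write(line0[:strH.end()] + idN + " " + namesD[idN] + "\n")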

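The idea is to first extract the ids and their names from file2 and add them to a dict (the id is the key, the name is the value). Then you read the first file line by line, extract the id from the hit, and try to match it in the dictionary. If it matches, you can write the result to a third file, or do whatever else you want with it.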
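
To make the matching step concrete, the sketch below reduces one line from each file in the question to the same dictionary key. The helper names `file2_entry` and `hit_key` are hypothetical, chosen only for this illustration.

```python
def file2_entry(line):
    """Split a file2 line into (id, name); assumes the id follows '>'."""
    header = line[line.index(">") + 1:]
    id_part, _, name = header.partition(" ")
    return id_part, name.strip()


def hit_key(line):
    """Turn the text after 'Hit=' into the underscore form used in file2."""
    hit = line.split("Hit=", 1)[1]
    return hit.strip().replace(" ", "_")


f2_line = ">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]"
f1_line = "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_"

names = dict([file2_entry(f2_line)])
key = hit_key(f1_line)
print(key)         # gi_15607202_ref_NP_214574.1_
print(names[key])  # Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
```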

score 0 · Accepted Answer

A step-by-step description of the solution;

Extract only the Hit GI from file1;

cat file1 | awk '{print $3}' | sed 's/Hit=//g' > file1-gi

Remove the > prefix from file2 (this strips the first four characters of each line);
```
sed 's/^....//g' file2 > file2_1
```
Remove redundancy from file2 (if any);
```
sort file2_1 | uniq > file2_2
```

Use awk's system() call to grep the corresponding GI names;

cat file1-gi | awk '{system ("grep "$1" file2_2")}' >> file1-gi-name

Print the first 3 columns of file1;
```
cut -d" " -f-3 file1 > file1_1
```
Paste the two files together;
```
paste file1_1 file1-gi-name > output
```
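
The same steps can also be sketched as a single Python pass, which avoids the intermediate files. Two assumptions are worth flagging. First, the lookup here is an exact match on the GI, whereas `grep` matches its pattern anywhere in a line, so a short GI would also match a longer GI that contains it. Second, the function name `annotate` and the tab-joined output (mirroring `paste`) are illustrative choices, not part of the commands above.

```python
import re


def annotate(file1_lines, file2_lines):
    """Mirror the shell steps: extract hit GIs, index file2, join."""
    # Index file2: drop everything up to '>' and keep one entry per GI
    # (the 'gi_<digits>' prefix), which also removes duplicates.
    by_gi = {}
    for line in file2_lines:
        entry = line.split(">", 1)[-1].strip()
        m = re.match(r"gi_\d+", entry)
        if m:
            by_gi.setdefault(m.group(0), entry)

    output = []
    for line in file1_lines:
        # Keep the line up to and including the hit GI, and capture the GI.
        m = re.search(r"^(.*?Hit=(gi_\d+))", line)
        if not m:
            continue
        prefix, gi = m.group(1), m.group(2)
        # Exact lookup instead of grep; an unmatched hit gets an empty
        # name so the output stays aligned with file1.
        output.append(prefix + "\t" + by_gi.get(gi, ""))
    return output


if __name__ == "__main__":
    file1 = ["Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_"]
    file2 = [">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]"]
    for row in annotate(file1, file2):
        print(row)
```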
