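I am trying to match a list of terms (Ensembl gene IDs) from a source file against the entries in a target rnaseq.gtf file. I want to print each matched Ensembl gene ID, together with its RPKM1 and RPKM2 values, to a separate output file.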

The source_geneid.csv file looks like this:

GO Genes ENSEMBLE Gene ID
AATF    ENSG00000108270
ADNP    ENSG00000101126

The target_rnaseq.gtf file:

chr17   gencodeV7   gene    35306175    35414170    0.669763    +   .   gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";
chr20   gencodeV7   gene    49505585    49547750    0.862675    -   .   gene_id "ENSG00000101126.8"; transcript_ids "ENST00000371602.2,ENST00000349014.3,ENST00000396029.3,ENST00000396032.1,ENST00000534467.1,"; RPKM1 "12.0082"; RPKM2 "8.55263"; iIDR "0.000";

The output file should contain the matched gene_id values and their corresponding RPKM1 and RPKM2 values:

ENSG00000108270.5 RPKM1 "7.81399"  RPKM2 "8.149"
ENSG00000101126.8 RPKM1 "12.0082" RPKM2 "8.55263"

On the command line I have tried:

grep -w "ENSG*" target_rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' >> output.txt

I have also tried (thanks to fedorqui):

while read line
do
  var=$(echo $line | awk '{print $2}')
  grep -w "$var" target_rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.txt
done < source_geneid.csv

But it prints every gene ID from the target file.

1 Answer

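target_rnaseq.gtf looks well-formed, so it is easy to preprocess it and make the job simpler. For example, pulling out the values you are interested in is straightforward (in the commands below, rnaseq and geneid appear to be short names for target_rnaseq.gtf and source_geneid.csv):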

$ awk 'NR>1{gsub(/ ?"/,"",$1);print $1,$3,$4}' FS=';' RS='gene_id' rnaseq
ENSG00000108270.5  RPKM1 "7.81399"  RPKM2 "8.149"
ENSG00000101126.8  RPKM1 "12.0082"  RPKM2 "8.55263"
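
The command above relies on a multi-character RS, which gawk and mawk accept but which POSIX leaves unspecified. As a hedged alternative, here is a minimal sketch that picks the attributes by name instead; the scratch directory and the variable names (tmp, extracted) are assumptions made for illustration, and the sample record is copied from the question.

```shell
# Scratch copy of one record from the question (file name follows the answer's "rnaseq").
tmp=$(mktemp -d)
printf '%s\n' 'chr17   gencodeV7   gene    35306175    35414170    0.669763    +   .   gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";' > "$tmp/rnaseq"

# Walk the whitespace-separated fields and pick each attribute by its name,
# so the result depends neither on column positions nor on a multi-character RS.
extracted=$(awk '{
    id = r1 = r2 = ""
    for (i = 1; i < NF; i++) {
        val = $(i + 1)
        gsub(/[";]/, "", val)            # strip the quotes and the trailing semicolon
        if ($i == "gene_id")    id = val
        else if ($i == "RPKM1") r1 = val
        else if ($i == "RPKM2") r2 = val
    }
    if (id != "") print id, "RPKM1", r1, "RPKM2", r2
}' "$tmp/rnaseq")

echo "$extracted"
```

Unlike the original command, this sketch drops the quotes around the values; remove the `"` from the gsub character class if you want to keep them.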

Parsing source_geneid.csv is just as simple:

$ awk 'NR>1{print $2}' geneid 
ENSG00000108270
ENSG00000101126

Putting it all together:

$ grep -f <(awk 'NR>1{print $2}' geneid) <(awk 'NR>1{gsub(/ ?"/,"",$1);print $1,$3,$4}' FS=';' RS='gene_id' rnaseq)
ENSG00000108270.5  RPKM1 "7.81399"  RPKM2 "8.149"
ENSG00000101126.8  RPKM1 "12.0082"  RPKM2 "8.55263"
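
The same join can also be sketched in a single awk call using the common two-file idiom (NR == FNR), which looks each ID up exactly after stripping the version suffix rather than pattern-matching with grep. This is only an illustration under stated assumptions: it reuses the column positions from the asker's own command ($10, $13 to $16), the scratch directory and the array name want are made up here, and the third GTF record (ENSG00000000001.1) is invented purely to show that non-listed genes are filtered out.

```shell
tmp=$(mktemp -d)

# Gene list in the layout shown in the question: a header line, then symbol and ID.
cat > "$tmp/geneid" <<'EOF'
GO Genes ENSEMBLE Gene ID
AATF    ENSG00000108270
ADNP    ENSG00000101126
EOF

# Two records from the question, plus one made-up record (ENSG00000000001.1)
# that is NOT in the gene list and should therefore be dropped.
cat > "$tmp/rnaseq" <<'EOF'
chr17   gencodeV7   gene    35306175    35414170    0.669763    +   .   gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";
chr1   gencodeV7   gene    1000    2000    0.5    +   .   gene_id "ENSG00000000001.1"; transcript_ids "ENST00000000001.1,"; RPKM1 "1.0"; RPKM2 "2.0"; iIDR "0.000";
chr20   gencodeV7   gene    49505585    49547750    0.862675    -   .   gene_id "ENSG00000101126.8"; transcript_ids "ENST00000371602.2,ENST00000349014.3,ENST00000396029.3,ENST00000396032.1,ENST00000534467.1,"; RPKM1 "12.0082"; RPKM2 "8.55263"; iIDR "0.000";
EOF

awk '
    NR == FNR { if (FNR > 1) want[$2]; next }    # first file: remember the IDs, skip the header
    {
        id = $10; gsub(/[";]/, "", id)           # "ENSG00000108270.5"; -> ENSG00000108270.5
        key = id; sub(/\.[0-9]+$/, "", key)      # drop the version suffix for the lookup
        if (key in want) {
            r1 = $14; gsub(/;/, "", r1)          # keep the quotes, drop the semicolon
            r2 = $16; gsub(/;/, "", r2)
            print id, $13, r1, $15, r2
        }
    }
' "$tmp/geneid" "$tmp/rnaseq" > "$tmp/output.txt"

cat "$tmp/output.txt"
```

With the sample data this prints the two requested genes in the format the question asks for, and the made-up record is left out. If the attribute order can vary between records, positional fields are no longer safe and the values should be looked up by name instead.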
answered 2013-04-19T10:48:23.473