bash - Using awk to print a header name and a substring

Question

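I am trying to use this code to print a header with the gene name and then extract a substring based on its positions, but it does not work: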

>output_file
cat input_file | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' tmp`
        startPos=`awk '{print $2}' tmp`
        endPOs=`awk '{print $3}' tmp`
                for i in temp; do
                echo ">${geneName}" >> genes_fasta ;
                echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
        done
done

Input file

nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548

fasta

atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............

This is my incorrect output file

>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta

The output should look like this:

>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....

score 3 · Accepted Answer

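You can do this with a single call to awk, which will be orders of magnitude more efficient than looping in a shell script and calling awk 4 times per iteration. Since you have bash, you can use command substitution to read the contents of fasta into an awk variable, and then output the header followed by the substring running from the start position through the end position of your fasta file.

For example: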

awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

Or use getline inside a BEGIN rule:

awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
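
A note on the getline variant: getline returns 1 when it reads a line, 0 at end of file, and -1 on error, so the BEGIN rule can check that the sequence was actually read. The following is only a minimal sketch, assuming the fasta file holds the whole sequence on its first line. The function name extract_regions and the sample data are made up for illustration:

```shell
# Hypothetical helper: extract_regions FASTA_FILE REGIONS_FILE
extract_regions() {
    awk -v fasta_file="$1" '
        BEGIN {
            # getline returns 1 on success, 0 at end of file, -1 on error
            if ((getline fasta < fasta_file) <= 0) {
                print "cannot read " fasta_file > "/dev/stderr"
                exit 1
            }
        }
        # header line, then the substring from start ($2) through end ($3)
        { print ">" $1; print substr(fasta, $2, $3 - $2 + 1) }
    ' "$2"
}

# Small made-up sample data so the sketch runs on its own
tmpdir=$(mktemp -d)
printf 'atgcatgcatgc\n' > "$tmpdir/fasta"
printf 'geneA 1 4\ngeneB 3 8\n' > "$tmpdir/input"
extract_regions "$tmpdir/fasta" "$tmpdir/input"
```

With the sample data this prints >geneA, atgc, >geneB and gcatgc on four lines. If the fasta file is missing or empty, awk exits with status 1 before reading the regions file.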

Example input file

Note: the start and end values have been reduced to fit within the 129 characters of your example:

$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79

And the first 129 characters of your example fasta:

$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc

Example use/output

$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg

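Look things over and let me know whether I have understood your requirements. Also let me know if you have further questions about the solution.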

score 1 · Accepted Answer

If I understand correctly, then:

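awk 'NR==FNR {fasta = fasta $0; next}   # first file: accumulate the sequence
    {
        # second file: header line, then the substring from $2 through $3
        printf(">%s\n%s\n", $1, substr(fasta, $2, $3 - $2 + 1))
    }' fasta input_file > genes_fasta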

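It first reads the fasta file and stores the sequence in the variable fasta.
Then it reads input_file line by line and extracts the substring of fasta that starts at $2 and has length $3 - $2 + 1. (Note that the third argument of the substr function is the length, not the end position.)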
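
To make the length-versus-end-position point concrete, here is a small sketch with a made-up 12-character sequence (the variable names dna, start and end are illustrative only):

```shell
dna="atgcatgcatgc"
start=2
end=5

# Passing the end position as the third argument asks for 5 characters
echo "$dna" | awk -v s="$start" -v e="$end" '{ print substr($0, s, e) }'
# prints: tgcat

# Passing the length end - start + 1 returns positions 2 through 5
echo "$dna" | awk -v s="$start" -v e="$end" '{ print substr($0, s, e - s + 1) }'
# prints: tgca
```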

Hope this helps.

score 1 · Accepted Answer

Got it working! This is the script for extracting substrings from a fasta file:

# $fasta is assumed to be set earlier to the name of the fasta file
# (its definition is not shown in this post).
while read -r geneName startPos endPos; do
        # The end position is inclusive, so the length is end - start + 1
        length=$((endPos - startPos + 1))
        echo ">${geneName}" >> genes_fasta
        awk -v S="$startPos" -v L="$length" '{print substr($0,S,L)}' "unwraped_${fasta}" >> genes_fasta
done < genes_and_bounderies1
