bash - Extracting two pieces of text from one line at once with sed

Question

OK, I've found similar answers on SO, but my sed/grep/awk fu is so weak that I couldn't quite adapt them to my task. Namely, given this file "test.gff":

accn|CP014704   RefSeq  CDS 403 915 .   +   0   ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704   RefSeq  CDS 928 2334    .   +   0   ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704   RefSeq  CDS 31437   32681   .   +   0   ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704   RefSeq  CDS 2355    2585    .   +   0   ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein

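I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to the end of the line or a semicolon (as you can see, one of these lines also has a "gene=" value).

So I'd like something like this: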

ID    product
AZ909_00020    transcriptional regulator
AZ909_00025    FAD/NAD(P)-binding oxidoreductase
AZ909_00145    gamma-glutamyl-phosphate reductase

This is as far as I've got:

printf "ID\tproduct\n"

sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff

Thanks!

score 5 · Accepted Answer

Try the following:

sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1\t\2/' test.gff

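Compared to your attempt, I changed how the product is matched. Since we don't know whether that field ends with a ; or with EOL, we simply match as many non-; characters as possible. I also added a .* at the end to match any characters that may remain after the product. That way the whole line matches when we do the substitution, and we can rewrite it completely.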
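To make the difference concrete, here is a small sketch that runs both kinds of pattern on one attribute string taken from the sample data. The variable names and the | output separator are chosen only for this illustration.

```shell
# One attribute string from test.gff (the line that also carries gene=).
line='ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA'

# Greedy groups: each \(.*\) grabs as much as it can, semicolons included.
greedy=$(printf '%s\n' "$line" | sed 's/.*ID=\(.*\);.*product=\(.*\)/\1|\2/')

# Bounded groups: [^;]* stops at the first semicolon (or at end of line).
bounded=$(printf '%s\n' "$line" | sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1|\2/')

printf '%s\n' "$greedy"
# AZ909_00145;locus_tag=AZ909_00145|gamma-glutamyl-phosphate reductase;gene=proA
printf '%s\n' "$bounded"
# AZ909_00145|gamma-glutamyl-phosphate reductase
```

The greedy version drags locus_tag into the first group and gene= into the second, which is exactly what the negated character class prevents.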

If you want something more robust, here is a perl one-liner:

perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"' test.gff

This extracts the two fields separately, each with its own regex. It works even if the fields appear in the opposite order.
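If perl is not available, the same order-independent idea can be sketched with POSIX awk's match() and substr(). The function name extract_fields and the reversed sample line below are made up for illustration; they are not part of the original data.

```shell
# Hypothetical helper: pull ID= and product= out of each input line
# independently, so their relative order does not matter.
extract_fields() {
    awk '{
        id = ""; prod = ""
        # match() sets RSTART and RLENGTH; skip the "ID=" / "product=" prefix.
        if (match($0, /ID=[^;]*/))      id   = substr($0, RSTART + 3, RLENGTH - 3)
        if (match($0, /product=[^;]*/)) prod = substr($0, RSTART + 8, RLENGTH - 8)
        print id "\t" prod
    }'
}

# Made-up line with product placed before ID.
reversed='product=hypothetical protein;ID=AZ909_00030'
result=$(printf '%s\n' "$reversed" | extract_fields)
printf '%s\n' "$result"
```

Note that the plain ID= pattern would also match inside a longer key ending in "ID=", so this sketch assumes no such attribute appears earlier on the line.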

score 1 · Accepted Answer

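The main problem with your regex is the use of .* instead of [^;]*, because .* matches any character, whereas you only want to match non-semicolons. Try this: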

$ sed -E 's/.*ID=([^;]+).*product=([^;]+).*/\1\t\2/' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

Or:

$ awk -F'[=;]' -v OFS='\t' '{print $2, $6}' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

You can also easily extract the header values with awk:

$ awk -F'[=;]' -v OFS='\t' 'NR==1{sub(/.* /,"",$1); print $1, $5} {print $2, $6}' file
ID      product
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

score 1 · Accepted Answer

If you can use GNU awk (aka gawk), you could try the following:

With awk

gawk 'BEGIN{printf "ID\tProduct%s",RS}
     {printf "%s\t%s%s",gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0),
      gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0),RS}
    ' test.gff | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

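As you may have noticed, the two gensub calls do the heavy lifting here.

In gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0), everything except the text enclosed between ID= and the first semicolon that follows it is stripped from the record ($0). Note that gensub does not modify the record itself; it only returns the modified string, which is then printed.
In gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0), everything except the text between product= and the first semicolon (or the end of the line) is stripped.
Finally, we use expand -t to widen the tab stops and get nicely formatted output.
Since hard-coding \n is bad practice, I used the built-in record separator variable RS to print a newline after each record.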

A sed solution using similar logic looks like this:

With sed

printf "%-20s%s\n" "ID" "Product"
sed -E "s/^.*[[:blank:]]+ID=([^;]*);.*;product=([^;]*)[;]*.*$/\\1\t\\2/" 39322581 | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

Given that you have already been offered a short and elegant perl solution, you might also consider using that if perl is available to you.

Side note: using \n with printf makes the script less portable.

score 0 · Accepted Answer

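Another one in awk. We add ";" to the list of field separators (FS), strip the strings "ID=" and "product=", and print fields 9 and 10. Note that with this sample data field 10 is the locus_tag attribute rather than the product, as the output shows: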

$ awk -F'([ \t\n]+|;)' 'BEGIN{print "ID" OFS "Product"}{gsub(/product=|ID=/,""); print $9,$10}' test.gff
ID Product
AZ909_00020 locus_tag=AZ909_00020
AZ909_00025 locus_tag=AZ909_00025
AZ909_00145 locus_tag=AZ909_00145
AZ909_00030 locus_tag=AZ909_00030
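Because the separator above also splits on whitespace, a product description containing spaces gets broken across several fields. A minimal sketch of an alternative, assuming the nine GFF columns are tab-delimited (the sample in the question may have had its tabs converted to spaces): split the line on tabs only, then split column 9 on ";" and pick attributes by key. The function name attrs_table is hypothetical.

```shell
# Hypothetical helper: print an ID/Product table from tab-delimited GFF input.
# Reads the files given as arguments, or stdin when there are none.
attrs_table() {
    awk -F'\t' -v OFS='\t' '
        BEGIN { print "ID", "Product" }
        {
            id = ""; prod = ""
            n = split($9, kv, ";")            # column 9 holds key=value pairs
            for (i = 1; i <= n; i++) {
                eq  = index(kv[i], "=")       # position of the first "="
                key = substr(kv[i], 1, eq - 1)
                val = substr(kv[i], eq + 1)
                if (key == "ID")           id   = val
                else if (key == "product") prod = val
            }
            print id, prod
        }' "$@"
}

# Usage with one line of the sample data, rebuilt here with explicit tabs.
printf 'accn|CP014704\tRefSeq\tCDS\t31437\t32681\t.\t+\t0\tID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA\n' | attrs_table
```

Looking attributes up by key rather than by position also keeps the result correct when a line carries extra attributes such as gene=.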
