
I have a text file containing the following text:

Query= gi_4849 ref_YP_00.1_ flagellar assembly protein H[Bacillus]--
Query= gi_4851 ref_YP_00.1_ MS-ring protein[Bacillus]--
Query= gi_4852 ref_YP_00.1_ flagellar hook-basal body proteinFliE [Bacillus]--
Query= gi_4851 ref_YP_00.1_ [membrane protein][Bacillus]--
.
.
.

Desired output:

flagellar assembly protein H
MS-ring protein
flagellar hook-basal body proteinFliE
[membrane protein]
.
.
.

I tried the following commands:

sed '/.1_/,/[Bacillus/p' filename > new
sed '/".1_"/,/"[Bacillus"/p' filename > new
awk '/.1_/,/[Bacillus/' filename > new
awk '/".1_"/,/"[Bacillus"/' filename > new

The awk commands do not work, and sed gives an error:

sed: -e expression #1, char 19: unterminated address regex

3 Answers


If you only want to print the matching part of each line, you can do this with GNU grep:

$ grep -Po '_\s\K.*(?=[[])' file
flagellar assembly protein H
MS-ring protein
flagellar hook-basal body proteinFliE 
[membrane protein] 

Or, more explicitly:

$ grep -Po '(?<=ref_YP_00.1_ ).*(?=\[Bacillus]--)' file
flagellar assembly protein H
MS-ring protein
flagellar hook-basal body proteinFliE 
[membrane protein]

If you want to account for optional trailing whitespace:

$ grep -Po '_\s\K.*\S(?=\s?[[])' file 
flagellar assembly protein H
MS-ring protein
flagellar hook-basal body proteinFliE
[membrane protein]

# OR

$ grep -Po '(?<=ref_YP_00.1_ ).*\S(?=\s?\[Bacillus]--)' file 
flagellar assembly protein H
MS-ring protein
flagellar hook-basal body proteinFliE
[membrane protein]
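
The -P option used above is a GNU grep extension for Perl-compatible regular expressions, so these commands may not work with other grep implementations. As a rough alternative, the same extraction could be sketched in awk, which the question also tried. The function name extract_names_awk and the sample variable are only illustrative, and the sketch assumes every wanted line contains .1_ followed by a space and ends with [Bacillus]--.

```shell
# Hypothetical sample data mirroring the input in the question.
sample='Query= gi_4849 ref_YP_00.1_ flagellar assembly protein H[Bacillus]--
Query= gi_4852 ref_YP_00.1_ flagellar hook-basal body proteinFliE [Bacillus]--
Query= gi_4851 ref_YP_00.1_ [membrane protein][Bacillus]--'

# Illustrative helper (not from the original answer).
extract_names_awk() {
  awk '/\.1_ .*\[Bacillus\]/ {
    sub(/.*\.1_ +/, "")           # drop everything up to and including ".1_ "
    sub(/ *\[Bacillus\].*/, "")   # drop optional spaces plus "[Bacillus]--"
    print
  }' "$@"
}

awk_result=$(printf '%s\n' "$sample" | extract_names_awk)
printf '%s\n' "$awk_result"
```

Lines that do not match the guard pattern are skipped rather than printed unchanged.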
answered 2013-09-26T11:44:26.383

With sed, this code works:

$ sed -r 's/.*1_ (.*)\[Bacillus.*/\1/g' file
flagellar assembly protein H
MS-ring protein
flagellar hook-basal body proteinFliE 
[membrane protein]

It takes each line, captures the text between 1_ and [Bacillus in match group #1, and then prints that group back in place of the whole line.
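
The important difference from the commands in the question is the escaped \[. An unescaped [ opens a bracket expression, so in /[Bacillus/ the closing / is read as part of that expression, and sed reports "unterminated address regex". A minimal sketch of the same idea follows, assuming a sed that accepts -E (the portable spelling of -r); the function name extract_names and the sample data are hypothetical, and the pattern is a variation that also drops trailing spaces before [Bacillus.

```shell
# Hypothetical sample data mirroring the input in the question.
sample='Query= gi_4849 ref_YP_00.1_ flagellar assembly protein H[Bacillus]--
Query= gi_4852 ref_YP_00.1_ flagellar hook-basal body proteinFliE [Bacillus]--
Query= gi_4851 ref_YP_00.1_ [membrane protein][Bacillus]--'

# Illustrative helper (not from the original answer).
extract_names() {
  # -n plus the p flag prints only lines where the substitution succeeded.
  # \[ matches a literal bracket; a bare [ would start a bracket expression.
  # (.*[^ ]) forces the captured group to end in a non-space character.
  sed -nE 's/.*1_ (.*[^ ]) *\[Bacillus.*/\1/p' "$@"
}

sed_result=$(printf '%s\n' "$sample" | extract_names)
printf '%s\n' "$sed_result"
```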

answered 2013-09-26T11:40:57.383
perl -lne 'print $1 if /1_ (.*?)\[Bacillus/' your_file
answered 2013-09-26T12:29:39.440