string - How to use sed/grep to extract text between two words?

Question

I am trying to output a string that contains everything between two words of a string:

Input:

"Here is a String"

Output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

score 211 · Accepted Answer

GNU grep also supports positive and negative lookahead and lookbehind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

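If Here and string occur more than once, you can choose whether to match from the first Here to the last string, or to match each pair individually. In regex terms, this is called a greedy match (first case) or a non-greedy match (second case).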

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another

score 140 · Accepted Answer

sed -e 's/Here\(.*\)String/\1/'

Answered 2012-11-06T00:14:09.013

score 84 · Accepted Answer

The accepted answer does not remove text that may come before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and immediately after String.
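A small sketch of how this variant behaves when there is text on either side of the markers. The sample input and the `-n`/`p` combination are additions for illustration, not part of the answer above; `-n` together with the `p` flag suppresses lines where the substitution did not match.

```bash
# Text outside the markers is consumed by the leading and trailing .*
echo "foo Here is a String bar" | sed -e 's/.*Here\(.*\)String.*/\1/'

# Assumption: only lines containing both markers are wanted.
# -n suppresses the default output; the p flag prints only lines where s/// matched.
printf '%s\n' "no markers" "foo Here is a String bar" |
  sed -n 's/.*Here\(.*\)String.*/\1/p'
```

Both commands print ` is a ` with the surrounding spaces still attached; without `-n` and `p`, the second command would also print the `no markers` line unchanged.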

score 46 · Accepted Answer

You can strip the string in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$
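The choice between the single and doubled operators matters once a marker occurs more than once. The following sketch uses a made-up input with two Here/String pairs: `#` and `%` remove the shortest match, `##` and `%%` the longest.

```bash
s='Here is a String, and Here is another String'

# '#' strips the shortest prefix matching '*Here ', i.e. up to the first "Here ".
first=${s#*Here }
# '%%' strips the longest suffix matching ' String*', i.e. from the first " String".
first=${first%% String*}

# '##' strips the longest prefix, i.e. up to the last "Here ".
last=${s##*Here }
last=${last%% String*}

printf '%s\n' "$first" "$last"
```

This prints `is a` and then `is another`. If a marker is absent, the expansion leaves the string unchanged, so it may be worth checking for the markers first.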

If you have a GNU grep built with PCRE support, you can use zero-width assertions:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

score 26 · Accepted Answer

If you have a long file with many multi-line occurrences, it is useful to number the lines first:

cat -n file | sed -n '/Here/,/String/p'
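As a variation, awk can number only the lines that fall inside a range instead of numbering the whole file first. This is a sketch assuming the same Here/String markers; the function name `number_range` and the sample input are invented for illustration.

```bash
number_range() {
  # The range pattern /Here/,/String/ selects lines from a start match
  # to the next end match; NR is the input line number.
  awk '/Here/,/String/ { printf "%d:%s\n", NR, $0 }' "$@"
}

printf '%s\n' 'first line' 'Here it starts' 'middle' 'String ends it' 'last line' |
  number_range
```

One difference to keep in mind: if a single line contains both markers, awk ends the range on that same line, whereas sed only starts looking for the end pattern on the following line.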

score 26 · Accepted Answer

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a

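grep with the -P (perl-regexp) option supports \K, which discards the characters matched so far. In our case, the previously matched string Here is discarded from the final output.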

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a

If you want the output to be is a, without the surrounding spaces, you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

score 9 · Accepted Answer

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

This prints each occurrence of the text between the two markers (in this case Here and String) on its own line, and preserves newlines within the text.
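The one-liner is dense, so here is the same script spread over several lines with comments. This is a reading of the commands, assuming GNU sed (it relies on `\n` in the replacement text); the wrapper name `between_markers` is hypothetical.

```bash
between_markers() {
  sed '
    # Drop lines that do not contain the start marker.
    /Here/!d
    # An empty pattern reuses the last regex (Here): put a newline after it,
    s//&\n/
    # then delete everything up to and including that newline.
    s/.*\n//
    :a
    # Once the end marker is in the pattern space, jump to label b.
    /String/bb
    # Otherwise, unless this is the last line, print the current text
    # and load the next input line, then test again.
    $!{
      n
      ba
    }
    :b
    # The empty pattern now means String: put a newline before it,
    # print up to that newline, delete through it, and rerun on the rest.
    s//\n&/
    P
    D
  ' "$@"
}

echo 'Here is a String, and Here is another String' | between_markers
```

With this input the function prints ` is a ` and ` is another ` on separate lines, since `D` restarts the script on the remainder of the line without reading new input.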

score 9 · Accepted Answer

To understand the sed command, we have to build it up step by step.

Here is your original text

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the Here string with the substitution option in sed

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you would be able to remove String as well

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not your desired output.

To combine two sed commands, use the -e option

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$

Hope this helps
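Note that the combined command still leaves the space that preceded String at the end of the output. A small variation, which is an addition and not part of the answer, includes that space in the second pattern:

```bash
# Remove "Here " (with its trailing space) and " String" (with its leading space).
echo "Here is a String" | sed -e 's/Here //' -e 's/ String//'
```

This prints `is a` with no surrounding whitespace. Like the original, it only deletes the markers themselves; any text before Here or after String would be kept.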

score 9 · Accepted Answer

You can use two s commands

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a

This also works when the markers are repeated:

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

score 8 · Accepted Answer

All of the above solutions have a shortcoming when the final search string is repeated elsewhere in the input. I found it best to write a bash function.

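    function str_str {
      local str
      # Quote the markers so they are matched literally, not as glob patterns.
      str="${1#*"$2"}"
      str="${str%%"$3"*}"
      # printf avoids the portability quirks of echo -n.
      printf '%s' "$str"
    }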

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"
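To make the point about a repeated end marker concrete, here is a sketch comparing a greedy sed substitution with shortest-match parameter expansions. The function name `between_first` and the sample string are hypothetical; the markers are quoted inside the expansions so that they are taken literally.

```bash
between_first() {
  # $1 = text, $2 = start marker, $3 = end marker
  local str=$1
  str=${str#*"$2"}     # drop everything through the first start marker
  str=${str%%"$3"*}    # drop everything from the first remaining end marker onwards
  printf '%s\n' "$str"
}

s='Here is a String and another String'

# The greedy .* inside the group runs on to the last " String":
echo "$s" | sed 's/.*Here \(.*\) String.*/\1/'

# The parameter expansions stop at the first " String":
between_first "$s" 'Here ' ' String'
```

The sed command prints `is a String and another`, while the function prints `is a`.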

score 4 · Accepted Answer

You can use \1 (see http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

Whatever is matched inside the parentheses is stored as \1.

score 1 · Accepted Answer

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject line:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per A2 in this thread, How to use sed/grep to extract text between two words?, the first expression below "works" as long as the matched text does not include a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per Extract text between two strings on different lines:

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Solution 2.

Per How can I replace a newline (\n) using sed?,

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with spaces.

Combining that with the approach in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]]

This variant removes the double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
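As an alternative that does not depend on GNU sed or grep -P, the wrapped header can be unfolded with awk. This is only a sketch: it assumes continuation lines begin with a space or tab, as in the message shown above, and the function name `unfold_subject` and the shortened sample input are made up for illustration.

```bash
unfold_subject() {
  awk '
    # Start of a Subject header: flush any pending one, keep the text after "Subject: ".
    /^Subject: / {
      if (in_subj) print subj
      subj = substr($0, 10)
      in_subj = 1
      next
    }
    # Continuation line: collapse its leading whitespace to one space and append.
    in_subj && /^[ \t]/ {
      sub(/^[ \t]+/, " ")
      subj = subj $0
      next
    }
    # Any other line ends the header.
    in_subj { print subj; in_subj = 0 }
    END { if (in_subj) print subj }
  ' "$@"
}

printf '%s\n' 'Subject: [A] Key molecular' ' link in major' ' pathway' 'Message-ID: <x>' |
  unfold_subject
```

For the sample input this prints `[A] Key molecular link in major pathway` on one line. With a real message file it would be called as `unfold_subject corpus/01`.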
