awk - How can I restrict awk to search only within a particular HTML tag?

Question

I have an AWK script like this, which I run on a file:

cat input.txt | awk 'gsub(/[^ ]*(fish|shark|whale)[^ ]*/,"(&)")' >> output.txt

This wraps every word containing "fish", "shark", or "whale" in parentheses. For example:

The whale asked the shark to swim elsewhere.
The fish were unhappy.

After running through the script, the file becomes:

The (whale) asked the (shark) to swim elsewhere.
The (fish) were unhappy.

The file is marked up with HTML tags, and I need the substitution to happen only between <b> and </b> tags.

The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.

This should become:

The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.

A matching pair of bold tags is never split across lines. The opening tag always appears on the same line as its closing tag.

How can I restrict awk to search and modify only the text found between <b> and </b> tags?

score 1 · Accepted Answer

Here is one technique using awk:

awk '/<b>/{f=1}/<\/b>/{f=0}f{gsub(/fish|shark|whale/,"(&)")}1' RS=' ' ORS=' ' file
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
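
The one-liner above can be unpacked as follows. This is only a commented sketch of the same flag technique, not a different method; the function name `mark_in_bold_words` and the variable name `inside` are made up for illustration, and the input is read from standard input.

```shell
# Hypothetical helper: reads text on stdin, writes the marked-up text to stdout.
mark_in_bold_words() {
  awk '
    BEGIN  { RS = " "; ORS = " " }   # every space-separated word becomes one record
    /<b>/   { inside = 1 }           # word carrying the opening tag: switch the flag on
    /<\/b>/ { inside = 0 }           # word carrying the closing tag: switch the flag off
    inside  { gsub(/fish|shark|whale/, "(&)") }   # mark keywords only while the flag is on
            { print }                # emit every word, followed by ORS (a space)
  '
}

printf '%s\n' 'The whale asked <b>the shark to swim</b> elsewhere.' \
              '<b>The fish were</b> unhappy.' | mark_in_bold_words
```

Because the flag is cleared before the substitution runs, a keyword that shares a word with the closing tag (for example `shark</b>` or `<b>shark</b>`) is left unmarked. The output also ends with one extra space, since ORS is written after the last record.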

score 1 · Accepted Answer

As long as the HTML markup is no worse than this, and the <b>...</b> span contains no other HTML tags, this is relatively easy in Perl:

$ cat data
The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$ perl -pe 's/(<b>[^<]*)\b(fish|shark|whale)\b([^<]*<\/b>)/\1(\2)\3/g'  data | so
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$

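I tried adapting this to awk (and gawk), without success: the matching part worked, but the replacement expression did not. From my reading of the manual, awk, unlike Perl, does not let the replacement refer to the individual parenthesized sub-expressions of the match.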
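
For what it is worth, the lack of backreferences in `sub`/`gsub` can be worked around in portable awk by isolating each span with `match()` (which sets `RSTART` and `RLENGTH`) and substituting inside that span only. The sketch below assumes, as stated above, that a `<b>...</b>` span never contains another tag and never crosses a line. The function name `mark_bold_spans` is hypothetical, and the keyword pattern is the one from the question, adjusted so that it cannot swallow an adjacent tag. gawk users could also look at `gensub()`, which does accept `\\1`-style references; it is avoided here to stay portable.

```shell
# Hypothetical helper: reads text on stdin, writes the marked-up text to stdout.
mark_bold_spans() {
  awk '
    {
      out  = ""
      rest = $0
      # Walk over every <b>...</b> span on the line that contains no other tag.
      while (match(rest, /<b>[^<]*<\/b>/)) {
        span = substr(rest, RSTART, RLENGTH)
        # Wrap each word containing a keyword; < and > are excluded so tags stay outside.
        gsub(/[^ <>]*(fish|shark|whale)[^ <>]*/, "(&)", span)
        out  = out substr(rest, 1, RSTART - 1) span   # untouched text, then the edited span
        rest = substr(rest, RSTART + RLENGTH)         # continue after the span
      }
      print out rest
    }
  '
}

printf '%s\n' 'The whale asked <b>the shark to swim</b> elsewhere.' \
              '<b>The fish were</b> unhappy.' | mark_bold_spans
```

Like the script in the question, this marks any word that merely contains a keyword, so `dogfish` inside a bold span would become `(dogfish)`; the Perl version above, which uses `\b`, leaves such words alone.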
