regex - Bash: Regex matching on multiple lines simultaneously and extracting captured content

Question

I have an XML file in the following format:

<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
..
..
..

I want to extract the name attribute of every starttag that has at least one innertag whose value is YYY.

So for the file above, the output should be AAA and CCC. I can only use regex matching. I suppose it is possible with lookaheads, but I have not been able to write a regex pattern that spans multiple lines. I know how to use regex on a single line and tried the same approach here, but I am not getting the expected output. Does anyone have any pointers?

Edit: Although my example is XML, what I actually want to learn is multiline regex matching, and this is the file I am practicing on and failing with. Please avoid XML-parsing solutions.

Update: As per Steven's suggestion, the following worked:

pcregrep -M '<starttag name="([^"])*"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

grep -Pzo '<starttag name="([^"])*"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

score 1 · Accepted Answer

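Consider using XMLStarlet.

"XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using a simple set of shell commands, in a similar way to how it is done for plain text files using UNIX grep, sed, awk, diff, patch, join, etc. commands."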

score 0 · Accepted Answer

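An XML parser, especially one that supports XPath, would be much easier and more robust, but if you really must stick with regex, here is a pattern that works with the sample input you provided: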

<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>

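It will not work for every variation of a well-formed XML document, but as long as your files are formatted as consistently as your example, you should be "fine".

By default, a regular expression always matches across multiple lines. There is an option to make it process only one line at a time, but that is usually not turned on by default. The only real trick is that the . pattern does not match newline characters, so if you want to match any character, including newlines, you need to use .|\n or a negated character class such as [^>].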
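
To make the point about matching across newlines concrete, here is a minimal sketch that prints only the starttag names rather than the whole block. It assumes GNU grep built with PCRE support (the -P flag). The temporary file and the variable names sample, pattern and names are made up for illustration, and the use of \K with a lookahead is a variation on the pattern above, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU grep with PCRE support (-P).
# "sample" is a hypothetical temp file holding the question's example input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
EOF

# \K discards everything matched so far, so only the name is reported.
# The lookahead (?=...) then requires an innertag with value="YYY";
# only whitespace (\s, which includes newlines) and other innertags may
# come before it, so the match cannot leak into the next starttag block.
pattern='<starttag name="\K[^"]*(?="[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"/>)'

# -z reads the whole file as a single record instead of line by line.
# -o prints each match followed by a NUL byte; tr turns NULs into newlines.
names=$(grep -Pzo "$pattern" "$sample" | tr '\0' '\n')
printf '%s\n' "$names"

rm -f "$sample"
```

With the sample input from the question, this should print AAA and CCC on separate lines. Without -z, grep would test the pattern against one line at a time and could never see a starttag and its innertag lines together.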
