I have an XML file in the following format:

<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
..
..
..

I want to extract the name attribute of every starttag that has at least one innertag whose value is YYY.

So for the file above, the output should be AAA and CCC. I can only use regex matching. I suppose it is possible using lookaheads, but I have not been able to write a regex pattern that works across multiple lines. I know how to use regex on a single line, and I tried the same approach here, but I am not getting the expected output. Does anyone have any pointers on this?

Edit: Although I used an XML example, what I am actually trying to learn is multiline regex matching, and this file is the case I am failing on. Please avoid XML-parsing solutions.

Update: As per Steven's suggestion, the following worked:

pcregrep -M '<starttag name="([^"])*"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

grep -Pzo '<starttag name="([^"])*"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

2 Answers

Consider using XMLStarlet.

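"XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using a simple set of shell commands, in a similar way to how it is done for plain text files using the UNIX grep, sed, awk, diff, patch, join, etc. commands."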

answered 2016-01-28T13:31:32.340

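An XML parser, especially one that supports XPath, would be far easier and more robust, but if you really must stick with regular expressions, here is a pattern that works with the sample input you provided: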

<starttag name="([^"])*"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>

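It won't work for every variation of a well-formed XML document, but as long as your files are formatted as consistently as your sample, you should be "fine".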
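To illustrate how the pattern above behaves on the sample input, here is a minimal Python sketch. It is only an illustration, not the command the asker ran: the `SAMPLE` string and the `names_with_value` helper are hypothetical names, and the pattern is adapted slightly on the assumption that the goal is to pull out the name itself (the capture is written as `([^"]*)` so it holds the whole name rather than only its last character, and the repeated groups are made non-capturing).

```python
import re

# Hypothetical test input, copied from the question's sample.
SAMPLE = '''<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
'''

# Adapted from the answer's pattern: group 1 captures the full name,
# and \s (which matches newlines) lets the match span several lines.
PATTERN = re.compile(
    r'<starttag name="([^"]*)"[^>]*>'        # opening tag, capture name
    r'(?:\s|<innertag[^>]*>)*'               # any inner tags / whitespace
    r'<innertag name="[^"]*" value="YYY"/>'  # the inner tag we require
    r'(?:\s|<innertag[^>]*>)*'               # remaining inner tags
    r'</starttag>'                           # closing tag
)

def names_with_value(text):
    """Return the name of every starttag block containing value="YYY"."""
    return [m.group(1) for m in PATTERN.finditer(text)]

print(names_with_value(SAMPLE))
```

Because neither `\s` nor `<innertag[^>]*>` can consume `</starttag>`, a match cannot leak from one block into the next, which is why BBB is skipped.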

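By default, a regular expression engine matches across multiple lines. There is an option to make it work one line at a time, but it is usually not enabled by default. (Line-oriented tools such as grep are the exception: they feed the pattern one line at a time unless told otherwise, which is what grep -z and pcregrep -M are for.) The only real trick is that the . pattern does not match newline characters, so if you want to match any character, including a newline, you need to use .|\n or a negated character class such as [^>].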
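The point about `.` and newlines can be checked with a short Python sketch. The `TEXT` string and the variable names are made up for illustration, and `re.DOTALL` is Python's flag for letting `.` match newlines; other regex flavours spell that option differently.

```python
import re

# Hypothetical three-line input.
TEXT = "<a>\n  <b/>\n</a>"

# '.' stops at the first newline, so this cannot reach '</a>'.
dot_default = re.search(r"<a>.*</a>", TEXT)

# With DOTALL, '.' also matches newlines and the match spans all lines.
dot_dotall = re.search(r"<a>.*</a>", TEXT, re.DOTALL)

# The flag-free workaround from the answer: '.|\n'.
dot_or_newline = re.search(r"<a>(?:.|\n)*</a>", TEXT)

# A negated character class matches newlines without any flag.
negated = re.search(r"<a>[^>]*>", TEXT)

print(dot_default)             # no match
print(dot_dotall is not None)  # matched across lines
print(repr(negated.group(0)))  # includes the embedded newline
```

The negated class is usually the more precise choice for tag-like input, since it also stops the match at the next `>` instead of running to the end of the text.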

answered 2016-01-28T13:35:37.707