php - Parsing non-node, intermittent XML values with regex

Question

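This one is for the regex gurus.

If I have a series of XML nodes, I would like to parse out (using regex) the text values that sit at the same level as the current node, ignoring anything inside child nodes. For example, if I have: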

<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>

I would like to retrieve this array:

array(
    1 => 'Hi',
    2 => 'Hey',
    3 => 'Bar'
)

I know I can start with

$inside = preg_match('~<(\S+).*?>(?P<inside>(.|\s)*)</\1>~', $original_text);

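This retrieves the contents without the top-node wrapper. However, the next step is a bit beyond my regex abilities.

EDIT: Actually, the preg_match only seems to work if the contents of $original_text are all on one line. Also, I think I can use preg_split with a very similar regex to retrieve what I'm looking for; it just doesn't work across multiple lines.

Note: I appreciate and will oblige any requests for clarification; however, my question is quite specific and I mean what I am asking, so please don't give answers like "go use SimpleXML". Thanks for any help.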

score 1 · Accepted Answer

Description

This regex will capture the first-level text:

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*


Expanded

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?   # match any open tags until the close tags if they exist
[\s\r\n]*    # match any leading spaces or new line characters 
\K           # reset the capture and only capture the desired substring which follows
(?!\Z)       # validate substring is not the end of the string, this prevents the phantom empty array value at the end
(?:(?![\s\r\n]*(?:<|\Z)).)*    # capture the text inside the current substring, this expression is self limiting and will stop when it sees whitespace ahead followed by end of string or a new tag

Example

Sample text

This assumes you have already removed the outer top-level tag.

Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar

Capture groups

0: the actual captured text
1: the name of the child tag, which is then back-referenced within the regex

[0] => Array
    (
        [0] => Hi
        [1] => Hey
        [2] => Bar
    )

[1] => Array
    (
        [0] => 
        [1] => second-node
        [2] => third-node
    )
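
For anyone who wants to try this in PHP, here is a minimal sketch of how the pattern might be run with preg_match_all. The ~ delimiters, the s modifier (so that . can cross line breaks) and the variable names are assumptions of this sketch; the answer does not say which flags it was tested with.

```php
<?php
// Sample text with the outer <top-node> wrapper already removed.
$text = <<<'XML'
Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar
XML;

// The pattern from the answer, kept in a nowdoc so its quotes and
// backslashes need no extra escaping.
// Assumption: "~" delimiters and the "s" (dot matches newline) modifier.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the first-level text; $matches[1] holds the name of
// the child tag that was skipped just before each piece of text.
print_r($matches[0]);
```

Because \K resets the start of the reported match, the skipped child element never appears in $matches[0]; only its tag name survives in group 1.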

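Disclaimer

This solution gets tripped up by nested structures in which a child tag has the same name as its parent, because the lazy .*?<\/\1> part stops at the first closing tag with that name. For example: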

Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey
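
To see the limitation for yourself, the sketch below runs the answer's pattern against the nested sample (assuming ~ delimiters and the s modifier, neither of which the answer specifies). Since the lazy match ends at the inner closing tag, text that belongs to the nested level ends up among the results.

```php
<?php
// Nested sample: <second-node> contains another <second-node>.
$nested = <<<'XML'
Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey
XML;

// Same pattern as in the answer; delimiters and the "s" modifier are assumptions.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

preg_match_all($pattern, $nested, $matches);

// The second-level text leaks into the first-level results.
$leaked = in_array('This string will be found', $matches[0], true);
var_dump($leaked);
print_r($matches[0]);
```

If your input can contain same-named nested tags, a regex with a single back-reference like this one is probably not enough on its own.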

score 1 · Accepted Answer

Going with your own idea of using preg_split, I came up with:

$raw="<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

$reg='~<(\S+).*?>(.*?)</\1>~s';
preg_match_all($reg, $raw, $res);
$res = explode(chr(31), preg_replace($reg, chr(31), $res[2][0]));

Note: chr(31) is the ASCII "unit separator" character.

Test the resulting array with:

echo ("<xmp>start\n" . print_r($res, true) . "\nfin</xmp>");

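This seems to work for 1 node and gives you the array you asked for, but it could run into all kinds of problems. You will probably want to trim the returned values too.

EDIT:
Denomales's answer is probably better.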
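
As a sketch of the trimming step, the two regex passes can be wrapped in a small function that also trims each piece. The function name firstLevelText is hypothetical, and preg_split stands in here for the preg_replace/explode pair, so treat this as one possible variation rather than the code above.

```php
<?php
// Hypothetical helper: returns the trimmed first-level text chunks.
function firstLevelText(string $raw): array
{
    // Same pattern as above: an element, its tag name back-referenced
    // in the closing tag, with "s" so that "." also matches newlines.
    $reg = '~<(\S+).*?>(.*?)</\1>~s';

    // Step 1: grab the content of the outermost element.
    if (!preg_match($reg, $raw, $outer)) {
        return [];
    }

    // Step 2: split that content wherever a complete child element occurs.
    $parts = preg_split($reg, $outer[2]);
    if ($parts === false) {
        return [];
    }

    // Step 3: trim whitespace and drop pieces that were only whitespace.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, function ($s) {
        return $s !== '';
    }));
}

$raw = <<<'XML'
<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>
XML;

$result = firstLevelText($raw);
print_r($result);
```

Note that the result is indexed from 0 rather than 1, and that the same caveat about same-named nested tags applies to this pattern as well.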
