regex - Matching paired tags with a regular expression

Question

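I am trying to retrieve specific tags and their content from an XHTML document, but my regex matches the wrong closing tag.

In the following content: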

<cache_namespace name="content">
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>

the closing tag of the content_block with id="welcome" is actually matched as the closing tag of the first opening content_block tag.

The regular expression I am using is:

/<content_block id="(.*)">([\w\W]*?)<\/content_block>/i

Any pointers on where I am going wrong?

score 6 · Accepted Answer

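...the answer is always the same: HTML + regex cannot be done. Sorry. Use an HTML parsing library for your particular framework. Or, if your document is guaranteed to contain only valid XHTML, take the XPath approach that jitter suggested in the comments.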
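As a rough illustration of the parser route, here is a minimal Python sketch. It assumes the document is well-formed XML and uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset (attribute predicates such as [@id='...'] are included). The variable names are just for this example.

```python
import xml.etree.ElementTree as ET

XML = """<cache_namespace name="content">
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>"""

root = ET.fromstring(XML)

# Select the nested block by its id attribute; the parser pairs the tags.
welcome = root.find(".//content_block[@id='welcome']")
welcome_text = welcome.text.strip()
print(welcome_text)

# The outer block still contains its nested content_block as a descendant.
outer = root.find(".//content_block[@id='15']")
nested_ids = [el.get("id") for el in outer.iter("content_block")]
print(nested_ids)
```

Because the parser builds a tree, each closing tag is already paired with the right opening tag, so there is nothing left for a regex to guess.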

score 3 · Accepted Answer

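This might help: I found a tutorial at http://www.regular-expressions.info/examples.html that covers capturing a pair of strings that recurs in a given text. The suggestion is to put a ? after .* so that matching stops at the first occurrence of the pair's closing string.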
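To make the difference concrete, a small Python sketch on flat (non-nested) pairs follows. The sample text is made up for illustration. Note that this only helps when pairs are not nested inside each other.

```python
import re

# Hypothetical flat input: two pairs side by side, no nesting.
text = "<b>first</b> and <b>second</b>"

# Greedy .* runs to the last closing tag, so both pairs merge into one match.
greedy = re.findall(r"<b>(.*)</b>", text)

# Lazy .*? stops at the first closing tag, giving one match per pair.
lazy = re.findall(r"<b>(.*?)</b>", text)

print(greedy)
print(lazy)
```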

score 1 · Accepted Answer

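This is a known problem with regular expressions: you cannot match pairs. Matching is either greedy, in which case it matches the last closing tag it finds, or non-greedy, in which case it matches the first one. You cannot persuade a regex to count opening and closing brackets.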
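Both failure modes can be seen on a small input. This is a Python sketch with a made-up sample (one nested block followed by a sibling), and the patterns are simplified variants of the one in the question.

```python
import re

# Hypothetical sample: block "a" contains block "b" and is followed by sibling "c".
SAMPLE = (
    '<content_block id="a">'
    '<content_block id="b">inner</content_block>'
    '</content_block>'
    '<content_block id="c">sibling</content_block>'
)

LAZY = r'<content_block id="([^"]*)">([\w\W]*?)</content_block>'
GREEDY = r'<content_block id="([^"]*)">([\w\W]*)</content_block>'

lazy = re.search(LAZY, SAMPLE)
greedy = re.search(GREEDY, SAMPLE)

# What a correct pairing would return as the body of block "a".
wanted = '<content_block id="b">inner</content_block>'

print(lazy.group(2))    # stops too early, at the closing tag of "b"
print(greedy.group(2))  # runs too far, into sibling "c"
```

Neither quantifier yields the wanted body, because neither keeps track of how many blocks are currently open.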

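I'd suggest loading it into a DOM and working with that. If you are trying to implement an HTML parser, I'd suggest lexing the input with regexes and then using a left-to-right parser to parse the lexer's output.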
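A minimal sketch of that split in Python: a regex acts only as the lexer that finds tags, and a stack does the pairing. The function name find_element and its parameters are hypothetical, and the sketch assumes well-formed input without comments, CDATA sections, or ">" inside attribute values.

```python
import re

# Lexer: recognises opening, closing, and self-closing tags.
TAG = re.compile(r"<(/?)(\w+)([^>]*?)(/?)>")

def find_element(source, name, attr_text):
    """Return the text of the first <name ...> element whose attribute
    string contains attr_text, including nested children, or None."""
    stack = []    # (tag name, is_target) for every element currently open
    start = None  # offset where the target element begins
    for m in TAG.finditer(source):
        closing, tag, attrs, self_closing = m.groups()
        if self_closing:
            continue  # <tag/> opens and closes at once; nothing to track
        if not closing:
            is_target = start is None and tag == name and attr_text in attrs
            if is_target:
                start = m.start()
            stack.append((tag, is_target))
        else:
            if not stack or stack[-1][0] != tag:
                raise ValueError(f"unbalanced </{tag}> at offset {m.start()}")
            _, was_target = stack.pop()
            if was_target:
                return source[start:m.end()]
    return None

XML = """<cache_namespace name="content">
    <content_block id="15">
    some content here
        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>"""

outer = find_element(XML, "content_block", 'id="15"')
inner = find_element(XML, "content_block", 'id="welcome"')
print(inner)
```

The stack is what the regex alone lacks: a closing tag is accepted only when it matches the most recently opened element.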

score 0 · Accepted Answer

Thanks to @Jan Żankowski and @ikegami, whose answers inspired me.

Let me demonstrate the code in PHP:

<?php
$xml = <<<EOT
<cache_namespace name="content">
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>
EOT;

preg_match('/<cache_namespace[^>]+>((?:(?!(<\/?div>)).)*)<\/cache_namespace>/s', $xml, $matches);
print_r($matches);

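Regex notes

s option: a . in the pattern matches every character, including newlines
The key here is (?:(?!STRING).)*, which works for a string the way [^CHAR]* works for a single character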

Result

Array
(
    [0] => <cache_namespace name="content">
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>
    [1] => 
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>

)
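For comparison, here is a Python sketch of the same (?:(?!STRING).)* construct with the excluded string set to the tag being matched. This is an assumption about how the idiom could be applied to the question, not the pattern used above: in the PHP pattern the lookahead excludes div tags, which do not occur in the sample, so there it behaves like a greedy .* and returns the outermost element.

```python
import re

XML = """<cache_namespace name="content">
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>"""

# (?:(?!</?content_block).)* consumes one character at a time, but only while
# no opening or closing content_block tag starts at the current position.
# As a result only a block with no nested content_block can match.
INNERMOST = re.compile(
    r'<content_block id="([^"]*)">((?:(?!</?content_block).)*)</content_block>',
    re.S,  # same role as PHP's /s: "." also matches newlines
)

m = INNERMOST.search(XML)
block_id = m.group(1)
body = m.group(2).strip()
print(block_id, body)
```

This finds the innermost block reliably, but it still cannot return an outer block together with its nested children; that needs recursion or a parser.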

score -2 · Accepted Answer

Parsing XHTML or XML is not that hard. I assume you have valid or well-formed code.

#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
my $xml = <<"EOF";
<cache_namespace name="content">
    <content_block id="15">
    some content here

        <cache_namespace name="user">
            <content_block id="welcome">
            Welcome Apikot!
            </content_block>
        </cache_namespace>
    </content_block>
</cache_namespace>
EOF

while ($xml =~ m!
<(content_block)\sid="welcome"> # Start tag definition.
 (\s*                           # It may consist of
   (?: <\!--.*?-->              # - comment
   |  [^<]*                     # - text
   |  <[^>]+/>                  # - another closed tag
   |  <\s*(\w+)[^>]*>           # - another tag with some content
       (?2)+                    # (recursive definition of possible tag content)
      </\3>
   )
 )*
</\1>
!sxgc) {
    print "==> $&\n\n";
}

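Adjust the start tag definition for other content (e.g. <\s*(\w+)[^>]*+>). In any case, this is a good starting point.

If you do not use recursion (the line with (?2)+), you will get stuck on examples like this one. This code handles all of them (see here) or can easily be adapted to new cases.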
