php - Remove <br> tags with DOMXPath or a regular expression?

Question

I use DOMXPath to remove HTML tags that have empty text nodes while keeping <br> tags:

$xpath = new DOMXPath($dom);

while(($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0) 
{
    foreach ($nodeList as $node) 
    {
        $node->parentNode->removeChild($node);
    }
}

It worked fine until I ran into another problem:

$content = '<p><br/><br/><br/><br/></p>';

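How can I remove this kind of messy <br/> and <p>? That is, I don't want to allow <br/> on its own; I only want to allow it alongside proper text, like this: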

$content = '<p>first break <br/> second break <br/> the last line</p>';

Is that possible?

Or would a regular expression be better?

I tried something like this:

$nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
    foreach($nodeList as $node) 
    {
        $node->parentNode->removeChild($node);
    }

But it returns this error:

Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...

score 3 · Accepted Answer

You can select the unwanted p elements with XPath:

"//p[count(*)=count(br) and br and normalize-space(.)='']"

Also, for selecting elements with empty text, wouldn't it be better to use the following?

"//*[normalize-space(.)='' and not(self::br)]"

This selects any element (other than br) that contains no non-whitespace text, including nodes such as:

<p><b/><i/></p>

or

<p> <br/>   <br/>
</p>
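As a minimal sketch of how the first XPath expression in this answer might be applied, the snippet below loads a fragment, removes every p that contains only br elements and whitespace, and returns what is left of the body. The function name removeBrOnlyParagraphs is hypothetical, and the sketch assumes ASCII input and a libxml build that wraps the fragment in html/body when loading.

```php
<?php
// Hypothetical helper, not from the original post: drops <p> elements
// whose only children are <br> elements and whose text is all whitespace.
function removeBrOnlyParagraphs(string $html): string
{
    $dom = new DOMDocument();
    // Assumption: loadHTML() wraps the fragment in <html><body>...</body></html>.
    $dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $query = "//p[count(*)=count(br) and br and normalize-space(.)='']";

    // The node list returned by query() is a snapshot, so removing
    // nodes while iterating over it is safe.
    foreach ($xpath->query($query) as $p) {
        $p->parentNode->removeChild($p);
    }

    // Serialize only the children of <body>, skipping the wrapper elements.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}
```

Note that saveHTML() serializes in HTML syntax, so a kept <br/> will typically come back as <br>.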

score 1 · Accepted Answer

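You can get rid of them by simply checking that the only content in the paragraph is whitespace and <br /> tags. Note that preg_replace() requires pattern delimiters, here /:

preg_replace('/\<p\>(\s|\<br\s*\/\>)*\<\/p\>/', '', $content);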

Breakdown:

\<p\>    # Match for <p>
(        # Beginning of a group
  \s       # Match a space character
  |        # or...
  \<br\s*\/\> # match a <br /> tag, with any number (including 0) spaces between the <br and />
)*       # Match this whole group (spaces or <br /> tags) 0 or more times.
\<\/p\>  # Match for </p>

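However, I should mention that unless your HTML is well formed (single line, no odd whitespace, no classes on the paragraphs, and so on), you shouldn't use a regular expression to parse it. If it is, this regex should work fine.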
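A quick sketch of this regex applied to the two strings from the question. The pattern is the one from this answer, wrapped in / delimiters because PHP's preg_* functions require them; the variable names are illustrative only.

```php
<?php
// Pattern from this answer, with '/' delimiters added.
$pattern = '/\<p\>(\s|\<br\s*\/\>)*\<\/p\>/';

$messy  = '<p><br/><br/><br/><br/></p>';
$proper = '<p>first break <br/> second break <br/> the last line</p>';

// The paragraph holding only <br/> tags is removed entirely.
$cleanedMessy = preg_replace($pattern, '', $messy);

// The paragraph with real text does not match, so it is left as it was.
$cleanedProper = preg_replace($pattern, '', $proper);

var_dump($cleanedMessy === '');        // bool(true)
var_dump($cleanedProper === $proper);  // bool(true)
```

As written, the pattern will not match a paragraph with attributes (for example a class) or a <br> without the closing slash, which is the kind of fragility the caveat above refers to.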

score 1 · Accepted Answer

I had almost the same situation, and I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

Then I use urldecode() to change it back for display or before inserting it into the database. It works for me.
