php - How to filter empty nodes with DOM?

Question

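I only want to get the elements that contain some real text or child element nodes (not whitespace, &nbsp;, and so on).

I tried it with the following HTML: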

<p>&nbsp;</p>
<div>&nbsp;</div>

So far I have tried this code:

$dom = new DOMDocument;

$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;

$i = 0;
while (is_object($html_synch = $dom->getElementsByTagName("body")->item($i))) {
    foreach ($html_synch->childNodes as $node) {
        if ($node->nodeName != "script" && $node->nodeName != "style" &&
                XML_COMMENT_NODE != $node->nodeType):
            get_children($node);
        endif;
    }
    $i++;
}

Then, in the get_children function, I use this code to filter out empty nodes or &nbsp; nodes:

foreach ($node->childNodes as $child) :
    if (trim($child->nodeValue) != ""):
        echo $child->nodeValue;  // it returns Â
        echo $child->nodeName;   // it returns #text
        array_push($children_type, $child->nodeType);
    endif;
endforeach;
print_r($children_type);

For just <p>&nbsp;</p> it returns "#text Â" and "Array ( [0] => 3 )". So how can I filter these nodes out? I do know that #text is the special node name for text nodes.

score 2 · Accepted Answer

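A quick explanation first: the reason you see Â is that your HTML document is treated as UTF-8, but you are displaying it as ISO 8859-1. In UTF-8, the non-breaking space (&nbsp;) is encoded as two bytes, 0xC2 0xA0. In ISO 8859-1 it is the single byte 0xA0, and 0xC2 is Â.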
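
The byte-level claim above can be checked with a short sketch. It assumes the mbstring extension is available; the variable names are made up for illustration and do not come from the question.

```php
<?php
// Decode the entity into the raw UTF-8 bytes that the DOM extension returns.
$nbsp = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo bin2hex($nbsp), "\n";    // c2a0

// Treat each byte as an ISO 8859-1 character and re-encode it as UTF-8.
// This mimics UTF-8 output being displayed as ISO 8859-1:
// 0xC2 becomes "Â" (c3 82) and 0xA0 stays a non-breaking space (c2 a0).
$misread = mb_convert_encoding($nbsp, 'UTF-8', 'ISO-8859-1');

echo bin2hex($misread), "\n"; // c382c2a0
```

Sending a matching charset with the page (for example a Content-Type header that declares UTF-8) should make the stray Â disappear from the display, although the non-breaking space itself is still present in the node value.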

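Now, the second parameter of trim() lets you specify which characters should be trimmed, so you can include the non-breaking space (the default characters then have to be listed explicitly as well):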

if (trim($child->nodeValue, " \n\r\t\0\xC2\xA0") !== ""):
    // value is not empty
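
As a rough illustration of the difference the character list makes, here is a minimal sketch. The variable names are hypothetical, and the list below also includes the vertical tab (\v), which is part of trim()'s documented defaults but not of the list used above.

```php
<?php
$nbsp  = "\xC2\xA0";             // UTF-8 bytes of a non-breaking space
$chars = " \n\r\t\v\0\xC2\xA0";  // trim() defaults plus the two NBSP bytes

// By default trim() leaves the non-breaking space in place.
var_dump(trim(" {$nbsp}\n") === '');          // bool(false)

// With the extended list the value is reduced to an empty string.
var_dump(trim(" {$nbsp}\n", $chars) === '');  // bool(true)

// Real text between non-breaking spaces survives.
var_dump(trim("{$nbsp}text{$nbsp}", $chars)); // string(4) "text"
```

Note that trim() works on single bytes, so 0xC2 and 0xA0 are stripped independently of each other. That is fine for an emptiness check like the one here, but it could cut into other multibyte characters at the edges of a string if the trimmed value were used for output.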

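At the moment your function does not filter anything, so I am not sure what exactly you want to do with these items. The rest should be easy, though, for example:

Count the child nodes whose node type is not text or whose value is not empty
If count > 0, keep the element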
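
The two steps above could be sketched roughly as follows. The function name hasRealContent and the sample markup are assumptions made for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper: counts the children that are either not text nodes
// or text nodes with non-blank content, and keeps the element if any exist.
function hasRealContent(DOMNode $element): bool
{
    $count = 0;
    foreach ($element->childNodes as $child) {
        $isText  = $child->nodeType === XML_TEXT_NODE;
        $isBlank = trim((string) $child->nodeValue, " \n\r\t\0\xC2\xA0") === '';
        if (!$isText || !$isBlank) {
            $count++;
        }
    }
    return $count > 0;
}

$dom = new DOMDocument;
$dom->loadHTML('<div><p>&nbsp;</p><p>text</p><p><br></p></div>');

$results = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $results[] = hasRealContent($p) ? 'keep' : 'filter';
}

print_r($results); // filter, keep, keep
```

With this rule any non-text child counts as content, so an element that holds only a comment or an empty child element is kept as well. Whether that is desirable depends on what the filtered result is used for.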

Update

The rest of your code is a bit rough, so I made a minimal working example:

Test code:

$html = <<<HTML
<body>
 <div>
  <p>not-empty</p>
  <p>&nbsp;</p>
  <div>&nbsp;</div>
 </div>
</body>
HTML;


$dom = new DOMDocument;

$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;

$xpath = new DOMXPath($dom);

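// Flag every element that has no child nodes at all, or whose text content
// is blank once whitespace and non-breaking spaces are trimmed.
foreach ($xpath->query('//*') as $node) {
  if ($node->childNodes->length === 0 || trim($node->nodeValue, " \n\r\t\0\xC2\xA0") === '') {
    echo 'to filter: ' . $node->getNodePath() . "\n";
  }
}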

Test output:

to filter: /html/body/div/p[2]
to filter: /html/body/div/div
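
If the goal is to drop the flagged elements rather than just list them, one possible follow-up is sketched below. It is only an assumption about what "filtering" should mean here, and it restricts the query to descendants of body so that html and body themselves are never removed.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<body><div><p>not-empty</p><p>&nbsp;</p><div>&nbsp;</div></div></body>');

$xpath = new DOMXPath($dom);

// Collect the blank elements first, then remove them in a second pass,
// so the document is not modified while the query result is being walked.
$toRemove = [];
foreach ($xpath->query('//body//*') as $node) {
    if (trim($node->nodeValue, " \n\r\t\0\xC2\xA0") === '') {
        $toRemove[] = $node;
    }
}

foreach ($toRemove as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Print what is left of the body.
echo $dom->saveHTML($dom->getElementsByTagName('body')->item(0)), "\n";
```

Be aware that this check also treats elements such as br or img as empty, because they have no text content. A real filter would probably need an allow-list for those tags.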
