php - Extracting text from leaf nodes with DomDocument in PHP

Question

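I am using PHP to fetch various web pages and load them into a DomDocument, but I am having trouble extracting text from the leaf nodes only.

For example, suppose I have the following: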

<html>
    <body>
        <div class="this_is_our_div_of_interest">
            <div>
                <div>
                    <p>Some text</p>
                    <div>Some <a href='#'>more</a> text</div>
                    <p>And <span><strong>another</strong></span> paragraph</p>
                </div>
                <p>Yay<p>
            </div>
            <div>
                <h4>abcd</ph4>
                xyz
            <div>
        </div>
        <div class="we_do_not_want_those_divs">
            <p>This text is not important to us</p>
        </div>
    </body>
</html>

As you can see, this is messy input, but the expected "echo" output is:

Some text
Some more text
And another paragraph
Yay
abcd
xyz

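Note the following about the output:

I only want output from inside one specific element (in our case, the div with class this_is_our_div_of_interest).
The tree above does not follow a fixed format, because it comes from web pages whose content I cannot control. However, I only want the content of tags such as div and p that appear to be leaf nodes.
Some tags should be ignored as nodes of their own, such as a, span and strong (others may be added to the list later).

Update: I am using XPath to reach the class. For example, the following line of code returns all descendants as separate nodes: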

$nodes = $xpath->query("//div[@class='this_is_our_div_of_interest']/descendant::*");

score 0 · Accepted Answer

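You can do the following:

$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
// Look the element up by its id attribute
$id = $dom->getElementById('youNeedAnIdForThis');

Now you can work with $id, which holds the matching DOMElement (or null if no element has that id).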

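Unfortunately, there is no built-in getElementsByClassName, but I found one at http://pastebin.com/4qYMEGqV. Your code would then look like this:

$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
// getElementsByClassName() here is the user-defined function from the pastebin link, not a DOMDocument method
$class = getElementsByClassName($dom, 'this_is_our_div_of_interest');

$class[0] should now hold what you are looking for.

Then you could use strip_tags() if you only want the text. Note that strip_tags() works on an HTML string, so the node has to be serialized first, for example with $dom->saveHTML($node).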
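
If the pastebin helper is not available, a class lookup can also be sketched with DOMXPath, which ships with PHP's DOM extension. The function name findByClass and the inline sample markup below are hypothetical, and the class name is assumed to contain no quote characters:

```php
<?php
// Hypothetical helper (not a built-in): find elements by class name via XPath.
function findByClass(DOMDocument $dom, string $class): array
{
    $xpath = new DOMXPath($dom);
    // Pad with spaces so that 'foo' does not match 'foobar'.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    return iterator_to_array($xpath->query($query));
}

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body>'
    . '<div class="a this_is_our_div_of_interest"><p>Some text</p></div>'
    . '<div class="we_do_not_want_those_divs"><p>Skip</p></div>'
    . '</body></html>'
);

$matches = findByClass($dom, 'this_is_our_div_of_interest');
echo count($matches), "\n";                 // number of matching elements
echo trim($matches[0]->textContent), "\n";  // text inside the first match
```

Unlike the exact comparison @class='...' used in the question, this expression also matches elements that carry several classes.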
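
Since strip_tags() takes a string rather than a DOM node, one way to apply it is to serialize the node with saveHTML() first. This is a minimal sketch on made-up inline markup; the variable names are illustrative only:

```php
<?php
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>Some <a href='#'>more</a> text</div></body></html>");

// Take the first div and serialize just that node back to an HTML string.
$node = $dom->getElementsByTagName('div')->item(0);
$html = $dom->saveHTML($node);

// strip_tags() removes the markup and keeps the text between the tags.
$text = trim(strip_tags($html));
echo $text, "\n";
```

For a single node, reading $node->textContent gives a similar result without the round trip through a string.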

It may also be worth looking at the childNodes property of DOMNode: http://www.php.net/manual/en/class.domnode.php#domnode.props.childnodes
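
As a rough sketch of how childNodes could be used for the leaf-text problem in the question: walk the tree recursively, treat a, span and strong as inline tags, and print the text of every element that has no other element children. The names INLINE_TAGS, hasBlockChild and collectLeafText are hypothetical, and the sample is a cleaned-up, well-formed version of the question's markup. libxml repairs malformed input in its own way, so results on the original messy HTML may differ.

```php
<?php
// Tags treated as inline: their text is merged into the parent element.
const INLINE_TAGS = ['a', 'span', 'strong'];

// True if the node has at least one direct element child that is not inline.
function hasBlockChild(DOMNode $node): bool
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement
            && !in_array(strtolower($child->nodeName), INLINE_TAGS, true)) {
            return true;
        }
    }
    return false;
}

// Collapse runs of whitespace and trim the ends.
function cleanText(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Append the text of every leaf-like element below $node to $out.
function collectLeafText(DOMNode $node, array &$out): void
{
    if (!hasBlockChild($node)) {
        // Leaf-like element: take all of its text, inline tags included.
        $text = cleanText($node->textContent);
        if ($text !== '') {
            $out[] = $text;
        }
        return;
    }
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            collectLeafText($child, $out);
        } elseif ($child instanceof DOMText) {
            // Loose text next to block elements, such as "xyz" in the sample.
            $text = cleanText($child->textContent);
            if ($text !== '') {
                $out[] = $text;
            }
        }
    }
}

$html = <<<'HTML'
<html><body>
<div class="this_is_our_div_of_interest">
  <div>
    <div>
      <p>Some text</p>
      <div>Some <a href='#'>more</a> text</div>
      <p>And <span><strong>another</strong></span> paragraph</p>
    </div>
    <p>Yay</p>
  </div>
  <div>
    <h4>abcd</h4>
    xyz
  </div>
</div>
<div class="we_do_not_want_those_divs"><p>This text is not important to us</p></div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$root = $xpath->query("//div[@class='this_is_our_div_of_interest']")->item(0);

$lines = [];
collectLeafText($root, $lines);
echo implode("\n", $lines), "\n";
```

On this well-formed sample the sketch prints the six lines listed in the question. More tags can be added to INLINE_TAGS as needed.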
