php - Scraping with PHP + SimpleXML... I can grab images but not raw text?

Question

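I'm trying to grab a specific piece of raw text from a website. Using this site and other resources, I learned how to grab specific images with SimpleXML and XPath.

However, the same approach doesn't seem to work for grabbing raw text. Here is what currently isn't working.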

// first I set the xpath of the div that contains the text I want
$xpath = '//*[@id="storyCommentCountNumber"]';

// then I create a new DOM Document
$html = new DOMDocument();

// then I fetch the file and parse it (@ suppresses warnings).
@$html->loadHTMLFile($url);

// then convert DOM to SimpleXML
$xml = simplexml_import_dom($html);   

// run an XPath query on the div I want using the previously set xpath
$commcount = $xml->xpath($xpath);
print_r($commcount);

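Now, when I was grabbing images, the $commcount object would return an array with the image source somewhere inside it.

In this case, I want the object to return the raw text contained in the "storyCommentCountNumber" div. But that text doesn't seem to be in the object, only the name of the div.

What am I doing wrong? It looks to me like this approach only works for grabbing HTML elements and the bits inside them, not raw text. How do I get the text inside that div?

Thanks!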

score 2 · Accepted Answer

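One thing to be aware of: when you use print_r or var_dump on a SimpleXML object, you won't see the object's text (or, sometimes, its attributes). So to see everything, you should use $variable->asXML() to output the complete XML string.

To get the text, you need to cast the SimpleXML object to a string. This automatically pulls out the element's inner text.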

 /* remember $commcount is always an array from the xpath */
 foreach($commcount as $str)
 {
     echo (string)$str;
 }

Hope that gives you a start.
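
To make the difference concrete, here is a minimal, self-contained sketch. The $markup string is a made-up stand-in for the fetched page (the real HTML isn't shown in the question), and it uses loadHTML() on a string instead of loadHTMLFile($url) so it runs without a network connection.

```php
<?php
// Hypothetical markup standing in for the real page, which is not shown in the question.
$markup = '<html><body><div id="storyCommentCountNumber">42 comments</div></body></html>';

$html = new DOMDocument();
// loadHTML() parses a string; the question uses loadHTMLFile($url) instead.
@$html->loadHTML($markup);

// Convert DOM to SimpleXML and run the same XPath query as in the question.
$xml = simplexml_import_dom($html);
$commcount = $xml->xpath('//*[@id="storyCommentCountNumber"]');

$text = '';
$elementMarkup = '';
if (!empty($commcount)) {
    // Casting the element to string yields its text content.
    $text = trim((string) $commcount[0]);
    // asXML() yields the full markup of the matched element, handy for debugging.
    $elementMarkup = $commcount[0]->asXML();
}

echo $text, PHP_EOL;
echo $elementMarkup, PHP_EOL;
```

Checking that the array is non-empty before indexing into it avoids a notice when the XPath matches nothing.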

score 1 · Accepted Answer

I know you're trying to use SimpleXML, but I think it would be easier to grab the raw text with a regular expression.
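
A rough sketch of what that could look like, assuming the div has no nested div inside it and the id attribute is written with double quotes. Both the $markup string and the pattern are illustrative guesses, not taken from the real page, and a regex like this tends to break as soon as the markup changes.

```php
<?php
// Hypothetical markup; the real page's HTML is not shown in the question.
$markup = '<div id="storyCommentCountNumber">42 comments</div>';

// Lazily capture everything between the opening tag with that id and the next </div>.
// Assumes no nested <div> inside the element.
$pattern = '/<div[^>]*\bid="storyCommentCountNumber"[^>]*>(.*?)<\/div>/s';

$text = null;
if (preg_match($pattern, $markup, $matches) === 1) {
    // strip_tags() drops any inline markup left inside the captured content.
    $text = trim(strip_tags($matches[1]));
}

echo $text, PHP_EOL;
```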

score 1 · Accepted Answer

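Could you include a sample of the HTML (perhaps with a few lines before and after the element you're selecting) and the output of print_r()?

You could try the following and see if it helps: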

if ( count($commcount) > 0 ) {
    $divContent = $commcount[0]->asXml();
    print $divContent;
}

score 0 · Accepted Answer

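The raw text inside the div is not part of the div element itself; it belongs to the div element's first child node. There should be a text node inside the div that contains the data you're looking for.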
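
One way to see that text node directly is to query it with DOMXPath instead of SimpleXML. This is only a sketch: the $markup string is invented for illustration, with a nested span added to show the difference between the first text node and the element's full text content.

```php
<?php
// Hypothetical markup; the nested <span> is there only to illustrate the difference.
$markup = '<html><body><div id="storyCommentCountNumber">42 <span>comments</span></div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

// text() selects the text nodes that are direct children of the div.
$textNodes = $xpath->query('//*[@id="storyCommentCountNumber"]/text()');
$firstText = $textNodes->item(0);   // a DOMText node
$ownText = trim($firstText->nodeValue);

// textContent on the element concatenates the text of all its descendants.
$div = $xpath->query('//*[@id="storyCommentCountNumber"]')->item(0);
$allText = trim($div->textContent);

echo $ownText, PHP_EOL;
echo $allText, PHP_EOL;
```

If the div only ever contains plain text, the two values are the same and textContent is the simpler choice.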

score 0 · Accepted Answer

Try checking this page.

:)

answered 2009-01-01T01:12:26.407
