I'm using Symfony, Goutte, and DomCrawler to scrape a page. Unfortunately, this page has many old-fashioned tables of data, with no IDs, classes, or other identifying attributes. So I'm trying to find a table by parsing the source code I get back from the request, but I can't seem to access any information.

I think when I try to filter it, it only filters the first node, and that's not where my desired data is, so it returns nothing.

So I have a $crawler object, and I've tried to loop through it with the following to get what I want:

$title = $crawler->filterXPath('//td[. = "Title"]/following-sibling::td[1]')->each(function (Crawler $node, $i) {
    return $node->text();
});

I'm not sure what Crawler $node is; I just got it from the example on the web page. Perhaps if I can get this working, it will loop through each node in the $crawler object and find what I'm actually looking for.

Here's an example of the page:

<table> 
<tr>
    <td>Title</td>
    <td>The Harsh Face of Mother Nature</td>
   <td>The Harsh Face of Mother Nature</td>
</tr>
.
.
.
</table>

And this is just one table, there are many tables and a huge sloppy mess outside of this one. Any ideas?

(Note: earlier I was able to apply a filter to the $crawler object for some information I needed, then I called serialize() on the result and finally had a string, which made sense. But I cannot get a string at all anymore, and I don't know why.)

2 Answers

According to its description, the DomCrawler html() function does not dump the whole HTML:

http://api.symfony.com/2.6/Symfony/Component/DomCrawler/Crawler.html#method_html

It only returns the first node, which is what it did in your case.

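You can use http://php.net/manual/en/domdocument.savehtml.php instead, because a DomCrawler is a set of nodes held in an SplObjectStorage, and each node gives you access to its owner document.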

$html = $crawler->getNode(0)->ownerDocument->saveHTML();
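
To illustrate the difference between dumping the whole document and dumping a single node, here is a minimal sketch that uses only PHP's built-in DOM extension, without Symfony. The markup is modeled on the question, and the second table and all variable names are made up for illustration.

```php
<?php
// Sample markup modeled on the question; the second table is hypothetical.
$markup = '<html><body>'
    . '<table><tr><td>Title</td><td>The Harsh Face of Mother Nature</td></tr></table>'
    . '<table><tr><td>Year</td><td>unknown</td></tr></table>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Stand-in for $crawler->getNode(0): the first matched node.
$firstTable = $doc->getElementsByTagName('table')->item(0);

// Any node can reach its document; saveHTML() with no argument dumps all of it.
$whole = $firstTable->ownerDocument->saveHTML();

// Passing a node restricts the dump to that node's subtree.
$onlyFirst = $doc->saveHTML($firstTable);
```

Here $whole contains both tables, while $onlyFirst contains only the first one.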
answered 2015-03-26T02:56:40.220

If you look at the source code of Crawler::html(), you will see that it does the following:

$html = '';
foreach ($this->getNode(0)->childNodes as $child) {
    $html .= $child->ownerDocument->saveHTML($child);
}
return $html;
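
Since html() only concatenates the children of the first matched node, it may be simpler to select the wanted cell directly rather than dump markup. The following is a rough sketch that runs the XPath expression from the question through PHP's built-in DOMXPath, assuming the table looks like the sample in the question; the variable names are hypothetical.

```php
<?php
// Sample row taken from the question.
$markup = '<table><tr><td>Title</td>'
    . '<td>The Harsh Face of Mother Nature</td>'
    . '<td>The Harsh Face of Mother Nature</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Collect the text of the cell that immediately follows each "Title" cell.
$titles = [];
foreach ($xpath->query('//td[. = "Title"]/following-sibling::td[1]') as $cell) {
    $titles[] = trim($cell->textContent);
}
```

The result is an array with one entry per "Title" cell found anywhere in the document, which is the same shape that the each() call in the question returns.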
answered 2015-04-02T18:05:27.247