php - Simple HTML DOM PHP web crawler does not follow links from the initially crawled pages

Translated from: https://stackoverflow.com/questions/19484901 2013-10-21T01:22:00.810

Viewed 698 times

I am building a PHP web crawler with Simple HTML DOM. Here is my code:

<?php

include_once('simplehtmldom/simple_html_dom.php');


$seeds = [
    'http://www.google.com/?q=web+development#q=web+development',
    'http://www.google.com/?q=art#q=art'
];

// Web crawl
function crawl($seeds) {
    foreach($seeds as $key) {
        $html = new simple_html_dom();
        $html->load_file($key);
        foreach ($html->find('a') as $link) {
            array_push($seeds, $link->href);
        }
    }
    $seeds = array_unique($seeds);
    print_r($seeds);
}

?>

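The string simplehtmldom/simple_html_dom.php is the path to Simple HTML DOM. The problem is that it only crawls the 2 URLs that are initially in the $seeds array ('http://www.google.com/?q=web+development#q=web+development', 'http://www.google.com/?q=art#q=art'). However, I want it to crawl every URL that the second foreach loop pushes onto the array. How can I fix this?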
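A minimal sketch of one possible fix, under stated assumptions. In PHP 7 and later, a by-value foreach iterates over a copy of the array, so URLs pushed inside the loop are never visited; the code as posted also never calls crawl(). The sketch replaces the outer foreach with an explicit queue and a page limit. The names crawlQueue, simpleHtmlDomLinks, $fetchLinks and $maxPages are hypothetical and introduced only for illustration; file_get_html(), find() and clear() are the Simple HTML DOM calls, and the library is assumed to be included before simpleHtmlDomLinks is used.

```php
<?php
// Hypothetical helper: breadth-first crawl driven by an explicit queue.
// $fetchLinks is any callable that takes a URL and returns an array of hrefs.
function crawlQueue(array $seeds, callable $fetchLinks, int $maxPages = 50): array
{
    $queue = array_values(array_unique($seeds)); // URLs still waiting to be fetched
    $seen  = array_fill_keys($queue, true);      // every URL discovered so far

    $fetched = 0;
    while ($queue !== [] && $fetched < $maxPages) {
        $url = array_shift($queue);
        $fetched++;

        foreach ($fetchLinks($url) as $href) {
            // Skip missing hrefs and URLs that were already discovered.
            if (!is_string($href) || $href === '' || isset($seen[$href])) {
                continue;
            }
            $seen[$href] = true;
            $queue[] = $href; // newly found URLs are fetched on a later iteration
        }
    }

    // Array keys that look like integers are cast by PHP, so cast back to strings.
    return array_map('strval', array_keys($seen));
}

// Hypothetical adapter around Simple HTML DOM (assumes the library is included).
function simpleHtmlDomLinks(string $url): array
{
    $html = file_get_html($url); // false when the page cannot be loaded
    if (!$html) {
        return [];
    }
    $links = [];
    foreach ($html->find('a') as $link) {
        $links[] = $link->href;
    }
    $html->clear(); // release the parsed DOM before fetching the next page
    return $links;
}
```

With the library included, a call such as print_r(crawlQueue($seeds, 'simpleHtmlDomLinks', 20)); would fetch at most 20 pages. The sketch does not resolve relative hrefs into absolute URLs, so a real crawler would need that step before queuing a link.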

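Finally, what is the best way to handle the ever-growing $seeds array? The crawler will keep running indefinitely, so I want to keep track of all the URLs. Should I write them to a file, or is my best option simply to stop it after a certain amount of time (and what is the best way to do that)? I need to be able to use the array in parallel, either from another PHP file running on the same server or from the same PHP file.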
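As a rough illustration of the two options mentioned here, the following sketch stores the URL list in a JSON file guarded by file locks, so a second script on the same server can read it, and checks a time budget that a crawl loop could use as its stop condition. The function names saveUrls, loadUrls and timeBudgetExceeded, and the JSON file format, are assumptions made for this example and are not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: write the de-duplicated URL list to a JSON file.
function saveUrls(string $path, array $urls): void
{
    $json = json_encode(array_values(array_unique($urls)), JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        throw new RuntimeException('Could not encode the URL list as JSON');
    }
    // LOCK_EX makes writers and lock-aware readers take turns on the file.
    if (file_put_contents($path, $json, LOCK_EX) === false) {
        throw new RuntimeException("Could not write $path");
    }
}

// Hypothetical helper: read the URL list back, e.g. from another PHP script.
function loadUrls(string $path): array
{
    if (!is_file($path)) {
        return [];
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return [];
    }
    flock($handle, LOCK_SH); // shared lock: wait for any writer to finish
    $raw = stream_get_contents($handle);
    flock($handle, LOCK_UN);
    fclose($handle);

    $data = json_decode($raw === false ? '' : $raw, true);
    // Keep only string entries; anything else means the file was not written by saveUrls().
    return is_array($data) ? array_values(array_filter($data, 'is_string')) : [];
}

// Hypothetical helper: true once $maxSeconds have passed since $startedAt.
function timeBudgetExceeded(float $startedAt, float $maxSeconds): bool
{
    return (microtime(true) - $startedAt) >= $maxSeconds;
}
```

A crawl loop could record $startedAt = microtime(true) before it begins, call saveUrls() every few pages, and break out once timeBudgetExceeded($startedAt, 60.0) returns true. For heavy parallel access a database may be a better fit than a single file, but that is beyond this sketch.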
