php - PHP website crawler data extraction: multi-loop 404 error

Question

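I'm looking to crawl several gig-listing websites to compile an ultimate listings guide, with links back to the original sites.

Many of these sites have no API, so I have to use a fairly crude PHP script to extract the data I need (e.g. date, venue, country).

Most sites have a gig directory that is fairly easy to work with, but some require you to enter information manually before they show you "relevant" gigs.

To get around this, I created a loop based on: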

page.php?id=$counter+1

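It finds the last gig inserted into the database and carries on to fetch the next 100 or so.

But this only works if the gig IDs on the site continue in exact numerical order, which of course they don't, because of cancellations and so on.

This leaves me with the lovely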

Warning: file_get_contents(http://www.domain.com/show/page.php?id=123456) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in...

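How can I create a loop that skips these errors and carries on, rather than just sitting on them?

Here is the whole code (currently limited to +5 for testing)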

include_once('simple_html_dom.php');

$cntqry = mysql_query("SELECT * FROM `gigs_tbl` ORDER BY `counter` DESC LIMIT 1");
$cntnum = mysql_num_rows($cntqry);
if($cntnum!=0)
{
    $cntget = mysql_fetch_assoc($cntqry);
    $start = $cntget['counter'];
}
else {
    $start = 10767799;
}

$counter = 0;
$limit = $start +5;

for($start; $start < $limit; $start++) {
    $counter = $start + 1;
    $target_url = "http://www.domain.com/show/page.php?id=$counter";
    $html = new simple_html_dom();
    $html->load_file($target_url);
    foreach($html->find('div[class=vevent]') as $showrow){
        $artist = strip_tags($showrow->find('h2',0));
        $genre = strip_tags($html->find('span[class=genre]',0));
        $venue = strip_tags($showrow->find('span[class=location]',0));
        $street = strip_tags($html->find('span[itemprop=streetAddress]',0));
        $locality = strip_tags($html->find('span[itemprop=addressLocality]',0));
        $postcode = strip_tags($html->find('span[itemprop=postalCode]',0));
        $country = strip_tags($html->find('span[itemprop=addressRegion]',0));
        $originalDate = strip_tags($html->find('meta[itemprop=startDate]',0)->content);
        $newDate = date("U", strtotime($originalDate));
        // INSERT
        mysql_query("INSERT INTO `gigs_tbl` VALUES('','$counter','$newDate','$venue','$street','$locality','$postcode','$country','$genre','$artist','reverbnation')");
    }
}

A virtual high ten to anyone who can guess which site is causing this ;)

score 0 · Accepted Answer

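find() returns nothing usable when there is no match (an empty array, or NULL when you pass an index), so one way to do what you want is to take advantage of that :)

Since you didn't provide a real link, here is an example of how: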

$start = 'u';

for($start; $start < 'x'; $start++) {

    // The only correct url is => http://sourceforge.net/p/mingw/bugs/
    $target_url = "http://sourceforge.net/p/ming".$start."/bugs/";

    echo "<br/> Link: $target_url";

    // @ suppresses the warning when the page doesn't exist
    $data_string = @file_get_contents($target_url);

    $html = new simple_html_dom();
    // Load HTML from a string
    $html->load($data_string);

    // find() returns an empty (falsy) result if nothing was found
    $search_elements = $html->find('#nav_menu_holder h1');

    if($search_elements) {
        echo "<br/> Page FOUND. Title => " . $search_elements[0]->plaintext;
    }
    else {
        echo "<br/> Page NOT FOUND !!";
    }

    echo "<hr>";

    // Clear DOM object
    $html->clear(); 
    unset($html);
}
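
Applied to the loop from the question, the same idea might look like the sketch below. It relies on file_get_contents() returning false when the request fails, and uses `continue` to move on to the next ID. The names `fetch_page` and `crawl_ids` are hypothetical helpers introduced here for illustration, and the URL is the placeholder from the question; the fetcher is passed in as a callable only so the loop can be tried without touching the network.

```php
<?php
// Hypothetical helper: returns the page body, or null when the request fails (e.g. a 404).
function fetch_page(string $url): ?string
{
    // @ hides the warning; a failed request still yields false.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Hypothetical helper: tries $count consecutive IDs after $start and skips the missing ones.
function crawl_ids(int $start, int $count, callable $fetch): array
{
    $pages = [];
    for ($id = $start + 1; $id <= $start + $count; $id++) {
        $body = $fetch("http://www.domain.com/show/page.php?id=$id");
        if ($body === null) {
            continue; // missing or cancelled gig: skip it and carry on
        }
        $pages[$id] = $body; // parse with simple_html_dom and insert into the database here
    }
    return $pages;
}
```

In real use the call would be something like `crawl_ids($start, 100, 'fetch_page')`, with each returned body handed to `$html->load()` as in the example above.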

PHP-Fiddle demo
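
If you would rather see the actual status code than suppress the warning, a possible variation is to set the `ignore_errors` HTTP context option and read the status line PHP stores in `$http_response_header`. This is only a sketch: `http_status_from_headers` and `fetch_with_status` are hypothetical names, and it assumes a PHP version where `$http_response_header` is still populated after an HTTP request.

```php
<?php
// Hypothetical helper: extracts the status code from lines such as "HTTP/1.1 404 Not Found".
// With redirects there are several status lines; the last one wins.
function http_status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: returns [status code or null, body or null] for a URL.
function fetch_with_status(string $url): array
{
    // ignore_errors makes PHP return the body even for 4xx/5xx responses.
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $context);
    // PHP fills $http_response_header in the local scope after an HTTP request.
    $status = isset($http_response_header)
        ? http_status_from_headers($http_response_header)
        : null;
    return [$status, $body === false ? null : $body];
}
```

Inside the loop you could then skip any ID whose status is not 200, which also distinguishes a missing gig from a server error.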
