(By the way, I am scraping this content with permission from the site in question.)

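I have a very simple web scraper that works fine when I load all the links by hand. But when I try to load them through JSON and variables (so that I can do a lot of scraping with one script, and make the process more modular by just adding more links to the JSON), it runs in an infinite loop.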

(The page has been loading for about 15 minutes now.)

Here is my JSON. There is only one store in it for testing purposes, but there will be about 15 more.

[
   {
      "store":"Incu Men",
      "cat":"Accessories",
      "general_cat":"Accessories",
      "spec_cat":"accessories",
      "url":"http://www.incuclothing.com/shop-men/accessories/",
      "baseurl":"http://www.incuclothing.com",
      "next_select":"a.next",
      "prod_name_select":".infobox .fn",
      "label_name_select":".infobox .brand",
      "desc_select":".infobox .description",
      "price_select":"#price",
      "mainImg_select":"",
      "more_imgs":".product-images",
      "product_url":".hproduct .photo-link"
   }
]

Here is the PHP scraper code:

<?php
//Set infinite time limit
set_time_limit (0);
// Include simple html dom
include('simple_html_dom.php');
// Defining the basic cURL function
function curl($url) {
  $ch = curl_init();
    // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url);
    // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    // Setting cURL's option to return the webpage data
    $data = curl_exec($ch);
    // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);
    // Closing cURL
    return $data;
    // Returning the data from the function
}

function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();

    while($catURL) {
        echo "Indexing: $url" . PHP_EOL;
        $html = str_get_html(curl($catURL));

        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }

        $next = $html->find($next_select, 0);
        $url = $next ? $baseURL . $next->href : null;

        echo "Results: $next" . PHP_EOL;
    }

    return $urls;
}

$string     = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string,true);

foreach ($json_array as $value){

    $baseURL = $value['baseurl'];
    $catURL = $value['url'];
    $store = $value['store'];
    $general_cat = $value['general_cat'];
    $spec_cat = $value['spec_cat'];
    $next_select = $value['next_select'];
    $prod_name = $value['prod_name_select'];
    $label_name = $value['label_name_select'];
    $description = $value['desc_select'];
    $price = $value['price_select'];
    $prodURL = $value['product_url'];

    if (!is_null($value['mainImg_select'])){
        $mainImg = $value['mainImg_select'];
    }
    $more_imgs = $value['more_imgs'];



    $allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);

}

?>

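Any ideas why the script runs indefinitely and does not return anything, stop, or print anything to the screen? I have just been letting it run until it stops. When I do this manually it takes only a minute or so, sometimes less, so I am sure the problem is with my variables or the JSON, but I can't for the life of me see what it is.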

Could anyone take a quick look and point me in the right direction?

1 Answer

There is a problem with your while($catURL) loop. What are you trying to do with it? Also, you can use flush() to force output to be displayed in the browser.
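
One way to read the hint about the loop, offered as a sketch rather than a definitive fix: in the question's getLinks(), the next-page URL is assigned to $url, while the loop condition tests $catURL, which never changes. The condition therefore stays true and the same page is fetched again and again. The version below reassigns $catURL instead. The names getLinksFixed, $loadPage and $maxPages and the example.test URLs are hypothetical and not from the question; $loadPage stands in for str_get_html(curl($url)) plus the two find() calls, so that the example runs without network access or simple_html_dom.

```php
<?php
// Sketch of a corrected pagination loop. $loadPage is a hypothetical stand-in
// for str_get_html(curl($url)) plus the two find() calls in the question; it
// must return array('links' => product hrefs, 'next' => href or null).
function getLinksFixed($catURL, $baseURL, callable $loadPage, $maxPages = 50)
{
    $urls = array();
    $visited = array();

    // Stop when there is no next page, when a page repeats, or when the
    // (assumed) safety limit $maxPages is reached.
    while ($catURL && !isset($visited[$catURL]) && count($visited) < $maxPages) {
        $visited[$catURL] = true;
        echo "Indexing: $catURL" . PHP_EOL;
        flush(); // ask PHP to send progress output now instead of at the end

        $page = $loadPage($catURL);

        foreach ($page['links'] as $href) {
            $urls[] = $baseURL . $href;
        }

        // The key change: reassign $catURL itself (the question assigned to
        // $url), so the loop condition becomes false after the last page.
        $catURL = $page['next'] ? $baseURL . $page['next'] : null;
    }

    return $urls;
}

// Usage with two fake pages instead of real HTTP requests.
$pages = array(
    'http://example.test/shop/' => array(
        'links' => array('/p/1', '/p/2'),
        'next'  => '/shop/?page=2',
    ),
    'http://example.test/shop/?page=2' => array(
        'links' => array('/p/3'),
        'next'  => null,
    ),
);
$loader = function ($url) use ($pages) {
    return $pages[$url];
};

$allLinks = getLinksFixed('http://example.test/shop/', 'http://example.test', $loader);
print_r($allLinks);
```

In the question's own script, the equivalent change would be to replace $url with $catURL on the "Indexing" line and on the line that computes the next page. Note that flush() may still show nothing until the script ends if PHP output buffering or the web server buffers the response.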

answered 2013-05-16T06:19:13.907