php - Combining DOM queries and file_get_contents

Question

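Over the past few days I have researched this quite a bit and found answers for each individual feature online, so thank you.

I now have 3 separate pieces of code that each fetch the contents of a web page (it will be an e-commerce product page, a review page, or some other page with a product on it) to get different information, but I assume that fetching the content 3 times is very inefficient!

The 3 pieces of code do the following 3 things: 1) get the page title, 2) get all the images on the page, 3) look for a number to get (hopefully) the price of the item on that page.

I would appreciate help combining these so that the file contents only need to be fetched once. Here is my current code. First: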

function getDetails($Url){
    $str = file_get_contents($Url);
    if(strlen($str)>0){
        //preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
        //The above didnt work well enough (for getting Title when <title id=... > etc) so used the DOM below



            preg_match("/(\£[0-9]+(\.[0-9]{2})?)/",$str,$price); //£ for GBP
            $priceRes = preg_replace("/[^0-9,.]/", "", $price[0]);

            //$pageDeatil[0]=$title;
            $pageDeatil[1]=$priceRes;
            return $pageDeatil;

    }
}

$pageDeatil = getDetails("$newItem_URL");
//$itemTitle = $pageDeatil[0];
$itemPrice = $pageDeatil[1];

Second:

$doc = new DOMDocument();
@$doc->loadHTMLFile("$newItem_URL");
$xpath = new DOMXPath($doc);
$itemTitle = $xpath->query('//title')->item(0)->nodeValue."\n";

Third:

include('../../code/simplehtmldom/simple_html_dom.php');
include('../../code/url_to_absolute/url_to_absolute.php');

$html = file_get_html($newItem_URL);
foreach($html->find('img') as $e){

$imgURL =  url_to_absolute($url, $e->src);
    //More code here

}

I can't seem to fetch the file once and then reuse that same content for the rest. Any help would be appreciated! Thanks in advance.

score 1 · Accepted Answer

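I prefer to use cURL when scraping websites. Your price-fetching code doesn't seem particularly effective either; I think you should use XPath there as well. The function can return an object with the price, the title, and an array of images.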

function get_details($url) {
   $ch = curl_init($url);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
   curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

   $html = curl_exec($ch);

   $dom = new DOMDocument();
   @$dom->loadHTML($html);
   $xpath = new DOMXPath($dom);

   $product         = new stdClass;
   $product->title  = $xpath->query('//title')->item(0)->nodeValue;
   $product->price  = null; // placeholder: price query goes here
   $product->images = array();

   foreach($xpath->query('//img') as $image) {
      $product->images[] = $image->getAttribute('src');
   }

   return $product;
}
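
To illustrate the "fetch once, parse many times" idea, here is a minimal sketch that separates parsing from downloading. The helper name `extract_details` is hypothetical (it is not part of the answer above), and the price step simply reuses the regex approach from the question on the same string rather than a site-specific XPath query, since the right XPath depends on each page's markup. It assumes the page is UTF-8 and writes the price with a literal £ sign.

```php
<?php
// Hypothetical helper: parse an already-downloaded HTML string once and
// pull the title, image URLs, and first GBP-looking price out of it.
function extract_details(string $html): stdClass
{
    $product         = new stdClass;
    $product->title  = null;
    $product->price  = null;
    $product->images = array();

    if ($html === '') {
        return $product; // nothing to parse
    }

    // Collect libxml warnings internally instead of silencing them with @;
    // real-world pages are rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Title: item(0) is null when the page has no <title> element.
    $titleNode = $xpath->query('//title')->item(0);
    if ($titleNode !== null) {
        $product->title = trim($titleNode->nodeValue);
    }

    // Images: only <img> elements that actually carry a src attribute.
    foreach ($xpath->query('//img[@src]') as $image) {
        $product->images[] = $image->getAttribute('src');
    }

    // Price: first "£12.99"-style number in the raw markup (assumption:
    // UTF-8 page, literal £ sign; this can also match text inside scripts).
    if (preg_match('/£\s*([0-9]+(?:\.[0-9]{2})?)/', $html, $match)) {
        $product->price = $match[1];
    }

    return $product;
}

// Usage sketch: download once (file_get_contents or cURL), then parse.
// $html = file_get_contents($newItem_URL);
// if ($html !== false) {
//     $product = extract_details($html);
// }
```

Keeping the download outside the function means the same string can come from `file_get_contents` or from `curl_exec`, and the parsing can be tried on a saved copy of a page without hitting the network.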
