php - Screen scraping with PHP and fopen

Question

Possible duplicate:
Screen scraping in PHP using file_get_contents

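Can anyone help me? I'm trying to scrape hotel reviews from LateRooms.com. Please don't tell me this is a bad idea, because I already have permission from the member.

My code: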

<?php
header('content-type: text/plain');

$contents = file_get_contents('http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx');
$contents = preg_replace('/\s{1,}/', ' ', $contents); // collapse runs of whitespace into one space

print $contents . "\n";

$records = preg_split('/<div id="review/', $contents);

for ($ix = 1; $ix < count($records); $ix++) {

$tmp = $records[$ix];

preg_match('/id="review"/', $tmp, $match_reviews);

print_r($match_reviews);

exit();

}
?>

This works quite well; the only problem is that it pulls in the entire page source and does not match the div with id 'review'.

Thanks in advance.

score 3 · Accepted Answer

<?php
// Fetch a URL with cURL and return the response body as a string (false on failure).
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);          // leave response headers out of the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow HTTP redirects

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

// Return the inner HTML of a DOM element by serializing each of its child nodes.
function DOMinnerHTML($element) {
    $innerHTML = "";
    foreach ($element->childNodes as $child) {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}

$url  = 'http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx';
$html = file_get_contents_curl($url);

// Parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed markup
$div_elements = $doc->getElementsByTagName('div');

$reviews = array(); // initialize so print_r() works even when nothing matches
foreach ($div_elements as $div_element) {
    // Exact match on the full class attribute value
    if ($div_element->getAttribute('class') == 'review newReview') {
        $reviews[] = DOMinnerHTML($div_element);
    }
}

print_r($reviews);

Try this; it will return all of the reviews. You can refine the content as needed.
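One possible way to refine the result is to select the review divs with an XPath query and keep only their text. The following is a minimal sketch, not a tested scraper for the live site: the sample markup is hypothetical, `extract_review_texts` is a name made up for illustration, and the `newReview` class is taken from the code above. It runs against an inline string so it can be tried without a network request.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$html = <<<HTML
<html><body>
<div class="review newReview"><p>Great stay, friendly hosts.</p></div>
<div class="sidebar">Not a review</div>
<div class="review newReview"><p>Clean rooms,   good breakfast.</p></div>
</body></html>
HTML;

// Hypothetical helper: return the plain text of every div whose class list contains "newReview".
function extract_review_texts($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match "newReview" as a whole class token, regardless of other classes or their order.
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' newReview ')]";

    $texts = array();
    foreach ($xpath->query($query) as $node) {
        // textContent drops the tags; collapse runs of whitespace into single spaces.
        $texts[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
    }
    return $texts;
}

$reviews = extract_review_texts($html);
print_r($reviews);
```

Matching on a class token rather than the exact attribute string should keep working if the site adds or reorders classes on the review divs, though that is an assumption about how the markup might change.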
