html - Scraping a specific area of a website behind a secure login

Question

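I am trying to scrape some specific text from a website that is protected by a login. I have been following this curl tutorial: http://www.digeratimarketing.co.uk/2008/12/16/curl-page-scraping-script/

However, I cannot get it to work in my own curl code. Here is my curl script: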

$url = "http://aftabcurrency.com/login_script.php";
$cookie = 'cookies.txt';   // file used to store and resend session cookies
$timeout = 30;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the response instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow any redirect sent after login
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);

// Submit the login form fields as a POST request
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "user_name=user&user_password=pass&passcode=code");

$result = curl_exec($ch);
curl_close($ch);

// Capture the text between the CC3300 font color and the closing </font> tag
$source = $result;
if (preg_match("/(CC3300\">)(.*?)(<\/font>)/is", $source, $found)) {
    echo $found[2];
} else {
    echo "Text not found.";
}

For example, on aftabcurrency.com I only want to scrape the text "Our service is important!" (this text changes every day).

score 1 · Accepted Answer

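What I would do is "cut" the text out from between a start marker and an end marker. In the page source, the text begins right after the font color 613A75 and ends at the closing </font> tag. Here is a regex solution: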

$source = file_get_contents("http://aftabcurrency.com/index.php");
if (preg_match("/(613A75\">)(.*?)(<\/font>)/is", $source, $found)) {
    echo $found[2];
} else {
    echo "Text not found.";
}

If you want to do this with text inside the members area, add my code to yours and replace $source = file_get_contents... with $source = $result
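
One way to make that substitution easier is to wrap the regex in a small function that accepts any HTML string, so the same code works on the output of file_get_contents or on $result from curl_exec. This is only a minimal sketch: the function name extract_colored_text and the sample markup are made up for illustration, and the real page may be structured differently.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the text
// between `<color>">` and the next `</font>`, or null when nothing matches.
function extract_colored_text(string $html, string $color): ?string
{
    // preg_quote() escapes any regex metacharacters in the color value.
    $pattern = '/' . preg_quote($color, '/') . '">(.*?)<\/font>/is';
    if (preg_match($pattern, $html, $found)) {
        return trim($found[1]);
    }
    return null;
}

// Made-up sample markup standing in for $result returned by curl_exec().
$result = '<p><font color="#613A75">Sample daily text</font></p>';

$text = extract_colored_text($result, '613A75');
echo $text ?? 'Text not found.';
```

Returning null instead of echoing inside the function leaves it to the caller to decide what to do when the marker is missing, for example when the login failed and the response is the login page again.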

There are other ways to do this as well: DOMDocument with XPath, or the simple strpos / strstr / substr PHP functions.
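
The two alternatives could be sketched roughly as follows. The function names and the sample markup are assumptions made for illustration, the XPath variant needs PHP's DOM extension, and it assumes the color value contains no double quotes.

```php
<?php
// Variant 1 (sketch): DOMDocument + DOMXPath.
// Finds the first <font> element whose color attribute contains $color.
function extract_with_xpath(string $html, string $color): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//font[contains(@color, "' . $color . '")]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Variant 2 (sketch): plain string functions.
// Returns the text between the first $startMarker and the next $endMarker.
function extract_with_strpos(string $html, string $startMarker, string $endMarker): ?string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);          // skip past the marker itself
    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return trim(substr($html, $start, $end - $start));
}

// Made-up sample markup; in practice this would be $source or $result.
$html = '<p><font color="#613A75">Sample daily text</font></p>';

echo extract_with_xpath($html, '613A75') ?? 'Text not found.', "\n";
echo extract_with_strpos($html, '613A75">', '</font>') ?? 'Text not found.', "\n";
```

The XPath variant tends to cope better with changes in attribute order or whitespace, while the strpos variant has no dependencies but relies on the exact marker strings appearing in the page source.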
