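I have a script that logs in to another site and then downloads PDF files. Until now it has worked well on every site, but the new site I'm scraping behaves strangely: some of the downloaded files are only 1 KB (i.e. the download failed), while others are fine. Using the same download link in a browser brings up the "Do you want to save this file?" dialog, and the file saved that way is correct.
Here is my code (I've included the general curl options used throughout the scraping process, plus the final part where I try to download the files):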
//Initial connection to login page
$header[] = 'Host: www.domain.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'; //header name and value are separated by ": ", not "="
$header[] = 'Accept-Language: en-US,en;q=0.5';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Then several operations to login, grab the list of links to PDF download files (...)
//Loop through the array containing the url of the file to download and save it to a folder (writable)
curl_setopt($ch, CURLOPT_POST, false);
foreach ($foundBills as $key => $bill)
{
    curl_setopt($ch, CURLOPT_URL, $bill['url']);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $pdfFile = curl_exec($ch);
    $randomFileName = rand_string(20); //generates a 20 char long random string
    $newPDF = $userBillsRoot.$randomFileName.'.pdf';
    write_file($newPDF, $pdfFile, 'wb'); //using a Codeigniter function to save the file
}
Each of these files is under 1 MB. Any ideas? How can I see more detail about why it isn't working (a timeout, for example)? Thanks!
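For reference, here is a minimal sketch of the kind of per-download check I have in mind. It makes several assumptions: the helper name `describe_download` is hypothetical (it is not part of my script or of CodeIgniter), a good response is assumed to have status 200 and a Content-Type containing "pdf", and a valid file is assumed to begin with the usual `%PDF-` signature. A Content-Type mismatch would only be a hint, since some servers send PDFs as `application/octet-stream`.

```php
<?php
// Hypothetical helper: lists reasons why a downloaded body is probably not a PDF.
// $body  is the return value of curl_exec($ch) with CURLOPT_RETURNTRANSFER enabled,
// $info  is the array returned by curl_getinfo($ch),
// $error is the string returned by curl_error($ch).
function describe_download($body, array $info, $error = '')
{
    $problems = [];

    // curl_exec() returns false on transport-level failures such as timeouts.
    if ($body === false) {
        $problems[] = 'curl_exec failed: ' . $error;
        return $problems;
    }

    // Assumption: a successful download answers with HTTP 200.
    $status = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    if ($status !== 200) {
        $problems[] = 'unexpected HTTP status ' . $status;
    }

    // Assumption: the server labels the file with a Content-Type containing "pdf".
    $type = isset($info['content_type']) ? (string) $info['content_type'] : '';
    if (stripos($type, 'pdf') === false) {
        $problems[] = 'unexpected Content-Type "' . $type . '"';
    }

    // PDF files normally begin with the signature "%PDF-".
    if (strncmp($body, '%PDF-', 5) !== 0) {
        $problems[] = 'body does not start with %PDF-';
    }

    return $problems;
}

// Possible use inside the download loop (not executed here):
//   $pdfFile  = curl_exec($ch);
//   $problems = describe_download($pdfFile, curl_getinfo($ch), curl_error($ch));
//   if ($problems) {
//       error_log($bill['url'] . ': ' . implode('; ', $problems));
//   }
```

If the 1 KB files turn out to start with `<html` rather than `%PDF-`, opening one in a text editor should show what the server actually returned.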