php - Scraping a large amount of data with PHP DOM

Question

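I have to collect data from more than 8,000 pages with 25 records per page, which comes to over 200,000 records. The problem is that the server rejects my requests after a while. I am using simple_html_dom as the scraping library, although I have heard it is rather slow. Here is a sample of the data: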

<table>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data1</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data2</td>
</tr>
<tr>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data3</td>
<td width="50%" valign="top" style="font-size:12px;border-bottom:1px dashed #a2a2a2;">Data4</td>
</tr>
</table>

The PHP scraping script is:

<?php

$fileName = 'output.csv';

header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header('Content-Description: File Transfer');
header("Content-type: text/csv");
header("Content-Disposition: attachment; filename={$fileName}");
header("Expires: 0");
header("Pragma: public");

$fh = @fopen('php://output', 'w');


ini_set('max_execution_time', 300000000000);

include("simple_html_dom.php");

for ($i = 1; $i <= 8846; $i++) {

    scrapeThePage('url_to_scrape/?page=' . $i);
    if ($i % 2 == 0)
        sleep(10);

}

function scrapeThePage($page)
{

    global $theData;


    $html = new simple_html_dom();
    $html->load_file($page);

    foreach ($html->find('table tr') as $row) {
        $rowData = array();
        foreach ($row->find('td[style="font-size:12px;border-bottom:1px dashed #a2a2a2;"]') as $cell) {
            $rowData[] = $cell->innertext;

        }

        $theData[] = $rowData;
    }
}

foreach (array_filter($theData) as $fields) {
    fputcsv($fh, $fields);
}
fclose($fh);
exit();

?>

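As you can see, I added a 10-second sleep in the for loop (after every second page) so that I don't stress the server with requests. When the browser prompts me to download the CSV, the file contains the following lines: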

Warning: file_get_contents(url_to_scrape/?page=8846): failed to open stream: HTTP request failed! HTTP/1.0 500 Internal Server Error
Fatal error: Call to a member function find() on a non-object in D:\www\htdocs\ucmr\simple_html_dom.php on line 1113

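Page 8846 does exist; it is the last page the script requests. The page number in the error varies, so sometimes the error occurs at page 800, for example. Can someone tell me what I am doing wrong here? Any advice would be helpful.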

score 0 · Accepted Answer

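The fatal error is probably thrown because $html or $row is not an object; it has become null. You should always check that an object was created correctly before calling methods on it. $html->load_file($page); may also return false when loading the page fails, so check its return value before calling find().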
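A minimal sketch of that check, under stated assumptions: `fetchPage` is a hypothetical helper (not part of simple_html_dom), and the retry count, the delay, and the injectable `$fetcher` parameter are illustrative choices. It relies only on the fact that `file_get_contents` returns `false` when the request fails, as it did with the HTTP 500 response in the question.

```php
<?php
// Hypothetical helper: fetch a page body, retrying a few times on failure.
// Returns the body as a string, or null if every attempt failed.
function fetchPage(string $url, int $maxAttempts = 3, int $delaySeconds = 5, ?callable $fetcher = null): ?string
{
    // Default fetcher: file_get_contents returns false on failure
    // (for example when the server answers with HTTP 500).
    $fetcher = $fetcher ?? function (string $u) {
        return @file_get_contents($u);
    };

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetcher($url);

        // Only accept a non-empty string as a successful response.
        if (is_string($body) && $body !== '') {
            return $body;
        }

        // Wait a little longer after each failed attempt before retrying.
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds * $attempt);
        }
    }

    return null; // the caller must check for this before parsing
}
```

In the question's script, `scrapeThePage` could call `fetchPage($page)`, skip (and log) the page when the result is `null`, and only hand a non-null body to the parser, for instance through simple_html_dom's `str_get_html()`, whose own return value should also be checked.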

Also get familiar with instanceof; it can be very helpful at times.
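As an illustration of `instanceof` guards, here is a small sketch that parses table rows like the sample in the question. To keep it runnable without extra files it uses PHP's bundled DOM extension (`DOMDocument`) rather than simple_html_dom; `extractRows` is a hypothetical name, and the same guard pattern would apply to the objects simple_html_dom returns.

```php
<?php
// Hypothetical helper: extract table rows from an HTML string.
// Returns a list of rows, each row being a list of trimmed cell texts.
function extractRows(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html); // returns false on failure
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        return [];
    }

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $row) {
        // Guard: only call element methods on real element objects.
        if (!$row instanceof DOMElement) {
            continue;
        }

        $cells = [];
        foreach ($row->getElementsByTagName('td') as $cell) {
            if ($cell instanceof DOMElement) {
                $cells[] = trim($cell->textContent);
            }
        }

        // Skip rows without any data cells (for example header rows).
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }

    return $rows;
}
```

Because the function returns an empty array instead of null when nothing could be parsed, a caller can loop over the result without a further type check.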

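Another edit: your code has no data validation at all. Nowhere does it check for uninitialized variables, objects that failed to load, or method calls that failed. You should always include such checks in your code.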
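One way such validation could look for the CSV output step, as a sketch only: `writeRows` is a hypothetical helper, and rejecting a bad handle with an exception while silently skipping empty or malformed rows is an assumption about the desired behavior, not something the question specifies.

```php
<?php
// Hypothetical helper: write validated rows to an open stream as CSV.
// Returns the number of rows actually written.
function writeRows($handle, array $rows): int
{
    // Validate the stream first; a failed fopen() returns false, not a resource.
    if (!is_resource($handle)) {
        throw new InvalidArgumentException('Expected an open stream handle.');
    }

    $written = 0;
    foreach ($rows as $fields) {
        // Skip anything that is not a non-empty array of cell values.
        if (!is_array($fields) || $fields === []) {
            continue;
        }

        // fputcsv returns false on failure; count only successful writes.
        if (fputcsv($handle, $fields, ',', '"', '\\') !== false) {
            $written++;
        }
    }

    return $written;
}
```

Calling this once per scraped page with an explicitly initialized array, instead of accumulating everything in an uninitialized global, would also keep memory use flat over the roughly 8,800 pages.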
