php - Handling large datasets with PHP/Drupal

Question

I have a report page that deals with ~700k records from a database table. I can display this on a webpage using paging to break up the results. However, my export to PDF/CSV functions rely on processing the entire data set at once and I'm hitting my 256MB memory limit at around 250k rows.

I don't feel comfortable increasing the memory limit and I haven't got the ability to use MySQL's save into outfile to just serve a pre-generated CSV. However, I can't really see a way of serving up large data sets with Drupal using something like:

$form = array();
$table_headers = array();
$table_rows = array();
$data = db_query("a query to get the whole dataset");
while ($row = db_fetch_object($data)) {
    $table_rows[] = $row->some attribute;
}

$form['report'] = array('#value' => theme('table', $table_headers, $table_rows);
return $form;

Is there a way of getting around what is essentially appending to a giant array of arrays? At the moment I don't see how I can offer any meaningful report pages with Drupal due to this.

Thanks

score 4 · Accepted Answer

有了这么大的数据集，我会使用 Drupal 的 Batch API，它允许将时间密集型操作分成批次。这对用户来说也更好，因为它会给他们一个进度条，其中包含一些操作需要多长时间的指示。

通过打开一个临时文件开始批处理操作，然后在每个新批处理上附加新记录，直到完成。最后一页可以做最后的处理，将数据作为 cvs 交付或转换为 PDF。您可能还想添加一些清理后记。

http://api.drupal.org/api/group/batch/6

score 1 · Accepted Answer

如果您正在生成 PDF 或 CSV，则不应使用 Drupal 原生函数。在你的while循环中写入输出文件怎么样？这样，在给定时间只有一个结果集在内存中。

score 0 · Accepted Answer

目前，您将所有内容存储在数组中$table_rows。

当您从数据库中读取它时（例如每隔这么多行），您不能刷新至少部分报告以释放一些内存吗？我不明白为什么只能一次写入 csv。

score 0 · Accepted Answer

要记住的另一件事是，在 PHP5（5.3 之前）中，将数组分配给新变量或将其传递给函数会复制数组并且不会创建引用。您可能正在创建相同数据的许多副本，如果没有一个未设置或超出范围，则无法对它们进行垃圾收集以释放内存。在可能的情况下，使用引用对原始数组执行操作可以节省内存

function doSomething($arg){
  foreach($arg AS $var)
    // a new copy is created here internally: 3 copies of data exist
    $internal[] = doSomethingToValue($var);
  return $internal;
  // $arg goes out of scope and can be garbage collected: 2 copies exist
}
$var = array();
// a copy is passed to function: 2 copies of data exist
$var2 = doSomething($var);
// $var2 will be a reference to the same object in memory as $internal, 
//  so only 2 copies still exist

如果 $var 设置为函数的返回值，旧值可以被垃圾回收，但直到赋值之后，所以在短时间内仍然需要更多内存

function doSomething(&$arg){
  foreach($arg AS &$var)
    // operations are performed on original array data: 
    // only two copies of an array element exist at once, not the whole array
    $var = doSomethingToValue($var);  
  unset($var); // not needed here, but good practice in large functions
}
$var = array();
// a reference is passed to function: 1 copy of data exists
doSomething($var);

score 0 · Accepted Answer

我处理如此庞大的报告的方法是使用 php cli/Java/CPP/C#（即 CRONTAB）+ 使用 mysql 的无缓冲查询选项生成它们。
在磁盘上完成文件/报告创建后，您可以提供指向它的链接...

score 0 · Accepted Answer

您应该使用 pager_query 将分页包含在其中，并将结果分成每页 50-100 个。这应该有很大帮助。你说你想使用分页，但我没有在代码中看到它。

看看这个：http ://api.drupal.org/api/function/pager_query/6

score 0 · Accepted Answer

我对增加内存限制感到不舒服

增加内存限制并不意味着每个 php 进程都会使用该内存量。但是，您可以使用自定义内存限制执行 php 的 cli 版本 - 但这也不是正确的解决方案......

而且我无法使用 MySQL 的 save into outfile 来提供预先生成的 CSV

然后不要将其全部保存在数组中 - 当您从数据库中获取每一行时将其写入输出缓冲区（IIRC 整个结果集都在有限的 php 内存之外缓冲）。或者直接将其写入文件，然后在文件完成并关闭时进行重定向。

C。

php - Handling large datasets with PHP/Drupal

7 回答 7

Related

Reference