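I have a table schema that includes an int array column, and a custom aggregate function that sums the array contents element by element. In other words, given the following: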
CREATE TABLE foo (stuff INT[]);
INSERT INTO foo VALUES ('{1, 2, 3}');
INSERT INTO foo VALUES ('{4, 5, 6}');
I need a "sum" function that returns { 5, 7, 9 }. The working PL/pgSQL version is as follows:
CREATE OR REPLACE FUNCTION array_add(array1 int[], array2 int[]) RETURNS int[] AS $$
DECLARE
    result int[] := ARRAY[]::integer[];
    l int;
BEGIN
  ---
  --- First check if either input is NULL, and return the other if it is
  ---
  IF array1 IS NULL OR array1 = '{}' THEN
    RETURN array2;
  ELSEIF array2 IS NULL OR array2 = '{}' THEN
    RETURN array1;
  END IF;
  l := array_upper(array2, 1);
  SELECT array_agg(array1[i] + array2[i]) FROM generate_series(1, l) i INTO result;
  RETURN result;
END;
$$ LANGUAGE plpgsql;
Coupled with:
CREATE AGGREGATE sum (int[])
(
    sfunc = array_add,
    stype = int[]
);
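To make the intended semantics concrete, here is a minimal standalone sketch in plain C (no PostgreSQL headers) of what the aggregate is meant to compute: the state function is folded over the rows, one row at a time. The names `add_into` and `sum_rows` are hypothetical, and the sketch assumes every row has the same width and starts from an all-zero state, whereas the real aggregate starts from a NULL state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds src into acc, element by element. */
void add_into(int32_t *acc, const int32_t *src, size_t n)
{
    assert(acc != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        acc[i] += src[i];
}

/*
 * Hypothetical model of the aggregate: rows holds nrows arrays of
 * `width` ints stored back to back; out receives the element-wise sum.
 * Assumption: all rows have the same width.
 */
void sum_rows(const int32_t *rows, size_t nrows, size_t width, int32_t *out)
{
    assert(rows != NULL && out != NULL);
    for (size_t j = 0; j < width; j++)
        out[j] = 0;                      /* zero state instead of NULL */
    for (size_t r = 0; r < nrows; r++)
        add_into(out, rows + r * width, width);
}
```

Called with the two sample rows above (`{1, 2, 3}` and `{4, 5, 6}`, `nrows = 2`, `width = 3`), `out` ends up holding 5, 7, 9.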
With a data set of around 150,000 rows, SELECT SUM(stuff) takes over 15 seconds to complete.
I then rewrote this function in C, as follows:
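#include <postgres.h>
#include <fmgr.h>
#include <utils/array.h>
#include <utils/lsyscache.h>  // declares get_typlenbyvalalign()

// Required once per loadable module so the server can verify compatibility.
PG_MODULE_MAGIC;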
Datum array_add(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(array_add);
/**
 * Returns the sum of two int arrays.
 */
Datum
array_add(PG_FUNCTION_ARGS)
{
  // The formal PostgreSQL array objects:
  ArrayType *array1, *array2;
  // The array element types (should always be INT4OID):
  Oid arrayElementType1, arrayElementType2;
  // The array element type widths (should always be 4):
  int16 arrayElementTypeWidth1, arrayElementTypeWidth2;
  // The array element type "is passed by value" flags (not used, should always be true):
  bool arrayElementTypeByValue1, arrayElementTypeByValue2;
  // The array element type alignment codes (not used):
  char arrayElementTypeAlignmentCode1, arrayElementTypeAlignmentCode2;
  // The array contents, as PostgreSQL "datum" objects:
  Datum *arrayContent1, *arrayContent2;
  // List of "is null" flags for the array contents:
  bool *arrayNullFlags1, *arrayNullFlags2;
  // The size of each array:
  int arrayLength1, arrayLength2;
  Datum* sumContent;
  int i;
  ArrayType* resultArray;
  // Extract the PostgreSQL arrays from the parameters passed to this function call.
  array1 = PG_GETARG_ARRAYTYPE_P(0);
  array2 = PG_GETARG_ARRAYTYPE_P(1);
  // Determine the array element types.
  arrayElementType1 = ARR_ELEMTYPE(array1);
  get_typlenbyvalalign(arrayElementType1, &arrayElementTypeWidth1, &arrayElementTypeByValue1, &arrayElementTypeAlignmentCode1);
  arrayElementType2 = ARR_ELEMTYPE(array2);
  get_typlenbyvalalign(arrayElementType2, &arrayElementTypeWidth2, &arrayElementTypeByValue2, &arrayElementTypeAlignmentCode2);
  // Extract the array contents (as Datum objects).
  deconstruct_array(array1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1,
                    &arrayContent1, &arrayNullFlags1, &arrayLength1);
  deconstruct_array(array2, arrayElementType2, arrayElementTypeWidth2, arrayElementTypeByValue2, arrayElementTypeAlignmentCode2,
                    &arrayContent2, &arrayNullFlags2, &arrayLength2);
  // Create a new array of sum results (as Datum objects).
  sumContent = palloc(sizeof(Datum) * arrayLength1);
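  // Generate the sums: unwrap each Datum to an int32, add, and wrap the result again.
  for (i = 0; i < arrayLength1; i++)
  {
    sumContent[i] = Int32GetDatum(DatumGetInt32(arrayContent1[i]) + DatumGetInt32(arrayContent2[i]));
  }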
  // Wrap the sums in a new PostgreSQL array object.
  resultArray = construct_array(sumContent, arrayLength1, arrayElementType1, arrayElementTypeWidth1, arrayElementTypeByValue1, arrayElementTypeAlignmentCode1);
  // Return the final PostgreSQL array object.
  PG_RETURN_ARRAYTYPE_P(resultArray);
}
This version takes only 800 ms to complete, which is... much better.
(Converted to a standalone extension here: https://github.com/ringerc/scrapcode/tree/master/postgresql/array_sum)
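One caveat about the C function above: its loop is sized by the first array's length and reads the second array at the same indexes, so it relies on both arrays having the same length and containing no NULL elements, and it does not check for integer overflow. As a rough standalone sketch of what a more defensive element-wise add could look like (plain C, no PostgreSQL headers; `checked_array_add` and its signature are hypothetical and not part of the original code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: writes a[i] + b[i] into out[i].
 * Returns false, without guaranteeing the contents of out, when the
 * lengths differ or when any sum does not fit in a 32-bit integer.
 */
bool checked_array_add(const int32_t *a, size_t na,
                       const int32_t *b, size_t nb,
                       int32_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    if (na != nb)
        return false;                    /* mismatched lengths */

    for (size_t i = 0; i < na; i++)
    {
        /* Add in 64 bits so that overflow can be detected safely. */
        int64_t sum = (int64_t) a[i] + (int64_t) b[i];

        if (sum > INT32_MAX || sum < INT32_MIN)
            return false;                /* result out of int32 range */
        out[i] = (int32_t) sum;
    }
    return true;
}
```

In a real extension, the failure cases would presumably be reported through PostgreSQL's error mechanism rather than a boolean return value; that part is left out here to keep the sketch self-contained.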
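My question is, why is the C version so much faster? I expected an improvement, but 20x seems a bit much. What's going on? Is there something inherently slow about accessing arrays in PL/pgSQL?
I'm running PostgreSQL 9.0.2 on Fedora Core 8 64-bit. The machine is a High-Memory Quadruple Extra-Large EC2 instance.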