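I have long-format data on businesses, with one row for each move to a different location, keyed on business ID. Any one business establishment can have several move events.

I want to reshape this to wide format, which is typically the territory of crosstab() from the tablefunc module.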

+-------------+-----------+---------+---------+
| business_id | year_move |  long   |   lat   |
+-------------+-----------+---------+---------+
|   001013580 |      1991 | 71.0557 | 42.3588 |
|   001015924 |      1993 | 71.0728 | 42.3504 |
|   001015924 |      1996 | -122.28 | 37.654  |
|   001020684 |      1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+

I then transform it like this:

SELECT longbyyear.*
FROM crosstab($$
    SELECT 
    business_id, 
    year_move, 
    max(longitude::float)
    from business_moves
    where year_move::int between 1991 and 2010 
    group by business_id, year_move
    order by business_id, year_move;
    $$
) 
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float, 
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);

It gets me, mostly, the desired output.

+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id  |  long91  |  long92  |  long93  | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … |        |        |        |
| 1000678 | 118.224  |          |          |        | … |        |        |        |
| 1002158 | 121.98   |          |          |        | … |        |        |        |
| 1004092 | 71.2384  |          |          |        | … |        |        |        |
| 1007801 | 118.0312 |          |          |        | … |        |        |        |
| 1007855 | 71.1769  |          |          |        | … |        |        |        |
| 1008697 | 71.0394  | 71.0358  |          |        | … |        |        |        |
| 1008986 | 71.1013  |          |          |        | … |        |        |        |
| 1009617 | 119.9965 |          |          |        | … |        |        |        |
+---------+----------+----------+----------+--------+---+--------+--------+--------+

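The only hitch is that I would ideally have populated values for each year, not just for the years in which a move happened. All fields would then be populated, one value per year, with the most recent address carried over to the next year. I could hack around this with manual updates, filling each blank cell from the previous column, but I wondered whether there is a clever way to do it, either with the crosstab() function or some other way, possibly combined with a custom function.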

2 Answers

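To get the current location of each business_id for any given year, you need two things:

  1. A parameterized query to select the year, implemented as an SQL language function.
  2. A dirty trick to aggregate on year, group by business_id, and leave the coordinates untouched. This is done with a sub-query in a CTE.

The function then looks like this: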

CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
  WITH last_move AS (
    SELECT business_id, MAX(year_move) AS yr
    FROM business_moves
    WHERE year_move <= $1
    GROUP BY business_id)
  SELECT lm.business_id, $1::int AS yr, longitude, latitude
  FROM business_moves bm, last_move lm
  WHERE bm.business_id = lm.business_id
  AND bm.year_move = lm.yr;
$$ LANGUAGE sql;

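The sub-query selects only the most recent move for each business, up to and including the requested year. The main query then adds the longitude and latitude columns and puts the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: this table needs a record giving the establishment and initial location of each business_id, otherwise the business will not show up until after it has moved somewhere else.

Call the function in the usual way: SELECT * FROM business_location_in_year_x(1997). See also the SQL Fiddle.

If you really need a crosstab, you can tweak this code to give you the business location for a range of years and then feed that into the crosstab() function.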
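
One possible way to get that range of years, as a rough sketch only: call the function once per year with generate_series(). This assumes PostgreSQL 9.3 or later (for LATERAL) and that business_moves has the columns business_id, year_move, longitude, latitude in that order, with year_move comparable to an integer. The aliases y, yr and loc are just names picked for this example.

```sql
-- One row per business and year, carrying the latest known location.
-- loc.year_move holds the requested year, not the year of the actual move.
SELECT loc.*
FROM   generate_series(1991, 2010) AS y(yr)
CROSS  JOIN LATERAL business_location_in_year_x(y.yr) AS loc
ORDER  BY loc.business_id, y.yr;
```

A business only appears from the year of its first recorded move onward, so a crosstab() built on this result would still need the two-parameter form, or an initial record per business.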

answered 2014-05-08T06:39:30.760

I am assuming you have the actual date of each business move, so we can make a meaningful pick per year:

CREATE TEMP TABLE business_moves (
  business_id int,  -- why would you use inefficient varchar here?
  move_date date,
  longitude float,
  latitude float);

Based on that, a more meaningful test case:

INSERT INTO business_moves VALUES 
  (001013580, '1991-1-1', 71.0557, 42.3588),
  (001015924, '1993-1-1', 71.0728, 42.3504),
  (001015924, '1993-3-3', 73.0728, 43.3504),  -- 2nd move this year
  (001015924, '1996-1-1', -122.28, 37.654),
  (001020684, '1992-1-1', 84.3381, 33.5775);

Complete, very fast solution

SELECT *
FROM crosstab($$
   SELECT business_id, year
        , first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
   FROM  (
      SELECT *
           , count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
      FROM  (SELECT DISTINCT business_id FROM business_moves) b
      CROSS  JOIN generate_series(1991, 2010) year
      LEFT   JOIN (
         SELECT DISTINCT ON (1,2)
                business_id
              , EXTRACT('year' FROM move_date)::int AS year
              , point(longitude, latitude) AS x
         FROM   business_moves
         WHERE  move_date >= '1991-1-1'
         AND    move_date <  '2011-1-1'
         ORDER  BY 1,2, move_date DESC
         ) bm USING (business_id, year)
      ) sub
   $$
   ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
    ) AS t(biz_id int
         , x91 point, x92 point, x93 point, x94 point, x95 point
         , x96 point, x97 point, x98 point, x99 point, x00 point
         , x01 point, x02 point, x03 point, x04 point, x05 point
         , x06 point, x07 point, x08 point, x09 point, x10 point);

Result:

 biz_id  |        x91        |        x92        |        x93        |        x94        |        x95        |        x96        |        x97        ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
 1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
 1015924 |                   |                   | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654)  | (-122.28,37.654)  ...
 1020684 |                   | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...

Step by step

Step 1

Fix what you had:

SELECT *
FROM crosstab($$
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date) AS year
        , point(longitude, latitude) AS long_lat
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   $$
   ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
   ) AS t(biz_id int
        , x91 point, x92 point, x93 point, x94 point, x95 point
        , x96 point, x97 point, x98 point, x99 point, x00 point
        , x01 point, x02 point, x03 point, x04 point, x05 point
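        , x06 point, x07 point, x08 point, x09 point, x10 point);

  • You want both lat & lon for this to be meaningful, so form a point from the two. Alternatively, you could just concatenate a text representation.

  • You may want even more data. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
    Select first row in each GROUP BY group?

  • As long as values can be missing anywhere in the grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
    PostgreSQL Crosstab Query

  • Adapted the query to work with move_date date instead of year_move.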
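
To see what DISTINCT ON does here in isolation, a small self-contained sketch may help. It uses an inline VALUES list in place of the real table; the alias m and the column name yr are made up for this illustration.

```sql
-- Keep only the latest move per (business_id, yr):
-- DISTINCT ON returns the first row of each group in ORDER BY order.
SELECT DISTINCT ON (business_id, yr)
       business_id, yr, move_date, longitude
FROM  (
   VALUES (1015924, 1993, date '1993-01-01', 71.0728)
        , (1015924, 1993, date '1993-03-03', 73.0728)   -- 2nd move this year
        , (1015924, 1996, date '1996-01-01', -122.28)
   ) AS m(business_id, yr, move_date, longitude)
ORDER  BY business_id, yr, move_date DESC;
-- Returns two rows: the 1993-03-03 row for 1993 and the 1996-01-01 row for 1996.
```

Unlike max(longitude), this keeps longitude and any other column from the same row, so the coordinates stay consistent with each other.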

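Step 2

Address your request:

I would ideally have populated values for each year

Build a complete grid of values (one cell per business and year) with a CROSS JOIN of businesses and years: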

SELECT *
FROM  (SELECT DISTINCT business_id FROM business_moves) b
CROSS  JOIN generate_series(1991, 2010) year
LEFT   JOIN (
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date)::int AS year
        , point(longitude, latitude) AS x
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
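   ) bm USING (business_id, year)

  • The set of years comes from a generate_series() call.

  • The distinct businesses come from a separate SELECT. You may have a table of businesses that you could use instead (and more cheaply)? That would also account for businesses that never moved.

  • LEFT JOIN to the actual business moves per year to arrive at a complete grid of values.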
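
If such a table of businesses exists, the grid could be built from it directly. The following is only a sketch: the table name businesses is hypothetical and is assumed to hold one row per business with a business_id column.

```sql
SELECT *
FROM   businesses b                          -- hypothetical table, one row per business
CROSS  JOIN generate_series(1991, 2010) year
LEFT   JOIN (
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date)::int AS year
        , point(longitude, latitude) AS x
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   ) bm USING (business_id, year);
```

Businesses with no rows in business_moves still get their 20 grid cells this way, all with x set to NULL.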

Step 3

Fill in defaults:

with the most recent address carried over to the next year.

SELECT business_id, year
     , COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
               ,'(0,0)') AS x
FROM  (
   SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
   FROM  (SELECT DISTINCT business_id FROM business_moves) b
   CROSS  JOIN generate_series(1991, 2010) year
   LEFT   JOIN (
      SELECT DISTINCT ON (1,2)
             business_id
           , EXTRACT('year' FROM move_date)::int AS year
           , point(longitude, latitude) AS x
      FROM   business_moves
      WHERE  move_date >= '1991-1-1'
      AND    move_date <  '2011-1-1'
      ORDER  BY 1,2, move_date DESC
      ) bm USING (business_id, year)
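   ) sub;

  • In the subquery sub, which builds on the query from step 2, form groups (grp) of cells that share the same location.

    To do that, use the well-known aggregate function count() as a window aggregate function. NULL values are not counted, so the count increases with every actual move, thereby forming groups of cells that share the same location.

  • In the outer query, pick the first value per group for every row in the same group with the window function first_value(). Voilà.

  • On top of that, optionally (!) wrap it in COALESCE to fill the remaining cells with unknown location (no move yet) with (0,0). If you do that, no NULL values remain and you can use the simpler form of crosstab(). That is a matter of taste.

SQL Fiddle with the base queries. crosstab() is currently not installed on SQL Fiddle.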
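
The grouping trick is easier to follow on a toy example. This sketch uses a single made-up series with text labels 'A' and 'B' instead of points, and no business_id, so only the count() / first_value() mechanics remain; the names t, yr and filled are illustrative.

```sql
SELECT yr, x, grp
     , first_value(x) OVER (PARTITION BY grp ORDER BY yr) AS filled
FROM  (
   SELECT yr, x
        , count(x) OVER (ORDER BY yr) AS grp   -- running count of non-NULL values
   FROM  (
      VALUES (1991, NULL::text)
           , (1992, 'A')
           , (1993, NULL)
           , (1994, NULL)
           , (1995, 'B')
           , (1996, NULL)
      ) AS t(yr, x)
   ) sub
ORDER  BY yr;
--  yr  | x | grp | filled
-- 1991 |   |   0 |
-- 1992 | A |   1 | A
-- 1993 |   |   1 | A
-- 1994 |   |   1 | A
-- 1995 | B |   2 | B
-- 1996 |   |   2 | B
```

Each non-NULL value starts a new group, and first_value() spreads it over the following empty cells. Group 0 (before the first move) stays NULL unless you add the COALESCE.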

Step 4

Use the query from step 3 in an updated crosstab() call.
All in all, this should be about as fast as possible. Indexes may help.

answered 2014-05-11T04:06:11.847
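
For the COALESCE variant from step 3, the call with the simpler one-parameter crosstab() might look like the sketch below. It assumes the tablefunc extension is installed and relies on every business having exactly 20 rows (one per year, none NULL), ordered by business and year. The ORDER BY 1, 2 is added here for that reason.

```sql
SELECT *
FROM crosstab($$
   SELECT business_id, year
        , COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
                  ,'(0,0)') AS x   -- (0,0) marks "no move recorded yet"
   FROM  (
      SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
      FROM  (SELECT DISTINCT business_id FROM business_moves) b
      CROSS  JOIN generate_series(1991, 2010) year
      LEFT   JOIN (
         SELECT DISTINCT ON (1,2)
                business_id
              , EXTRACT('year' FROM move_date)::int AS year
              , point(longitude, latitude) AS x
         FROM   business_moves
         WHERE  move_date >= '1991-1-1'
         AND    move_date <  '2011-1-1'
         ORDER  BY 1,2, move_date DESC
         ) bm USING (business_id, year)
      ) sub
   ORDER  BY 1, 2
   $$
   ) AS t(biz_id int
        , x91 point, x92 point, x93 point, x94 point, x95 point
        , x96 point, x97 point, x98 point, x99 point, x00 point
        , x01 point, x02 point, x03 point, x04 point, x05 point
        , x06 point, x07 point, x08 point, x09 point, x10 point);
```

The one-parameter form fills the output columns from left to right, so it only lines up when no cell is missing. The two-parameter form shown at the top does not depend on that.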