I have this table.

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |    val2      | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |      1       |
   |     B       |       1       |       3       |      5       |
   |     A       |       2       |       6       |      7       |
   |     B       |       2       |      10       |     12       |
   |     A       |       4       |       6       |      7       |
   |     B       |       4       |      20       |     26       |
   +------------------------------------------------------------+

What I want to get, for each row, is:

ks |  time |  val1 | val1 of next ts of same ks  |

To be clear, the result for the example above should be:

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |   next.val1  | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |       6      |
   |     B       |       1       |       3       |       10     |
   |     A       |       2       |       6       |       6      |
   |     B       |       2       |      10       |       20     |
   |     A       |       4       |       6       |      null    |
   |     B       |       4       |      20       |      null    |
   +------------------------------------------------------------+

(I need the same "next" value for val2 as well.)

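I have tried hard to come up with a Hive query for this, but have had no luck so far. As described here (Quassnoi's answer), I was able to write a query for this in SQL, but I could not create the equivalent in Hive, because Hive does not support subqueries in SELECT.

Can someone help me achieve this?

Thanks in advance.

Edit:

The query I tried is: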

SELECT ks, time, val1, next[0] as next.val1 from
(SELECT ks, time, val1
       COALESCE(
       (
       SELECT Val1, time
       FROM myTable mi
       WHERE mi.val1 > m.val1 AND mi.ks = m.ks
       ORDER BY time
       LIMIT 1
       ), CAST(0 AS BIGINT)) AS next
FROM  myTable m
ORDER BY time) t2;

2 Answers

Your query seems quite similar to the "year ago" reporting that is ubiquitous in financial reporting. I think a LEFT OUTER JOIN is what you are looking for.

We join table myTable to itself, naming the two instances of the same table m and n. For every entry in the first table m we will attempt to find a matching record in n with the same ks value but an incremented value of time. If this record does not exist, all column values for n will be NULL.

SELECT 
    m.ks, 
    m.time,
    m.val1, 
    n.val1 as next_val1,
    m.val2, 
    n.val2 as next_val2
FROM 
    myTable m
LEFT OUTER JOIN
    myTable n
ON (
    m.ks = n.ks
AND 
    m.time + 1 = n.time
);

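For data whose time values are consecutive within each ks (1, 2, 3 here), this returns the following. A row with no match at time + 1 gets NULL in the n columns.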

ks  time  val1  next_val1  val2  next_val2
A   1     1     6          1     7
A   2     6     6          7     7
A   3     6     NULL       7     NULL
B   1     3     10         5     12
B   2     10    20         12    26
B   3     20    NULL       26    NULL
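
To see how the join condition behaves on the question's own data, where time jumps from 2 to 4, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite is only a stand-in for Hive here (an assumption made so the example can run anywhere); the table and column names are taken from the question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (ks TEXT, time INTEGER, val1 INTEGER, val2 INTEGER)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        ("A", 1, 1, 1), ("B", 1, 3, 5),
        ("A", 2, 6, 7), ("B", 2, 10, 12),
        ("A", 4, 6, 7), ("B", 4, 20, 26),  # note the gap: there is no time = 3
    ],
)

# The same self-join as above, with an ORDER BY added for stable output.
SELF_JOIN = """
SELECT m.ks, m.time, m.val1, n.val1 AS next_val1, m.val2, n.val2 AS next_val2
FROM myTable m
LEFT OUTER JOIN myTable n
  ON m.ks = n.ks AND m.time + 1 = n.time
ORDER BY m.ks, m.time
"""

rows = conn.execute(SELF_JOIN).fetchall()
for row in rows:
    print(row)
```

In this sketch the rows at time 2 come back with None for both next values, because no row exists at time 3. The join finds a "next" row only when time increases by exactly 1 within each ks.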

Hope that helps.

Answered 2013-05-16T12:27:39.040

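I have found that Hive's custom map/reduce functionality works well for queries like this one. It gives you the chance to look at a set of input rows and "reduce" them to one (or more) results.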

This answer discusses the solution.

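The key is to use CLUSTER BY so that all rows with the same key value go to the same reducer, and therefore to the same reduce script. The script collects rows for the current key, outputs the reduced result when the key changes, and then starts collecting for the new key.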
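
A minimal sketch of what such a reduce script could look like in Python is below. It is not the script from the linked answer. The function name `with_next` and the input layout (tab-separated `ks`, `time`, `val1`, `val2`) are assumptions for illustration, and the sketch also assumes rows arrive grouped by `ks` and ordered by `time` within each `ks`.

```python
from itertools import groupby

# Text form of NULL that Hive uses by default for TRANSFORM scripts.
NULL = r"\N"


def with_next(lines):
    """Yield tab-separated 'ks, time, val1, next_val1, val2, next_val2' lines.

    Assumes each input line is 'ks<TAB>time<TAB>val1<TAB>val2' and that lines
    are grouped by ks and sorted by time within each ks (hypothetical layout).
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for _, group in groupby(rows, key=lambda r: r[0]):
        prev = None
        for row in group:
            if prev is not None:
                # The current row supplies the "next" values for the previous row.
                yield "\t".join([prev[0], prev[1], prev[2], row[2], prev[3], row[3]])
            prev = row
        # The last row of each key has no successor, so emit NULLs for it.
        yield "\t".join([prev[0], prev[1], prev[2], NULL, prev[3], NULL])


# Demo on the question's data, already grouped by ks and sorted by time.
sample = [
    "A\t1\t1\t1", "A\t2\t6\t7", "A\t4\t6\t7",
    "B\t1\t3\t5", "B\t2\t10\t12", "B\t4\t20\t26",
]
for out in with_next(sample):
    print(out)
```

Used as an actual Hive script, one would pass `sys.stdin` to `with_next` and print each yielded line. Since CLUSTER BY ks sorts by ks only, the time ordering assumed here would likely need DISTRIBUTE BY ks SORT BY ks, time in the query that feeds the script. Unlike a join on time + 1, this approach does not depend on time values being consecutive.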

Answered 2013-05-16T13:57:38.830