hive - In Hive, how do I do a calculation between 2 rows?

Question

I have this table.

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |    val2      | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |      1       |
   |     B       |       1       |       3       |      5       |
   |     A       |       2       |       6       |      7       |
   |     B       |       2       |      10       |     12       |
   |     A       |       4       |       6       |      7       |
   |     B       |       4       |      20       |     26       |
   +------------------------------------------------------------+

What I want to get is, for each row,

ks |  time |  val1 | val1 of next ts of same ks  |

To be clear, the result for the above example should be,

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |   next.val1  | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |       6      |
   |     B       |       1       |       3       |       10     |
   |     A       |       2       |       6       |       6      |
   |     B       |       2       |      10       |       20     |
   |     A       |       4       |       6       |      null    |
   |     B       |       4       |      20       |      null    |
   +------------------------------------------------------------+

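(I need the same "next" value for val2 as well.)

I have tried hard to come up with a Hive query for this, but still no luck. As described here (Quassnoi's answer), I was able to write a query for this in SQL, but I could not create the equivalent in Hive, because Hive does not support subqueries in the SELECT list.

Can someone please help me achieve this?

Thanks in advance.

EDIT:

The query I tried was,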

SELECT ks, time, val1, next[0] as next.val1 from
(SELECT ks, time, val1
       COALESCE(
       (
       SELECT Val1, time
       FROM myTable mi
       WHERE mi.val1 > m.val1 AND mi.ks = m.ks
       ORDER BY time
       LIMIT 1
       ), CAST(0 AS BIGINT)) AS next
FROM  myTable m
ORDER BY time) t2;

score 2 · Accepted Answer

Your query seems quite similar to the "year ago" reporting that is ubiquitous in financial reporting. I think a LEFT OUTER JOIN is what you are looking for.

We join table myTable to itself, naming the two instances of the same table m and n. For every entry in the first table m we will attempt to find a matching record in n with the same ks value but an incremented value of time. If this record does not exist, all column values for n will be NULL.

SELECT 
    m.ks, 
    m.time,
    m.val1, 
    n.val1 as next_val1,
    m.val2, 
    n.val2 as next_val2
FROM 
    myTable m
LEFT OUTER JOIN
    myTable n
ON (
    m.ks = n.ks
AND 
    m.time + 1 = n.time
);

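Returns the following, assuming the time values within each ks are consecutive (1, 2, 3). Where there is a gap, as in the question's data where time jumps from 2 to 4, `m.time + 1 = n.time` finds no match for the row before the gap, so its next_ columns come back NULL.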

ks  time  val1  next_val1  val2  next_val2
A   1     1     6          1     7
A   2     6     6          7     7
A   3     6     NULL       7     NULL
B   1     3     10         5     12
B   2     10    20         12    26
B   3     20    NULL       26    NULL
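
If the time values are not consecutive, a window function may be a better fit than the self-join. This is a different technique from the join above, shown only as a minimal sketch: it assumes Hive 0.11 or later (where windowing functions such as LEAD are available) and the same myTable as in the question. Depending on the Hive version, time may be treated as a reserved word and need backquotes.

```sql
-- Sketch, not part of the original answer: for each row, read val1/val2
-- from the next row of the same ks, ordered by time.
-- LEAD(col) uses an offset of 1 and yields NULL when no next row exists.
SELECT
    ks,
    time,
    val1,
    LEAD(val1) OVER (PARTITION BY ks ORDER BY time) AS next_val1,
    val2,
    LEAD(val2) OVER (PARTITION BY ks ORDER BY time) AS next_val2
FROM
    myTable;
```

On the question's data (time 1, 2, 4), this gives next_val1 of 6, 6, NULL for ks A and 10, 20, NULL for ks B, which matches the result the question asks for. Row order in the output is not guaranteed without an outer ORDER BY.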

Hope that helps.

score 2 · Accepted Answer

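I have found that Hive's custom map/reduce functionality works well for queries similar to this one. It gives you the opportunity to consider a set of inputs and "reduce" them to one (or more) results.

This answer discusses the solution.

The key is that you use CLUSTER BY to send all rows with the same key value to the same reducer, and hence to the same reduce script. The script collects the rows for a key, outputs the reduced result when the key changes, and then starts collecting for the new key.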
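
The plumbing for that approach could look roughly like the sketch below. It is only an illustration under assumptions: the table and column names come from the question, and '/bin/cat' is an identity placeholder standing in for a real reduce script, which is not shown here. CLUSTER BY ks is shorthand for DISTRIBUTE BY ks SORT BY ks; the sketch spells out both clauses so that rows also arrive ordered by time within each key.

```sql
-- Sketch: route all rows of one ks to the same reducer, sorted by time,
-- and stream them through a script as tab-separated lines.
-- '/bin/cat' just echoes its input; a real reduce script would remember
-- the previous row for the current ks and emit it once the next row
-- (or a new ks) arrives.
SELECT TRANSFORM (s.ks, s.time, s.val1, s.val2)
       USING '/bin/cat'
       AS (ks, time, val1, val2)
FROM (
    SELECT ks, time, val1, val2
    FROM myTable
    DISTRIBUTE BY ks
    SORT BY ks, time
) s;
```

The columns returned by TRANSFORM are strings unless types are declared in the AS clause, so a real script's output may need casting.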
