sql - Calculating a running sum with a Hive UDF

Question

I am new to Hive, so please forgive any ignorance in what follows. I have a table like this:

SELECT a.storeid, a.smonth, a.sales FROM table a;
1001    1       35000.0
1002    2       35000.0
1001    2       25000.0
1002    3       110000.0
1001    3       40000.0
1002    1       40000.0

My target output is:

1001    1       35000.0 35000.0
1001    2       25000.0 60000.0
1001    3       40000.0 100000.0
1002    1       40000.0 40000.0
1002    2       35000.0 75000.0
1002    3       110000.0 185000.0

I wrote a simple Hive UDF sum class to achieve this, and used SORT BY storeid, smonth in the query:

SELECT a.storeid, a.smonth, a.sales, rsum(sales)
FROM (SELECT * FROM table SORT BY storeid, smonth) a;

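Obviously, this does not produce the output above: there is only one reducer, so the same UDF instance is called for every row and it produces a running sum over the whole set. My goal is to reset the runningSum instance variable in the UDF class for each storeid, so that the evaluate function returns the output above. I have tried the following approaches:

1. Pass the storeid as an extra argument, rsum(sales, storeid), so that the UDF class can handle the reset itself.
2. Use 2 reducers, as in the following query: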

set mapred.reduce.tasks=2;
SELECT a.storeid, a.smonth, a.sales, rsum(sales)
FROM (SELECT * FROM table DISTRIBUTE BY storeid SORT BY storeid, smonth) a;

1002    1       40000.0 40000.0
1002    2       35000.0 75000.0
1002    3       110000.0 185000.0
1001    1       35000.0 35000.0
1001    2       25000.0 60000.0
1001    3       40000.0 100000.0

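Why does 1002 always appear on top? Beyond the approaches above, I would like your suggestions on other approaches (for example subqueries or joins). Also, what is the time complexity of the approach you suggest?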
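The first approach in the question, passing storeid into the UDF so it can reset its state, can be sketched outside Hive. The Python below is only an illustration under assumptions: the class name RunningSum and its evaluate method are hypothetical stand-ins for the Java UDF, and rows are assumed to reach a single instance already sorted by storeid and smonth.

```python
class RunningSum:
    """Hypothetical stand-in for a stateful UDF: one instance sees rows in order."""

    def __init__(self):
        self.last_key = None
        self.running_sum = 0.0

    def evaluate(self, value, key):
        # Reset the accumulator whenever the grouping key changes.
        if key != self.last_key:
            self.last_key = key
            self.running_sum = 0.0
        self.running_sum += value
        return self.running_sum


# Sample rows from the question: (storeid, smonth, sales).
rows = [
    (1001, 1, 35000.0), (1002, 2, 35000.0), (1001, 2, 25000.0),
    (1002, 3, 110000.0), (1001, 3, 40000.0), (1002, 1, 40000.0),
]

rsum = RunningSum()
# sorted() plays the role of SORT BY storeid, smonth on a single task.
result = [(s, m, v, rsum.evaluate(v, s)) for s, m, v in sorted(rows)]

for row in result:
    print(*row, sep="\t")
```

Under these assumptions the work is one sort followed by one linear pass over the rows. The reset only works if all rows for a store arrive contiguously, which is why the sort (and, with several reducers, the distribution by storeid) matters.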

score 9 · Accepted Answer

Hive offers a better way to do this in a single statement.
Follow the steps below to get your target output.

Create a Hive table that can hold your dataset:

1001    1       35000.0
1002    2       35000.0
1001    2       25000.0
1002    3       110000.0
1001    3       40000.0
1002    1       40000.0

Now just run the following command in your Hive terminal:

SELECT storeid, smonth, sales, SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth) FROM table_name;

The output will look like this:

1001  1  35000.0  35000.0
1001  2  25000.0  60000.0
1001  3  40000.0  100000.0
1002  1  40000.0  40000.0
1002  2  35000.0  75000.0
1002  3  110000.0 185000.0

I hope this helps you get your target output.
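For readers who want to see what SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth) computes, here is a minimal Python sketch of the same semantics. It is not how Hive executes the query; the function name running_sum_over is hypothetical, and it assumes smonth is unique within each store, so ties in the ordering column do not come into play.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Sample rows from the question: (storeid, smonth, sales).
rows = [
    (1001, 1, 35000.0), (1002, 2, 35000.0), (1001, 2, 25000.0),
    (1002, 3, 110000.0), (1001, 3, 40000.0), (1002, 1, 40000.0),
]


def running_sum_over(rows):
    """Cumulative sales per store, ordered by month (hypothetical helper)."""
    out = []
    # Sorting by (storeid, smonth) groups each partition and orders it.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):  # PARTITION BY storeid
        group = list(group)
        totals = accumulate(sales for _, _, sales in group)  # running SUM(sales)
        out.extend((s, m, v, t) for (s, m, v), t in zip(group, totals))
    return out


for row in running_sum_over(rows):
    print(*row, sep="\t")
```

Each partition restarts its own total, so no hand-written state or reset logic is needed, which is the main appeal of the window function over a custom UDF.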

score 4 · Accepted Answer

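Alternatively, you can look at the Hive ticket that contains several function extensions.
Among others, it includes a cumulative sum implementation (GenericUDFSum).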

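This function (called "rsum") takes two arguments: the hash of the id (by which records are partitioned across reducers) and the corresponding value to be summed: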

select t.storeid, t.smonth, t.sales, rsum(hash(t.storeid),t.sales) as sales_sum 
  from (select storeid, smonth, sales from sm distribute by hash(storeid) 
    sort by storeid, smonth) t;

1001  1  35000.0  35000.0
1001  2  25000.0  60000.0
1001  3  40000.0  100000.0
1002  1  40000.0  40000.0
1002  2  35000.0  75000.0
1002  3  110000.0 185000.0
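The interaction of DISTRIBUTE BY, SORT BY and a keyed running sum can be simulated in a few lines of Python. This is a rough sketch, not Hive's actual implementation: the function name distribute_sort_rsum is hypothetical, and using storeid % num_reducers as the partitioning hash is an assumption made purely for illustration.

```python
# Sample rows from the question: (storeid, smonth, sales).
rows = [
    (1001, 1, 35000.0), (1002, 2, 35000.0), (1001, 2, 25000.0),
    (1002, 3, 110000.0), (1001, 3, 40000.0), (1002, 1, 40000.0),
]


def distribute_sort_rsum(rows, num_reducers):
    """Simulate DISTRIBUTE BY storeid SORT BY storeid, smonth plus a keyed rsum."""
    # DISTRIBUTE BY: route every row for a given store to the same bucket.
    # The modulo "hash" is an illustrative assumption, not Hive's real function.
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[row[0] % num_reducers].append(row)

    output = []
    for bucket in buckets:
        # SORT BY orders rows only within one reducer, not globally.
        bucket.sort(key=lambda r: (r[0], r[1]))
        last_key, total = None, 0.0
        for storeid, smonth, sales in bucket:
            if storeid != last_key:  # new store: restart the running sum
                last_key, total = storeid, 0.0
            total += sales
            output.append((storeid, smonth, sales, total))
    return output


for row in distribute_sort_rsum(rows, num_reducers=2):
    print(*row, sep="\t")
```

With two simulated reducers, store 1002 lands in bucket 0 and is emitted before store 1001, which is at least consistent with the ordering reported in the question. Because SORT BY only sorts within each reducer, a globally ordered result would need an explicit ORDER BY on the outer query.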

score 0 · Accepted Answer

SELECT storeid, smonth, sales, sum(sales) over(partition by storeid order by smonth) as rsum FROM table ;

score 0 · Accepted Answer

This should do the trick:

SELECT 
    a.storeid, 
    a.smonth,
    a.sales,
    SUM(a.sales) 
OVER (
    PARTITION BY a.storeid 
    ORDER BY a.smonth asc 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM 
    table a;

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
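The explicit ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame only changes the result when the ORDER BY column has ties. According to the page linked above, omitting the frame while keeping ORDER BY defaults to a RANGE frame, in which rows with equal order keys share one total. The Python sketch below contrasts the two behaviours; the function names and the sample data with a duplicated month are hypothetical, and the input is assumed to be already sorted by the order key.

```python
def rows_frame_sum(pairs):
    """ROWS frame: each row adds only itself to the running total."""
    out, total = [], 0.0
    for _, value in pairs:
        total += value
        out.append(total)
    return out


def range_frame_sum(pairs):
    """RANGE frame: rows with the same order key (peers) get the same total."""
    return [sum(v for k, v in pairs if k <= key) for key, _ in pairs]


# Hypothetical (smonth, sales) rows for one store, with month 2 appearing twice.
data = [(1, 10.0), (2, 20.0), (2, 5.0), (3, 1.0)]

print(rows_frame_sum(data))   # [10.0, 30.0, 35.0, 36.0]
print(range_frame_sum(data))  # [10.0, 35.0, 35.0, 36.0]
```

In the question's data every store has one row per month, so both frames give the same running sum there.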
