apache-spark - Hive TRANSFORM receives NULL for concatenated array values

Question

I have a Hive table in the following format:

    col1        col2     col3
    a1          b1       c1
    a1          b1       c2
    a1          b2       c2
    a1          b2       c3
    a2          b3       c1
    a2          b4       c1
    a2          b4       c2
    a2          b4       c3
    .
    .

Each value in col1 can have multiple values in col2, and each (col1, col2) pair can have multiple values of col3.

I am running the query [Q]:

select col1, col2, collect_list(col3) from {table} group by col1, col2;

to get:

a1   b1   [c1, c2]
a1   b2   [c2, c3]
a2   b3   [c1]
a2   b4   [c1, c2, c3]

I want to apply some transformations with a Python UDF, so I pass all of these columns to the UDF using the TRANSFORM clause:

select TRANSFORM ( * ) using 'python udf.py' FROM 
(
select col1, col2, concat_ws('\t', collect_list(col3)) from {table} group by col1, col2
)

I am using concat_ws to turn the array returned by collect_list into a string joined by a delimiter. I get col1 and col2 in the result, but not the col3 output.

+---------+---------+
|      key|    value|
+---------+---------+
|a1       | b1      |
|         |     null|
|a1       | b2      |
|         |     null|
|a2       | b3      |
|         |     null|
|a2       | b4      |
|         |     null|
+---------+---------+

My UDF contains only a print statement that prints each line received from standard input.

import sys
for line in sys.stdin:
    try:
        print line
    except Exception as e:
        continue

Can someone help me figure out why col3 is not showing up in my UDF?

score 1 · Accepted Answer

First, you need to parse the line in the Python UDF, for example:

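import sys

for line in sys.stdin:
    try:
        # Drop the trailing newline so print does not emit a blank line.
        line = line.strip('\n')
        # TRANSFORM sends each row as tab-separated fields.
        col1, col2, col3 = line.split('\t')
        print('\t'.join([col1, col2, col3]))
    except ValueError:
        # Skip rows that do not have exactly three fields.
        continue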
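
To see why stripping and parsing the line matters, here is a minimal sketch that replays two made-up input rows through both loops in plain Python 3, with no Hive involved. The helper names `echo_raw` and `echo_parsed` and the sample rows are hypothetical, and the sketch assumes the default behaviour where each row reaches the script as one tab-separated line.

```python
import io

def echo_raw(stdin, stdout):
    """Mimic the question's loop: print each line without stripping it."""
    for line in stdin:
        # line still ends with '\n', and print appends a second one.
        print(line, file=stdout)

def echo_parsed(stdin, stdout):
    """Mimic the answer's loop: strip, split on tab, re-join."""
    for line in stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 3:
            continue  # skip rows that are not exactly three columns
        print('\t'.join(fields), file=stdout)

# Hypothetical rows, as they would look with a ',' delimiter inside col3.
rows = "a1\tb1\tc1,c2\na1\tb2\tc2,c3\n"

raw_out = io.StringIO()
echo_raw(io.StringIO(rows), raw_out)

parsed_out = io.StringIO()
echo_parsed(io.StringIO(rows), parsed_out)

# The raw echo emits an empty line after every row.
print(raw_out.getvalue().splitlines())
# The parsed echo reproduces the input unchanged.
print(parsed_out.getvalue() == rows)
```

The empty lines from the unstripped loop are a plausible source of the extra rows with an empty key and a null value shown in the question's output.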

Then, it is better to use a delimiter other than \t in concat_ws:

select TRANSFORM ( * )  using 'python udf.py' as (col1, col2, col3)
FROM 
(
select col1, col2, concat_ws(',', collect_list(col3)) from {table} group by col1, col2
) t;
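
The reason for avoiding \t can be sketched in a few lines of Python 3. When col3 is joined with tabs, its values become indistinguishable from column separators, so the line no longer splits into three fields. The helper `parse_row` and the sample lines below are made up for illustration.

```python
def parse_row(line, list_sep=','):
    """Split one input line into col1, col2 and the list of col3 values."""
    col1, col2, col3 = line.rstrip('\n').split('\t')
    return col1, col2, col3.split(list_sep)

# The same row, with col3 joined by '\t' and by ','.
tab_joined = "a2\tb4\tc1\tc2\tc3\n"
comma_joined = "a2\tb4\tc1,c2,c3\n"

# Joined with tabs: five fields instead of three, so unpacking would fail.
print(len(tab_joined.rstrip('\n').split('\t')))

# Joined with commas: three fields, and col3 can be split back into a list.
print(parse_row(comma_joined))
```

With the comma delimiter, the UDF can recover the original array by splitting col3 once more; any character that cannot occur in the col3 values would work equally well.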
