python - Workaround for "select transform" with a Python UDF in Hive

Asked 2017-01-23T06:30:28.337

Viewed 978 times

Is there a way to get all columns in the output without passing every column to select transform()?

For example, I have a Hive table with these columns:

c1, c2, c3, c4, c5, c6, c7, c8, c9, c10

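I am applying a transformation to columns c8, c9, c10, and the output should contain c1, c2, c3, c4, c5, c6, c7, co, where co is the output of the transformation on columns c8, c9, c10.

There is one way to do this: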

select transform (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10)
using 'python udf_name'
as (c1,c2,c3,c4,c5,c6,c7,co)
from table_name;

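The problem is that I don't want to pass all columns to select transform: my table has nearly 900 columns, which makes it hard to tell which columns the UDF actually operates on.

Example: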

#temp
c1, c2, c3, c4  
 a,  1,  0, 5  
 b,   ,  8, 9

Now I want to find the first non-zero, non-null value among columns c2, c3, c4 and print it together with column c1.

Here is the Python UDF.

test.py:

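import sys

# Hive streams each row to stdin as one tab-separated line; NULL arrives as \N.
for line in sys.stdin:
    c = line.rstrip('\n').split('\t')
    # c[0] is c1; scan the remaining columns for the first non-zero value.
    for value in c[1:]:
        try:
            if int(value) != 0:
                # Emit c1 and the value, tab-separated, to match "as (c1,co)".
                print('\t'.join([c[0], value]))
                break
        except ValueError:
            # Empty strings and \N (NULL) are not integers; skip them.
            pass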
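To check the logic without a Hive session, the same scan can be sketched as two small functions and fed the sample rows directly. This is only an illustration: the names `first_nonzero` and `transform` are hypothetical, and the sample input assumes Hive's default streaming format (tab-separated fields, with NULL written as `\N`).

```python
import io

def first_nonzero(fields):
    """Return the first field that parses as a non-zero integer, else None."""
    for value in fields:
        try:
            if int(value) != 0:
                return value
        except ValueError:
            continue  # '' and '\N' (assumed NULL marker) are not integers
    return None

def transform(stream):
    """Yield 'c1<TAB>co' for each tab-separated row that has a match."""
    for line in stream:
        cols = line.rstrip("\n").split("\t")
        co = first_nonzero(cols[1:])
        if co is not None:
            yield cols[0] + "\t" + co

# The two rows of the sample table, encoded as assumed above.
sample = io.StringIO("a\t1\t0\t5\nb\t\\N\t8\t9\n")
for row in transform(sample):
    print(row)
```

Rows with no qualifying value produce no output line, which mirrors the behaviour of the script above.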

I can achieve this by passing all the columns:

select transform (c1,c2,c3,c4)
using 'python test.py'
as (c1,co)
from temp

Output:

c1, co  
 a,  1  
 b,  8

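Problem: I don't want to pass all columns to select transform, because I have 900 columns.

Basically, I want to pass only the columns the UDF uses, not all of them.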
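One possible direction, sketched here under stated assumptions rather than as a confirmed solution, is to stream only a key column plus the columns the UDF needs, and join the result back to the full table. The helper `build_transform_query` below is hypothetical and only builds the query text; it assumes the key column (here c1) is unique and non-null, and the generated HiveQL has not been run against a real cluster.

```python
def build_transform_query(table, key_col, udf_cols, script, out_col="co"):
    """Build a HiveQL string that passes only key_col and udf_cols to the
    script, then joins the script output back to the table on key_col."""
    passed = ", ".join([key_col] + list(udf_cols))
    return (
        "SELECT t.*, x.{out}\n"
        "FROM {table} t\n"
        "JOIN (\n"
        "  SELECT TRANSFORM ({passed})\n"
        "  USING 'python {script}'\n"
        "  AS ({key}, {out})\n"
        "  FROM {table}\n"
        ") x ON t.{key} = x.{key}"
    ).format(out=out_col, table=table, passed=passed,
             key=key_col, script=script)

# Only c1 (the assumed key) and c2, c3, c4 are handed to the script.
print(build_transform_query("temp", "c1", ["c2", "c3", "c4"], "test.py"))
```

Note the trade-offs of this sketch: the inner join drops rows for which the script emits nothing, `t.*` still returns the original input columns alongside co, and the join costs a second scan of the table.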

