I have a large data file that looks like this:

    1   6
    1   6
    2   7
    3   2
    3   6
    1   7
    1   9
    2   9
    1   5
    3   9
    3   1
    2   8

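I want to group the data by the first column, compute the average of the second column for each first-column value, and then sort the groups by that average. So the output should be: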

    2   8
    1   6.6
    3   4.5

My code currently looks like this, and it doesn't work:

    CREATE EXTERNAL TABLE as (a STRING, b INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://myfolder/hive';

    CREATE EXTERNAL TABLE output(a STRING, avgb DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://myfolder/hive';

    load data inpath "s3n://myfolder/file.txt" into TABLE as;
    insert overwrite output select a, avg(b) from as group by a order by avg(b) DESC limit 1000;

I should note that the following does work, but the order by and insert steps, which work for me in SQL, do not:

    select a, avg(b) from as group by a;

When I try:

    select a, avg(b) from as group by a order by avg(b);

I get "FAILED: Error in semantic analysis: Line 1:66 Invalid table alias or column reference 'b': (possible column names are: _col0, _col1)".

1 Answer

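Just move the aggregate into a subquery. As the error message shows, after the group by only the output columns (_col0, _col1) are visible to order by, so `avg(b)` cannot be resolved there. Give the average an alias in the inner query and sort on that alias in the outer one:

    select a, avgb
    from (select a, avg(b) as avgb from as group by a) as t
    order by avgb;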
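
The same subquery can feed the insert step from the question. This is a minimal sketch, not a tested statement: it assumes the `as` and `output` tables defined in the question, and it adds the `table` keyword that Hive's `insert overwrite table ...` syntax expects and that the question's statement leaves out.

```sql
-- Sketch only: the table names `as` and `output` are taken from the question.
-- The inner query computes the per-group average under the alias avgb;
-- the outer query sorts on avgb, which is an ordinary column at that point.
insert overwrite table output
select a, avgb
from (select a, avg(b) as avgb from as group by a) as t
order by avgb desc
limit 1000;
```

The `desc` matches the expected output in the question, where the group with the highest average comes first.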
answered 2013-04-28T09:48:19.960