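I have a table containing user agent strings (which I parse into browser, os, and device columns) and a cityid. I want to find the most popular browser/os/device combination for each city.

Here is my attempt: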

select device, os, browser, name, MAX(hits) as pop from 
(select uap.device, uap.os, uap.browser, name, COUNT(*) as hits 
from (select * from browserdata join citydata on cityid=id) t 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser 
GROUP BY uap.device, uap.os, uap.browser, name) t2 
GROUP BY name;

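So the innermost subquery, aliased t, just joins my table to another table that maps ids to city names, so that I can see the actual names in the output instead of city ids.

Then the subquery named t2 counts the rows for each composite key (device, browser, os, city). The outer query groups everything by name and is meant to extract the row with the highest hit count.

The error I get is this:

FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'device'

I understand what it means. It says I need to include device in the GROUP BY, but if I do that, the query will no longer compute what I want. How can I fix my query?

Also, I've noticed that some of my Hive queries run on MapReduce but not on Tez. Why is that?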

2 Answers

Using analytic functions eliminates the unnecessary join:

WITH 
t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
(select t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, count(*) as count from t2 group by t2.cityname, t2.os, t2.device, t2.browser)

select cityname, maximum,  device, os, browser
 from
     (select cityname, device, browser, os, 
             max(count) over(partition by cityname)                         as maximum,
             dense_rank() over (partition by cityname order by count desc ) as rnk      
      from t3
     ) s  where rnk =1 
;
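
To see how the dense_rank() filter behaves, including on ties, here is a minimal sketch that runs the same pattern in SQLite through Python's sqlite3 module. It is not Hive: the table t2 is filled with made-up, already-parsed rows (standing in for the output of the lateral view), the count alias is renamed to hits, and it assumes the bundled SQLite is 3.25 or newer so that window functions are available.

```python
import sqlite3

# Hypothetical, already-parsed rows standing in for the t2 CTE above.
rows = [
    ("Oslo", "iPhone", "iOS", "Safari"),
    ("Oslo", "iPhone", "iOS", "Safari"),
    ("Oslo", "PC", "Windows", "Chrome"),
    ("Lima", "PC", "Windows", "Chrome"),
    ("Lima", "Mac", "macOS", "Safari"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t2 (cityname TEXT, device TEXT, os TEXT, browser TEXT)"
)
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?, ?)", rows)

# Same shape as the answer: count per combination, then rank within each city.
# Requires SQLite >= 3.25 for window functions.
query = """
WITH t3 AS (
    SELECT cityname, device, os, browser, COUNT(*) AS hits
    FROM t2
    GROUP BY cityname, device, os, browser
)
SELECT cityname, hits, device, os, browser
FROM (
    SELECT t3.*,
           DENSE_RANK() OVER (PARTITION BY cityname ORDER BY hits DESC) AS rnk
    FROM t3
) s
WHERE rnk = 1
ORDER BY cityname, device
"""
result = conn.execute(query).fetchall()

for row in result:
    print(row)
```

In this sample Lima has two combinations tied at one hit each, so both come back with rnk = 1. If exactly one row per city is wanted, row_number() with an explicit tie-breaker in the ORDER BY is the usual alternative to dense_rank().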
answered 2018-12-24T15:21:42.297
WITH t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
  (SELECT t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, COUNT(*) as count FROM t2 GROUP BY t2.cityname, t2.os, t2.device, t2.browser),

t4 as
    (select cityname, MAX(count) as maximum from t3 group by cityname)

select t4.cityname, t4.maximum, t3.device, t3.os, t3.browser
from t4 join t3 on t4.cityname=t3.cityname and t4.maximum=t3.count;

This works, but I'd like to know whether there is a way to optimize it...
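
One direction that might be worth trying is to replace the join of t4 back onto t3 with a window MAX over t3, so that t3 is referenced only once. Whether that is actually faster in Hive depends on the engine and the data, so it should be measured. The sketch below only checks that the two forms return the same rows; it uses SQLite through Python's sqlite3 module (assumed to be SQLite 3.25 or newer), a hypothetical pre-aggregated t3 table, and hits in place of the count alias.

```python
import sqlite3

# Hypothetical pre-aggregated counts standing in for the t3 CTE above.
t3_rows = [
    ("Oslo", "iPhone", "iOS", "Safari", 2),
    ("Oslo", "PC", "Windows", "Chrome", 1),
    ("Lima", "PC", "Windows", "Chrome", 1),
    ("Lima", "Mac", "macOS", "Safari", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t3 "
    "(cityname TEXT, device TEXT, os TEXT, browser TEXT, hits INTEGER)"
)
conn.executemany("INSERT INTO t3 VALUES (?, ?, ?, ?, ?)", t3_rows)

# Form 1: per-city maximum joined back onto t3 (the query above).
join_sql = """
SELECT t4.cityname, t4.maximum, t3.device, t3.os, t3.browser
FROM (SELECT cityname, MAX(hits) AS maximum FROM t3 GROUP BY cityname) t4
JOIN t3 ON t4.cityname = t3.cityname AND t4.maximum = t3.hits
"""

# Form 2: window MAX computed alongside each row, then filtered.
window_sql = """
SELECT cityname, maximum, device, os, browser
FROM (
    SELECT cityname, device, os, browser, hits,
           MAX(hits) OVER (PARTITION BY cityname) AS maximum
    FROM t3
) s
WHERE hits = maximum
"""

join_rows = sorted(conn.execute(join_sql).fetchall())
window_rows = sorted(conn.execute(window_sql).fetchall())

print(join_rows == window_rows)
```

Both forms keep every combination tied for the maximum, so they agree on ties as well.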

answered 2018-12-24T14:30:25.077