您可以使用以下方法获取仅观看一个类别的用户:
select user_id, min(category) as category, sum(view_hour)
from kcp_01
group by user_id
having min(category) = max(category);
注意min(category)是类别,因为只有一个类别。
在 Hive 中,您可能需要执行以下操作:
select user_id, min(category) as min_category,
max(category) as max_category, sum(view_hour)
from kcp_01
group by user_id
having min_category = max_category;
如果您想要每个类别的此类用户的数量,那么子查询会很有帮助:
select category, count(*) as number_single_users
from (select user_id, min(category) as min_category, max(category) as max_category, sum(view_hour) as view_hours
from kcp_01
group by user_id
having min_category) = max_category
) u
group by category;
另一种方法是简单地使用not exists:
select k.category, count(distinct k.user_id)
from kcp_01 k
where not exists (select 1
from kcp_01 k2
where k2.user_id = k.user_id and
k2.category <> k.category
)
group by k.category;
我记得 - 曾几何时 - Hive 难以处理count(distinct),因此两种group by方法可能效果更好。