I have to generate a report that gives me the sum of event counts from tables A, B and C, which are stored in Hive. My S3 bucket is partitioned by Organization_id.

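For example:

Table A – records every day John (and other employees) came to work
Table B – records every call John (and other employees) made or received at work
Table C – records every expense John (and other employees) submitted at work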

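Basically, I want the sum of the A, B and C counts for John (employee_id) for the last month. There should be exactly one record per date whenever any of the three tables A, B or C has a record for that date; if more than one table has records for the same date, their counts are added together. So my output would be: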

Employee id   Employee Name   Date          Count
123           John            02-Jan-2016   55
123           John            12-Jan-2016   88
123           John            19-Jan-2016   103

The query I came up with is:

select  adcts.employee_name, adcts.employee_id,Total_count as event_count, adcts.event_date  
from   
       (select   coalesce(Evts.employee_id,imps.employee_id,AEvts.employee_id) as   employee_id  
        ,   coalesce(Evts.employee_name,imps.employee_name,AEvts.employee_name) as   employee_name  
        , coalesce(Evts.Event_count,0) + coalesce(Imps.Impression_count,0)   + coalesce (AEvts.Event_Count,0)as Total_Count  
        , coalesce (Evts.event_date,imps.impression_date, AEvts.event_date)   as event_date  
    from  
        (select employee_id, employee_name, count(*) as   Event_count,event_date  
         from mm_events  
         where organization_id = 100048  
         and event_date between '2016-02-01' and '2016-02-04'  
        group by employee_id, employee_name,event_date) Evts  
       full outer join  
        (select employee_id, employee_name, count(*) as Impression_count,   impression_date   
         from mm_impressions  
         where organization_id = 100048  
         and impression_date between '2016-02-01' and '2016-02-04'  
        group by employee_id, employee_name,impression_date) Imps  
        on Evts.employee_id = Imps.employee_id  
       full outer join  
        (select employee_id, employee_name, count(*) as   Event_count,event_date  
         from mm_attributed_events  
         where organization_id = 100048  
         and event_date between '2016-02-01' and '2016-02-04'  
         and event_type = 'click'  
        group by employee_id, employee_name,event_date) AEvts  
     on AEvts.employee_id=Evts.employee_id  
       ) adcts     
join  
        (select distinct c.employee_id from default.t1_meta_dmp c   
         where c.employee_dmp_enabled='inherits'  
         and c.agency_dmp_enabled = 'inherits'  
         and c.agency_status='true'  
         and c.employee_status='true'  
         and c.organization_id = 100048) cc  
on adcts.employee_id=cc.employee_id  
order by adcts.employee_id asc  

I have two questions:

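1. Is my query correct?
2. Because I am using FULL OUTER JOIN, I get multiple entries for the same date. Can someone suggest a better way to get the result, possibly with a different query?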


1 Answer


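You get multiple entries for the same date because you group by date in the subqueries but join them only on employee_id. Each row of one subquery therefore matches every date row of the same employee in the other subqueries, which is why your records are duplicated after the join. You should also add event_date to the join conditions.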
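As a rough sketch of that fix, the question's three subqueries could be joined on both employee and date. The first two are combined in a nested subquery so that the last join uses plain, already-coalesced key columns. The aliases `ei` and `cnt` are made up for this illustration, and the query has not been run against the real tables:

```sql
-- Sketch only: same subqueries as in the question, joined on employee AND date.
select coalesce(ei.employee_id, AEvts.employee_id)          as employee_id
     , coalesce(ei.employee_name, AEvts.employee_name)      as employee_name
     , coalesce(ei.event_date, AEvts.event_date)            as event_date
     , coalesce(ei.cnt, 0) + coalesce(AEvts.Event_count, 0) as Total_Count
from
    -- ei (hypothetical alias): events and impressions merged per employee and date
    (select coalesce(Evts.employee_id, Imps.employee_id)     as employee_id
          , coalesce(Evts.employee_name, Imps.employee_name) as employee_name
          , coalesce(Evts.event_date, Imps.impression_date)  as event_date
          , coalesce(Evts.Event_count, 0)
            + coalesce(Imps.Impression_count, 0)             as cnt
     from
         (select employee_id, employee_name, event_date, count(*) as Event_count
          from mm_events
          where organization_id = 100048
            and event_date between '2016-02-01' and '2016-02-04'
          group by employee_id, employee_name, event_date) Evts
     full outer join
         (select employee_id, employee_name, impression_date, count(*) as Impression_count
          from mm_impressions
          where organization_id = 100048
            and impression_date between '2016-02-01' and '2016-02-04'
          group by employee_id, employee_name, impression_date) Imps
       on  Evts.employee_id = Imps.employee_id
       and Evts.event_date  = Imps.impression_date   -- the date was missing here
    ) ei
full outer join
    (select employee_id, employee_name, event_date, count(*) as Event_count
     from mm_attributed_events
     where organization_id = 100048
       and event_date between '2016-02-01' and '2016-02-04'
       and event_type = 'click'
     group by employee_id, employee_name, event_date) AEvts
  on  ei.employee_id = AEvts.employee_id
  and ei.event_date  = AEvts.event_date              -- and here
```

Joining the third subquery against the coalesced keys matters: with a full outer join, Evts.employee_id is NULL for rows that exist only in Imps, so joining AEvts directly on Evts columns would fail to match those rows.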

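It seems you do not need a FULL JOIN at all. A join is generally more expensive than UNION ALL. Instead, select from each table, combine the results with UNION ALL, then group by employee_name, employee_id, event_date and aggregate the counts: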

select employee_id, employee_name, sum(Event_count) as Total_Count, event_date
from
    (
    select employee_id, employee_name, count(*) as Event_count, event_date
    from mm_events
    where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04'
    group by employee_id, employee_name, event_date

    union all
    select employee_id, employee_name, count(*) as Event_count, impression_date as event_date
    from mm_impressions
    where organization_id = 100048 and impression_date between '2016-02-01' and '2016-02-04'
    group by employee_id, employee_name, impression_date

    union all
    select employee_id, employee_name, count(*) as Event_count, event_date
    from mm_attributed_events
    where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04' and event_type = 'click'
    group by employee_id, employee_name, event_date
    ) adcts
group by employee_id, employee_name, event_date

Add your join with the cc subquery to the query above.
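One way this might look, reusing the cc subquery from the question unchanged. This is an untested sketch; the final order by is an addition for readable output and is not required:

```sql
-- Sketch only: UNION ALL aggregation restricted to the employees returned by cc.
select adcts.employee_id
     , adcts.employee_name
     , adcts.event_date
     , sum(adcts.Event_count) as Total_Count
from
    (
    select employee_id, employee_name, count(*) as Event_count, event_date
    from mm_events
    where organization_id = 100048
      and event_date between '2016-02-01' and '2016-02-04'
    group by employee_id, employee_name, event_date

    union all
    select employee_id, employee_name, count(*) as Event_count, impression_date as event_date
    from mm_impressions
    where organization_id = 100048
      and impression_date between '2016-02-01' and '2016-02-04'
    group by employee_id, employee_name, impression_date

    union all
    select employee_id, employee_name, count(*) as Event_count, event_date
    from mm_attributed_events
    where organization_id = 100048
      and event_date between '2016-02-01' and '2016-02-04'
      and event_type = 'click'
    group by employee_id, employee_name, event_date
    ) adcts
join
    -- cc: the employee filter from the question, copied as-is
    (select distinct c.employee_id
     from default.t1_meta_dmp c
     where c.employee_dmp_enabled = 'inherits'
       and c.agency_dmp_enabled = 'inherits'
       and c.agency_status = 'true'
       and c.employee_status = 'true'
       and c.organization_id = 100048) cc
  on adcts.employee_id = cc.employee_id
group by adcts.employee_id, adcts.employee_name, adcts.event_date
order by adcts.employee_id, adcts.event_date
```

Because cc selects distinct employee_id values, the inner join only filters rows and cannot multiply them, so the summed counts stay the same for the employees that remain.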

All subqueries in the UNION ALL will run in parallel.
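Whether the branches really overlap depends on the execution engine and session settings. As an assumption about the setup (the question does not say which engine is used), on the MapReduce engine independent stages run one after another unless parallel execution is switched on, for example:

```sql
-- Assumption: Hive on MapReduce; allow independent stages to run concurrently.
set hive.exec.parallel=true;
-- Optional: upper bound on concurrently running stages (8 is Hive's default).
set hive.exec.parallel.thread.number=8;
```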

Answered 2017-03-19T10:08:42.400