
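I have a customer transaction table in which every item a customer purchases is stored as its own row, so a single transaction can span multiple rows. The table also has a column named visit_date and a categorical column named cal_month_nbr, which ranges from 1 to 12 depending on the month in which the transaction occurred.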

The data looks like this:

Id          visit_date     Cal_month_nbr
----        ------          ------
1           01/01/2020      1
1           01/02/2020      1
1           01/01/2020      1
2           02/01/2020      2
1           02/01/2020      2
1           03/01/2020      3
3           03/01/2020      3

First, I want to know how many times each customer visits per month, based on visit_date. That is, I want the output below:

id    cal_month_nbr       visit_per_month
---        ---------     ----
1           1             2
1           2             1
1           3             1
2           2             1
3           3             1

Second, what is the average monthly visit frequency for each id? That is:

id            Avg_freq_per_month
----          -------------
1              1.33
2              1
3              1

I tried the following query, but it counts every item as a separate transaction:

select avg(count_e) as num_visits_per_month, individual_id
from (
    select r.individual_id, r.cal_month_nbr, count(*) as count_e
    from ww_customer_dl_secure.cust_scan r
    group by r.individual_id, r.cal_month_nbr
    order by count_e desc
) as t
group by individual_id

I would appreciate any help, guidance, or suggestions.


1 Answer


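You can divide the total number of rows by the number of distinct months (note that count(*) counts every item row, not every visit):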

select individual_id,
       count(*) / count(distinct cal_month_nbr)
from  ww_customer_dl_secure.cust_scan c
group by individual_id;

If you want the average number of distinct visit days per month, then:

select individual_id,
       count(distinct visit_date) / count(distinct cal_month_nbr)
from  ww_customer_dl_secure.cust_scan c
group by individual_id;
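
To see how the two ratios differ, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for Hive here, which is an assumption made only so the example can be run locally. The unqualified table name cust_scan, the column aliases, and the 1.0 * factor are not part of the original queries; the factor is there because SQLite truncates integer division.

```python
import sqlite3

# Sample rows from the question: (individual_id, visit_date, cal_month_nbr).
rows = [
    (1, "01/01/2020", 1),
    (1, "01/02/2020", 1),
    (1, "01/01/2020", 1),
    (2, "02/01/2020", 2),
    (1, "02/01/2020", 2),
    (1, "03/01/2020", 3),
    (3, "03/01/2020", 3),
]

# In-memory SQLite table used as a stand-in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_scan ("
    "individual_id INTEGER, visit_date TEXT, cal_month_nbr INTEGER)"
)
conn.executemany("INSERT INTO cust_scan VALUES (?, ?, ?)", rows)

# First ratio counts every item row; second counts distinct visit dates.
ratio_sql = """
SELECT individual_id,
       1.0 * COUNT(*) / COUNT(DISTINCT cal_month_nbr) AS rows_per_month,
       1.0 * COUNT(DISTINCT visit_date) / COUNT(DISTINCT cal_month_nbr)
           AS visit_days_per_month
FROM cust_scan
GROUP BY individual_id
ORDER BY individual_id
"""
ratios = conn.execute(ratio_sql).fetchall()

for individual_id, rows_per_month, visit_days_per_month in ratios:
    print(individual_id, round(rows_per_month, 2), round(visit_days_per_month, 2))
```

On this data, id 1 has five rows but only four distinct visit dates across three months, so the first ratio is about 1.67 and the second about 1.33, which is the value the question expects.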

In practice, Hive can be inefficient at computing count(distinct), so a multi-level aggregation may be faster:

select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
      from (select distinct individual_id, visit_date, cal_month_nbr
            from ww_customer_dl_secure.cust_scan c
           ) iv 
      group by individual_id, cal_month_nbr
     ) ic
group by individual_id;
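
The two inner levels of this query also produce the per-month visit counts asked for in the first part of the question. A minimal sketch, using Python's sqlite3 with an in-memory SQLite table as a stand-in for Hive (an assumption for runnability; the unqualified table name cust_scan and the aliases visit_per_month and avg_freq_per_month are illustrative):

```python
import sqlite3

# Sample rows from the question: (individual_id, visit_date, cal_month_nbr).
rows = [
    (1, "01/01/2020", 1),
    (1, "01/02/2020", 1),
    (1, "01/01/2020", 1),
    (2, "02/01/2020", 2),
    (1, "02/01/2020", 2),
    (1, "03/01/2020", 3),
    (3, "03/01/2020", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_scan ("
    "individual_id INTEGER, visit_date TEXT, cal_month_nbr INTEGER)"
)
conn.executemany("INSERT INTO cust_scan VALUES (?, ?, ?)", rows)

# Level 1 (iv): collapse item rows to one row per customer, visit date, month.
# Level 2: count the visit days per customer and month.
per_month_core = """
SELECT individual_id, cal_month_nbr, COUNT(*) AS visit_per_month
FROM (SELECT DISTINCT individual_id, visit_date, cal_month_nbr
      FROM cust_scan) AS iv
GROUP BY individual_id, cal_month_nbr
"""
per_month = conn.execute(
    per_month_core + " ORDER BY individual_id, cal_month_nbr"
).fetchall()

# Level 3: average the monthly counts per customer.
avg_sql = f"""
SELECT individual_id, AVG(visit_per_month) AS avg_freq_per_month
FROM ({per_month_core}) AS ic
GROUP BY individual_id
ORDER BY individual_id
"""
averages = conn.execute(avg_sql).fetchall()

for row in per_month:
    print(row)
for individual_id, avg_freq in averages:
    print(individual_id, round(avg_freq, 2))
```

Running the second level on its own gives one row per customer and month, and averaging those rows gives the per-customer frequency, so both requested outputs come from the same nested query.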
Answered 2020-04-13T20:02:52.820