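I want to know the average number of distinct users per day on our platform. The constraint is that I have to implement this in a BI tool (Looker) that generates the SQL for BigQuery, so I can only put custom SQL into the SELECT statement and cannot write arbitrary queries.

I found a solution that works for small amounts of data, but when I scale it up, the hard 100MB limit on arrays raises an error.

The concatenation and split are there to reduce the array size. I first used STRUCT(id, date), but you cannot use DISTINCT with a STRUCT.

The size problem is not caused by the function; I cannot even use just ARRAY_AGG(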

CREATE TEMP FUNCTION trend_daily_avg(columns_arr ARRAY<STRING>) AS ((
    SELECT AVG(value)
    FROM (
        SELECT
            COUNT(DISTINCT columns_arr.value_column) as value,
        FROM (
            SELECT
                SPLIT(concatstring, " ")[SAFE_OFFSET(1)] as value_column,
                SPLIT(concatstring, " ")[SAFE_OFFSET(0)] as time_column,
            FROM UNNEST(columns_arr) concatstring
        ) columns_arr
        GROUP BY columns_arr.time_column
    )
));
WITH dummy_data as (
    SELECT "10-10-2021" as view_date, 0001 as full_visitor_id, "group-1" as hostname UNION ALL 
    SELECT "10-10-2021" as view_date, 0002 as full_visitor_id, "group-1" as hostname UNION ALL 
    SELECT "10-10-2021" as view_date, 0001 as full_visitor_id, "group-1" as hostname UNION ALL 
    SELECT "11-10-2021" as view_date, 0002 as full_visitor_id, "group-2" as hostname UNION ALL 
    SELECT "11-10-2021" as view_date, 0003 as full_visitor_id, "group-2" as hostname UNION ALL 
    SELECT "11-10-2021" as view_date, 0001 as full_visitor_id, "group-1" as hostname UNION ALL 
    SELECT "12-10-2021" as view_date, 0002 as full_visitor_id, "group-2" as hostname UNION ALL 
    SELECT "12-10-2021" as view_date, 0002 as full_visitor_id, "group-2" as hostname
)
SELECT
    hostname,
    count(distinct full_visitor_id) as users_dedup,
    trend_daily_avg(ARRAY_AGG( DISTINCT
        CONCAT(view_date, " ", full_visitor_id) IGNORE NULLS
    )) as average_trend, # This works for a small amount of data but not in production
    ARRAY_AGG( DISTINCT
        CONCAT(view_date, " ", full_visitor_id) IGNORE NULLS
    ) as average_trend_array, # This also doesn't work; the expression above already fails at this step
FROM ( # Subselect cannot be touched as it cannot be integrated into the BI tool
    SELECT
        view_date,
        full_visitor_id,
        hostname, # More dimensions get dynamically added and then grouped
    FROM dummy_data
)
GROUP BY hostname;

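Can I somehow increase the maximum row size in BigQuery, or rewrite the query so that it does not have to build huge arrays?

Edit: One workable solution is to add each day (for daily granularity) or each month (for monthly granularity) to an array separately. This is definitely not an ideal solution and it is very inefficient, but it works. Is there a way to make it more efficient? One month of dates and IDs is about 30GB of data, and having roughly 1000 subqueries for a three-year range is really bad.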

CREATE TEMP FUNCTION avg_array(arr ANY TYPE) AS ((
    SELECT 
        AVG(val) 
    FROM(
        SELECT val 
        FROM UNNEST(arr) val 
        where val > 0
    )
)
);
 
select
count(distinct id) as users_dedup,
avg_array([
    count(distinct case when day = '2021-01-01' then id else null end),
    count(distinct case when day = '2021-02-01' then id else null end),
    count(distinct case when day = '2021-03-01' then id else null end),
    count(distinct case when day = '2021-04-01' then id else null end),
    count(distinct case when day = '2021-05-01' then id else null end),
    count(distinct case when day = '2021-06-01' then id else null end),
    count(distinct case when day = '2021-07-01' then id else null end),
    count(distinct case when day = '2021-08-01' then id else null end),
    count(distinct case when day = '2021-09-01' then id else null end),
    count(distinct case when day = '2021-10-01' then id else null end),
    count(distinct case when day = '2021-11-01' then id else null end),
    count(distinct case when day = '2021-12-01' then id else null end)
]) as avg_monthly_users 

from (
select '123' as id, '2021-01-01' as day
union all
select '456' as id, '2021-02-01' as day
union all
select '123' as id, '2021-03-01' as day
)

1 Answer

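The large array causes the query to hit the maximum size limit for a single row. I tried computing the average number of daily distinct users without using arrays, and it works fine.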

WITH dummy_data as (
   SELECT "10-10-2021" as view_date, 0001 as full_visitor_id, "group-1" as hostname UNION ALL
   SELECT "10-10-2021" as view_date, 0002 as full_visitor_id, "group-1" as hostname UNION ALL
   SELECT "10-10-2021" as view_date, 0001 as full_visitor_id, "group-1" as hostname UNION ALL
   SELECT "11-10-2021" as view_date, 0002 as full_visitor_id, "group-2" as hostname UNION ALL
   SELECT "11-10-2021" as view_date, 0003 as full_visitor_id, "group-2" as hostname UNION ALL
   SELECT "11-10-2021" as view_date, 0001 as full_visitor_id, "group-1" as hostname UNION ALL
   SELECT "12-10-2021" as view_date, 0002 as full_visitor_id, "group-2" as hostname UNION ALL
   SELECT "12-10-2021" as view_date, 0002 as full_visitor_id, "group-2" as hostname
)
SELECT
    hostname,
    AVG(c) AS avg_users
FROM (
    SELECT
        hostname,
        view_date,
        COUNT(DISTINCT full_visitor_id) AS c
    FROM ( # Subselect cannot be touched as it cannot be integrated into the BI tool
        SELECT
            view_date,
            full_visitor_id,
            hostname, # More dimensions get dynamically added and then grouped
        FROM dummy_data
    )
    GROUP BY hostname, view_date
)
GROUP BY hostname

Output:

hostname   avg_users
group-1    1.5
group-2    1.5
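
If an extra GROUP BY level cannot be expressed in the BI tool, the same average can be written as a single aggregate expression. This is a sketch that is not part of the original answer. It relies on the fact that the sum of the per-day distinct user counts equals the number of distinct (day, user) pairs, so dividing that number by the number of distinct days gives the daily average without building an array. The alias avg_daily_users is illustrative, and whether Looker accepts the expression as a custom measure is an assumption that has not been checked here.

```sql
# Reduced copy of the dummy data from the question.
WITH dummy_data AS (
    SELECT "10-10-2021" AS view_date, 1 AS full_visitor_id, "group-1" AS hostname UNION ALL
    SELECT "10-10-2021", 2, "group-1" UNION ALL
    SELECT "10-10-2021", 1, "group-1" UNION ALL
    SELECT "11-10-2021", 2, "group-2" UNION ALL
    SELECT "11-10-2021", 3, "group-2" UNION ALL
    SELECT "11-10-2021", 1, "group-1" UNION ALL
    SELECT "12-10-2021", 2, "group-2" UNION ALL
    SELECT "12-10-2021", 2, "group-2"
)
SELECT
    hostname,
    COUNT(DISTINCT full_visitor_id) AS users_dedup,
    SAFE_DIVIDE(
        # Distinct (day, visitor) pairs: the sum of the daily distinct counts.
        COUNT(DISTINCT CONCAT(view_date, " ", CAST(full_visitor_id AS STRING))),
        # Days that have at least one non-NULL visitor.
        COUNT(DISTINCT IF(full_visitor_id IS NOT NULL, view_date, NULL))
    ) AS avg_daily_users # illustrative alias, not from the original post
FROM dummy_data
GROUP BY hostname;
```

On the dummy data this gives 1.5 for both hostnames, the same as averaging the per-day counts (3 distinct pairs over 2 days in each group). Days without any visitor are not counted in the denominator, which is also how the subquery version behaves. If an approximate result is acceptable at production scale, APPROX_COUNT_DISTINCT could probably replace the numerator's COUNT(DISTINCT ...) to reduce resource usage.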

Answered 2021-10-16T10:41:14.047