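I want to know the average number of distinct daily users on our platform. The constraint is that I have to implement this in our BI tool (Looker), which generates the SQL for BigQuery, so I can only put some custom SQL code into the select statement and cannot write arbitrary queries.
I found a solution that works for small amounts of data, but when I scale it up, the hard 100MB limit on arrays raises an error.
The concatenation and splitting are there to reduce the array size. I first used STRUCT(id, date), but you cannot use DISTINCT with a STRUCT.
The size problem does not originate inside the function: even a bare ARRAY_AGG(...) call on its own fails.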
CREATE TEMP FUNCTION trend_daily_avg(columns_arr ARRAY<STRING>) AS ((
  # Average of the per-day distinct user counts
  SELECT AVG(value)
  FROM (
    SELECT
      COUNT(DISTINCT pairs.value_column) AS value
    FROM (
      # Each array element is "<date> <id>"; split it back into its two parts
      SELECT
        SPLIT(concatstring, " ")[SAFE_OFFSET(1)] AS value_column,
        SPLIT(concatstring, " ")[SAFE_OFFSET(0)] AS time_column
      FROM UNNEST(columns_arr) concatstring
    ) pairs
    GROUP BY pairs.time_column
  )
));
WITH dummy_data AS (
  SELECT "10-10-2021" AS view_date, 0001 AS full_visitor_id, "group-1" AS hostname UNION ALL
  SELECT "10-10-2021" AS view_date, 0002 AS full_visitor_id, "group-1" AS hostname UNION ALL
  SELECT "10-10-2021" AS view_date, 0001 AS full_visitor_id, "group-1" AS hostname UNION ALL
  SELECT "11-10-2021" AS view_date, 0002 AS full_visitor_id, "group-2" AS hostname UNION ALL
  SELECT "11-10-2021" AS view_date, 0003 AS full_visitor_id, "group-2" AS hostname UNION ALL
  SELECT "11-10-2021" AS view_date, 0001 AS full_visitor_id, "group-1" AS hostname UNION ALL
  SELECT "12-10-2021" AS view_date, 0002 AS full_visitor_id, "group-2" AS hostname UNION ALL
  SELECT "12-10-2021" AS view_date, 0002 AS full_visitor_id, "group-2" AS hostname
)
SELECT
  hostname,
  COUNT(DISTINCT full_visitor_id) AS users_dedup,
  trend_daily_avg(ARRAY_AGG(DISTINCT
    CONCAT(view_date, " ", full_visitor_id) IGNORE NULLS
  )) AS average_trend, # This works for a small amount of data but not in production
  ARRAY_AGG(DISTINCT
    CONCAT(view_date, " ", full_visitor_id) IGNORE NULLS
  ) AS average_trend_array # This also doesn't work; the expression above fails at this part
FROM ( # Subselect cannot be touched as it cannot be integrated into the BI tool
  SELECT
    view_date,
    full_visitor_id,
    hostname # More dimensions get dynamically added and then grouped
  FROM dummy_data
)
GROUP BY hostname;
Can I somehow increase the maximum row size in BigQuery, or rewrite the query so that it does not have to build huge arrays?
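For what it's worth, one possible way to avoid the array is to note that the sum of the per-day distinct user counts equals the number of distinct (date, user) pairs, so the average can be expressed with two scalar aggregates. The following is only a sketch under some assumptions: that the expression is allowed in the Looker select list in place of the trend_daily_avg(...) column, that view_date and full_visitor_id are never NULL, and that days without any rows for a hostname should not count toward the average. The alias avg_daily_users and the compact form of the dummy data are made up for illustration.

```sql
# Same dummy rows as in the question, written compactly as an array of structs.
WITH dummy_data AS (
  SELECT * FROM UNNEST([
    STRUCT("10-10-2021" AS view_date, 1 AS full_visitor_id, "group-1" AS hostname),
    ("10-10-2021", 2, "group-1"),
    ("10-10-2021", 1, "group-1"),
    ("11-10-2021", 2, "group-2"),
    ("11-10-2021", 3, "group-2"),
    ("11-10-2021", 1, "group-1"),
    ("12-10-2021", 2, "group-2"),
    ("12-10-2021", 2, "group-2")
  ])
)
SELECT
  hostname,
  COUNT(DISTINCT full_visitor_id) AS users_dedup,
  # Numerator: distinct (date, user) pairs = sum over days of daily distinct users.
  # Denominator: number of days with at least one row in this group.
  SAFE_DIVIDE(
    COUNT(DISTINCT CONCAT(view_date, " ", CAST(full_visitor_id AS STRING))),
    COUNT(DISTINCT view_date)
  ) AS avg_daily_users # hypothetical alias
FROM dummy_data
GROUP BY hostname;
```

On the dummy data this gives 1.5 for both hostnames (3 distinct pairs over 2 days each), which matches what trend_daily_avg returns there. Whether it scales to the production volume is untested.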
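Edit: One workable solution is to add each day (for daily granularity) or each month (for monthly granularity) to an array separately. This is definitely not an ideal solution and it is inefficient, but it works. Is there a way to make this more efficient? One month of dates and IDs is about 30GB of data, and having roughly 1,000 subqueries over a three-year range is pretty bad.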
CREATE TEMP FUNCTION avg_array(arr ANY TYPE) AS ((
  SELECT
    AVG(val)
  FROM (
    SELECT val
    FROM UNNEST(arr) val
    WHERE val > 0
  )
));
SELECT
  COUNT(DISTINCT id) AS users_dedup,
  avg_array([
    COUNT(DISTINCT CASE WHEN day = '2021-01-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-02-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-03-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-04-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-05-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-06-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-07-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-08-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-09-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-10-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-11-01' THEN id ELSE NULL END),
    COUNT(DISTINCT CASE WHEN day = '2021-12-01' THEN id ELSE NULL END)
  ]) AS avg_monthly_users
FROM (
  SELECT '123' AS id, '2021-01-01' AS day
  UNION ALL
  SELECT '456' AS id, '2021-02-01' AS day
  UNION ALL
  SELECT '123' AS id, '2021-03-01' AS day
)
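A possibly cheaper variant of the monthly workaround, sketched under assumptions: since avg_array skips periods with a zero count, the same number should come out of dividing the distinct (month, id) pairs by the distinct months that have data, with no per-period CASE expressions. This assumes day is a 'YYYY-MM-DD' string so that SUBSTR(day, 1, 7) yields the month, and it assigns every day to its month, whereas the CASE list above only matches the first day of each month. The CTE name events is made up for illustration.

```sql
WITH events AS ( # hypothetical name for the inline sample rows
  SELECT '123' AS id, '2021-01-01' AS day UNION ALL
  SELECT '456' AS id, '2021-02-01' AS day UNION ALL
  SELECT '123' AS id, '2021-03-01' AS day
)
SELECT
  COUNT(DISTINCT id) AS users_dedup,
  # Distinct (month, id) pairs divided by the number of months that have rows.
  SAFE_DIVIDE(
    COUNT(DISTINCT CONCAT(SUBSTR(day, 1, 7), ' ', id)),
    COUNT(DISTINCT SUBSTR(day, 1, 7))
  ) AS avg_monthly_users
FROM events;
```

For the three sample rows this yields users_dedup = 2 and avg_monthly_users = 1.0, the same as the avg_array version. If an approximate answer is acceptable, swapping the numerator for APPROX_COUNT_DISTINCT might reduce the cost further, though I have not measured that.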