Example schema of the relevant table:
+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
I need to count distinct active users over a rolling time period (90 days) on a large dataset, and I'm running into problems because of the size of the dataset.
At first I tried using a window function, similar to the answer here: https://stackoverflow.com/a/27574474
WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      user_id
    FROM
      `fake-table`)
SELECT
  day,
  SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
  daily
GROUP BY
  1
ORDER BY
  1 DESC
However, this gets the distinct number of users for each day and then sums them - but the same distinct user can be counted again within the window if they show up on more than one day. So this isn't a truly accurate measure of distinct users over 90 days.
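Just to illustrate the over-counting with a minimal made-up example (rows based on the sample schema above, with one activity shifted to the next day): a user who is active on two days inside the window gets counted twice when the per-day distinct counts are summed.

-- Minimal illustration with made-up rows: fake_id_i24385787 is active on two days,
-- so the per-day distinct counts (2 and 1) sum to 3, while only 2 distinct users exist.
WITH sample AS (
  SELECT DATE '2017-02-22' AS day, 'fake_id_i24385787' AS user_id UNION ALL
  SELECT DATE '2017-02-22', 'fake_id_234885747' UNION ALL
  SELECT DATE '2017-02-23', 'fake_id_i24385787'
)
SELECT
  (SELECT SUM(c) FROM (SELECT COUNT(DISTINCT user_id) c FROM sample GROUP BY day) t) summed_daily_distinct,  -- 3
  (SELECT COUNT(DISTINCT user_id) FROM sample) true_distinct                                                 -- 2

Here summed_daily_distinct comes out to 3 while true_distinct is 2, which is exactly the kind of over-count described above.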
The next thing I tried was the solution from https://stackoverflow.com/a/47659590 - joining all the distinct user_ids of each window into one array and then counting the distinct values in it.
WITH daily AS (
SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
FROM `fake-table`
GROUP BY day
), temp2 AS (
SELECT
day,
STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
FROM daily
)
SELECT day,
(SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
order by 1 desc
However, this quickly ran out of memory for anything large.
Next up was using HLL sketches to represent the distinct IDs with much smaller values, so that memory becomes less of an issue. I thought my problem was solved, but I get an error when running the windowed query below. The error simply says "Function MERGE_PARTIAL is not supported". I also tried MERGE and got the same error. It only happens when using the window function; creating the sketch for each day's values works fine.
I read through the BigQuery Standard SQL documentation and didn't see anything about using HLL_COUNT.MERGE_PARTIAL or HLL_COUNT.MERGE as window functions. Presumably this should take the 90 sketches and combine them into a single HLL sketch representing the distinct values across the 90 original sketches?
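For comparison, merging the daily sketches with a plain GROUP BY aggregate is straightforward - something along these lines (grouped by calendar month purely as an illustration, not what I actually need) is the behavior I expected the windowed version to generalize:

-- Non-windowed merge for comparison: HLL_COUNT.MERGE combines the per-day
-- sketches and returns the approximate distinct user count per calendar month.
WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      HLL_COUNT.INIT(user_id) sketch
    FROM
      `fake-table`
    GROUP BY
      1)
SELECT
  DATE_TRUNC(day, MONTH) month,
  HLL_COUNT.MERGE(sketch) monthly_distinct_apprx
FROM
  daily
GROUP BY
  1

What I actually need is the rolling 90-day equivalent, which is where the window function comes in: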
WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      HLL_COUNT.INIT(user_id) sketch
    FROM
      `fake-table`
    GROUP BY
      1
    ORDER BY
      1 DESC),
  rolling AS (
    SELECT
      day,
      HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
    FROM
      daily)
SELECT
  day,
  HLL_COUNT.EXTRACT(rolling_sketch)
FROM
  rolling
ORDER BY
  1
Any ideas why this error happens or how to adjust the query?
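One direction I'm considering as a workaround (just a sketch, untested at scale) is to drop the window function entirely and instead do a range self-join of the daily sketches, merging them with a plain HLL_COUNT.MERGE aggregate - but I'd still like to understand whether the windowed merge is supposed to work:

-- Workaround sketch (untested at scale): for each day, join in the daily sketches of the
-- preceding 89 days and merge them with a regular aggregate instead of a window function.
WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      HLL_COUNT.INIT(user_id) sketch
    FROM
      `fake-table`
    GROUP BY
      1)
SELECT
  d1.day,
  HLL_COUNT.MERGE(d2.sketch) ninety_day_distinct_apprx
FROM
  daily d1
JOIN
  daily d2
ON
  d2.day BETWEEN DATE_SUB(d1.day, INTERVAL 89 DAY) AND d1.day
GROUP BY
  d1.day
ORDER BY
  d1.day DESC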