How about this query, using public data:
SELECT
  a.day, first_day, return_next_day,
  integer((return_next_day / first_day) * 100) percent
FROM (
  SELECT COUNT(DISTINCT actor, 50000) first_day,
    STRFTIME_UTC_USEC(
      UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
  FROM
    [publicdata:samples.github_timeline]
  GROUP BY day) a
JOIN (
  SELECT
    COUNT(*) return_next_day, day
  FROM (
    SELECT
      a.day day, a.actor, b.day, b.actor
    FROM (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        MAX(STRFTIME_UTC_USEC(86400000000 + UTC_USEC_TO_DAY(
          PARSE_UTC_USEC(created_at)), "%Y-%m-%d")) dayplus,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) a
    JOIN EACH (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) b
    ON a.actor = b.actor
      AND a.dayplus = b.day
  )
  GROUP BY day) b
ON a.day = b.day
This gives me the results I was looking for:
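Note that the query uses the expression STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day
several times to convert the source string data into dates. If I owned the data, I would run an ETL step over the table beforehand so that this conversion does not have to be repeated.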
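As a rough illustration of that ETL idea, here is a minimal sketch that derives the day once and stores one row per actor and day. It uses Python's built-in sqlite3 module rather than BigQuery, and the table names (events, actor_days), the sample rows, and the "YYYY-MM-DD HH:MM:SS" timestamp format are all assumptions made for the example, not properties of the real github_timeline table.

```python
import sqlite3

# Hypothetical toy rows standing in for github_timeline: (actor, created_at).
# The "YYYY-MM-DD HH:MM:SS" timestamp format is an assumption of this sketch.
RAW_EVENTS = [
    ("alice", "2012-03-10 09:15:00"),
    ("alice", "2012-03-10 17:40:00"),
    ("alice", "2012-03-11 08:05:00"),
    ("bob", "2012-03-10 23:59:59"),
]


def build_daily_table(conn, rows):
    """One-off ETL step: derive the day once, keep one row per (actor, day)."""
    conn.execute("CREATE TABLE events (actor TEXT, created_at TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    # SQLite's date() truncates the timestamp string to "YYYY-MM-DD".
    conn.execute(
        "CREATE TABLE actor_days AS "
        "SELECT DISTINCT actor, date(created_at) AS day FROM events"
    )


conn = sqlite3.connect(":memory:")
build_daily_table(conn, RAW_EVENTS)
actor_days = conn.execute(
    "SELECT actor, day FROM actor_days ORDER BY actor, day"
).fetchall()
print(actor_days)
```

Later queries can then group and join on the stored day column directly instead of re-parsing created_at in every subquery.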
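The query joins 2 tables: the first (a) counts the distinct actors seen on each day, and the second (b) counts the actors who were active again on the following day.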
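The shape of that two-table join can be tried on a few rows. The sketch below is an approximation in SQLite via Python's sqlite3 module, not the BigQuery query itself: the actor_days table and its sample rows are hypothetical, and SQLite's date(day, '+1 day') stands in for the dayplus calculation.

```python
import sqlite3

# Hypothetical pre-aggregated rows: one (actor, day) pair per active day.
ACTOR_DAYS = [
    ("alice", "2012-03-10"),
    ("bob", "2012-03-10"),
    ("alice", "2012-03-11"),
    ("carol", "2012-03-11"),
    ("carol", "2012-03-12"),
]

RETENTION_SQL = """
SELECT a.day, a.first_day, b.return_next_day,
       CAST(b.return_next_day * 100.0 / a.first_day AS INTEGER) AS percent
FROM (
    -- Table a: distinct actors seen on each day.
    SELECT day, COUNT(DISTINCT actor) AS first_day
    FROM actor_days
    GROUP BY day
) AS a
JOIN (
    -- Table b: actors who were also active on the following day.
    SELECT t.day AS day, COUNT(*) AS return_next_day
    FROM actor_days AS t
    JOIN actor_days AS n
      ON n.actor = t.actor AND n.day = date(t.day, '+1 day')
    GROUP BY t.day
) AS b ON a.day = b.day
ORDER BY a.day
"""


def next_day_retention(rows):
    """Load the toy rows into an in-memory table and run the retention join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE actor_days (actor TEXT, day TEXT)")
    conn.executemany("INSERT INTO actor_days VALUES (?, ?)", rows)
    return conn.execute(RETENTION_SQL).fetchall()


retention = next_day_retention(ACTOR_DAYS)
print(retention)
```

Because both joins are inner joins, a day on which nobody returned the next day (2012-03-12 in this sample) does not appear in the output at all; a LEFT JOIN from a to b would keep such days.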
There are other ways to do this; this is just one of many. The query could also be optimized further.