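You can actually do this with SQLite window functions (available since SQLite 3.25.0), which take care of the "iterating" part that Ray Kiddy described.
Assuming your timestamps are plain Unix timestamps, this computes the inactivity time between consecutive page views of each visitor: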
SELECT
    utc_time,
    visitor_id,
    -- The window function: evaluates the expression for the preceding row of the current partition
    LAG(utc_time) OVER (
        -- The window definition: partitions all rows by visitor_id and orders each partition's rows by timestamp
        PARTITION BY visitor_id
        ORDER BY
            utc_time
    ) -- Subtract the utc_time of the current row from the utc_time of the preceding row to get the time between rows
    - utc_time AS inactivity_time -- zero or negative; NULL for a visitor's first page view
FROM page_view
ORDER BY
    visitor_id,
    utc_time;
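To see what this query produces, here is a minimal sketch that runs it against an in-memory database through Python's standard `sqlite3` module. The sample rows and the `path` column values are made up for illustration, and the sketch assumes the SQLite library bundled with your Python is version 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: (visitor_id, utc_time in Unix seconds, path)
rows = [
    ("a", 1000, "/"),
    ("a", 1600, "/docs"),
    ("a", 5000, "/"),
    ("b", 2000, "/pricing"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_view (visitor_id TEXT, utc_time INTEGER, path TEXT)")
conn.executemany("INSERT INTO page_view VALUES (?, ?, ?)", rows)

# Same logic as the query above: preceding timestamp minus current timestamp
inactivity = conn.execute("""
    SELECT
        visitor_id,
        utc_time,
        LAG(utc_time) OVER (
            PARTITION BY visitor_id
            ORDER BY utc_time
        ) - utc_time AS inactivity_time
    FROM page_view
    ORDER BY visitor_id, utc_time
""").fetchall()

for row in inactivity:
    print(row)
# ('a', 1000, None)
# ('a', 1600, -600)
# ('a', 5000, -3400)
# ('b', 2000, None)
```

The first page view of each visitor has no preceding row, so its inactivity time is NULL (`None` in Python). All other values are negative because the current timestamp is subtracted from the earlier one.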
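The result of the query above can then be used in a follow-up query to actually assign session IDs. Take the rows whose inactivity time exceeds the desired threshold, or is NULL for a visitor's first session, and use another window function (row_number) to identify each session uniquely, along with its start time and the start time of the next session: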
SELECT
    -- Calculate the session ID from the visitor and the consecutive row number (only session starts are handled here)
    page_view.visitor_id || '-' || row_number() OVER (
        PARTITION BY page_view.visitor_id
        ORDER BY
            page_view.utc_time
    ) AS session_id,
    page_view.visitor_id,
    page_view.utc_time AS session_start_at,
    lead(page_view.utc_time) OVER (
        PARTITION BY page_view.visitor_id
        ORDER BY
            page_view.utc_time
    ) AS next_session_start_at
FROM (...) AS page_view
WHERE
    -- Keep page views with an inactivity time greater than 30 minutes; these are session starts
    ABS(page_view.inactivity_time) > 30 * 60
    OR page_view.inactivity_time IS NULL;
Given that, you may want to store the result in a temporary table to keep things readable.
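One way to do that is sketched below, again using Python's `sqlite3` module as a harness. It assumes that the elided `(...)` subquery is the inactivity query from the first step (aliased `gaps` here, a name chosen for this sketch) and that the temporary table is called `session`. The sample data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_view (visitor_id TEXT, utc_time INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO page_view VALUES (?, ?, ?)",
    # Hypothetical sample data
    [("a", 1000, "/"), ("a", 1600, "/docs"), ("a", 5000, "/"), ("b", 2000, "/pricing")],
)

# Materialize the session starts; the inner query is the inactivity query
# (assumed here to be what the "(...)" placeholder stands for).
conn.execute("""
    CREATE TEMP TABLE session AS
    SELECT
        gaps.visitor_id || '-' || row_number() OVER (
            PARTITION BY gaps.visitor_id ORDER BY gaps.utc_time
        ) AS session_id,
        gaps.visitor_id,
        gaps.utc_time AS session_start_at,
        lead(gaps.utc_time) OVER (
            PARTITION BY gaps.visitor_id ORDER BY gaps.utc_time
        ) AS next_session_start_at
    FROM (
        SELECT
            visitor_id,
            utc_time,
            LAG(utc_time) OVER (
                PARTITION BY visitor_id ORDER BY utc_time
            ) - utc_time AS inactivity_time
        FROM page_view
    ) AS gaps
    WHERE ABS(gaps.inactivity_time) > 30 * 60
       OR gaps.inactivity_time IS NULL
""")

sessions = conn.execute(
    "SELECT * FROM session ORDER BY visitor_id, session_start_at"
).fetchall()
for row in sessions:
    print(row)
# ('a-1', 'a', 1000, 5000)
# ('a-2', 'a', 5000, None)
# ('b-1', 'b', 2000, None)
```

The page view at 1600 follows the previous one by only 600 seconds, so it is filtered out and stays part of session `a-1`. Because the WHERE clause is applied before the window functions, `row_number` and `lead` only see session starts.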
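Assuming the result is stored in a table named "session", you can finally compute some useful statistics by joining the page views with their corresponding sessions: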
SELECT
    session_id,
    -- Calculate the session duration
    ABS(
        MIN(page_view.utc_time) - MAX(page_view.utc_time)
    ) AS duration,
    -- Count the distinct paths per session
    COUNT(DISTINCT page_view.path) AS distinct_paths
FROM session
LEFT JOIN page_view ON page_view.visitor_id = session.visitor_id
    AND page_view.utc_time >= session.session_start_at
    AND (
        page_view.utc_time < session.next_session_start_at
        OR session.next_session_start_at IS NULL
    )
GROUP BY
    1;
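As a quick check of the join logic, the following sketch runs the statistics query against hand-written tables through Python's `sqlite3` module. The contents of both `page_view` and `session` are hypothetical, and the `distinct_paths` alias is an addition for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_view (visitor_id TEXT, utc_time INTEGER, path TEXT);
    INSERT INTO page_view VALUES
        ('a', 1000, '/'),
        ('a', 1600, '/docs'),
        ('a', 5000, '/'),
        ('b', 2000, '/pricing');

    -- Hypothetical session rows, as the session-start query would produce them
    CREATE TABLE session (
        session_id TEXT,
        visitor_id TEXT,
        session_start_at INTEGER,
        next_session_start_at INTEGER
    );
    INSERT INTO session VALUES
        ('a-1', 'a', 1000, 5000),
        ('a-2', 'a', 5000, NULL),
        ('b-1', 'b', 2000, NULL);
""")

# Each page view is matched to the session whose time range contains it
stats = conn.execute("""
    SELECT
        session_id,
        ABS(MIN(page_view.utc_time) - MAX(page_view.utc_time)) AS duration,
        COUNT(DISTINCT page_view.path) AS distinct_paths
    FROM session
    LEFT JOIN page_view ON page_view.visitor_id = session.visitor_id
        AND page_view.utc_time >= session.session_start_at
        AND (
            page_view.utc_time < session.next_session_start_at
            OR session.next_session_start_at IS NULL
        )
    GROUP BY session_id
    ORDER BY session_id
""").fetchall()

for row in stats:
    print(row)
# ('a-1', 600, 2)
# ('a-2', 0, 1)
# ('b-1', 0, 1)
```

A session with a single page view has a duration of 0, since its first and last timestamps coincide.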
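I recommend starting with the first query and working your way up; that is what helped me understand what is going on.
Most of the queries listed here come from this blog post; I adjusted them slightly to work in SQLite.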