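When slicing and aggregating data (by time or by anything else), the star schema (Kimball star) is a fairly simple yet powerful solution. Suppose that for each click we store the time (to second resolution), the user's info, the button ID, and the user's location. To enable easy slicing and dicing, I'll start with pre-loaded lookup tables for the properties of objects that rarely change, the so-called dimension tables in the DW world.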
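The dimDate table has one row per day, with a number of attributes (fields) describing that specific day. The table can be pre-loaded for years in advance, and should be updated once per day if it contains fields like DaysAgo, WeeksAgo, MonthsAgo, YearsAgo; otherwise it can be "load and forget". dimDate allows for easy slicing by date attributes, like
WHERE [YEAR] = 2009 AND DayOfWeek = 'Sunday'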
For ten years of data, the table has only about 3,650 rows.
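As a rough sketch of what such a table and its daily refresh could look like: the column types and the integer yyyymmdd format of DateKey below are my assumptions, and only the columns used elsewhere in this answer are listed.

```sql
-- Sketch only: types and the yyyymmdd DateKey format are assumptions.
CREATE TABLE dimDate (
    DateKey     int         NOT NULL PRIMARY KEY,  -- e.g. 20090104
    FullDate    date        NOT NULL UNIQUE,
    [Year]      smallint    NOT NULL,
    [DayOfWeek] varchar(10) NOT NULL,              -- 'Sunday', 'Monday', ...
    DaysAgo     int         NOT NULL,
    WeeksAgo    int         NOT NULL,
    MonthsAgo   int         NOT NULL,
    YearsAgo    int         NOT NULL
);

-- Daily refresh of the relative-date fields, run once shortly after midnight.
-- DATEDIFF counts calendar boundaries crossed, so WeeksAgo, MonthsAgo and
-- YearsAgo refer to calendar weeks, months and years, not elapsed time.
DECLARE @today date = CAST(GETDATE() AS date);

UPDATE dimDate
   SET DaysAgo   = DATEDIFF(day,   FullDate, @today)
      ,WeeksAgo  = DATEDIFF(week,  FullDate, @today)
      ,MonthsAgo = DATEDIFF(month, FullDate, @today)
      ,YearsAgo  = DATEDIFF(year,  FullDate, @today);
```

Rows for pre-loaded future dates simply get negative values until their day arrives.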
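The dimGeography table is pre-loaded with the geographic regions of interest; the number of rows depends on the "geographic resolution" required in reports. It allows for data slicing like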
WHERE Continent = 'South America'
Once loaded, it rarely changes.
For each button of the site, there is one row in the dimButton table, so a query may have
WHERE PageURL = 'http://…/somepage.php'
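The dimUser table has one row per registered user. It should be loaded with new user info as soon as the user registers, or at least the new user info should be in the table before any other transaction for that user is recorded in the fact tables.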
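The three dimension tables just described might be declared along these lines. This is a minimal sketch: the column types and lengths are assumptions, and only the attributes that appear in this answer's queries are listed; real tables would carry more descriptive fields.

```sql
-- Sketch only: types and lengths are assumptions.
CREATE TABLE dimGeography (
    GeographyKey int         NOT NULL PRIMARY KEY,
    Continent    varchar(30) NOT NULL
    -- further attributes, depending on the required "geographic resolution"
);

CREATE TABLE dimButton (
    ButtonKey  int          NOT NULL PRIMARY KEY,
    ButtonName varchar(100) NOT NULL,
    PageURL    varchar(400) NOT NULL
);

CREATE TABLE dimUser (
    UserKey int          NOT NULL PRIMARY KEY,
    Email   varchar(254) NOT NULL
    -- further user attributes
);
```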
To record button clicks, I'll add the factClick table.
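The factClick table has one row for each click of a button by a specific user at a point in time. I have used TimeStamp (second resolution), ButtonKey and UserKey in a composite primary key to filter out clicks faster than one per second from a specific user. Note the Hour field: it contains the hour part of the TimeStamp, an integer in the range 0-23, to allow for easy slicing per hour, like
WHERE [HOUR] BETWEEN 7 AND 9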
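A possible declaration of the fact table, as a sketch: the column types are assumptions, and foreign keys to the dimension tables are left out to keep it short. A plain primary key rejects a duplicate row with an error; the IGNORE_DUP_KEY option shown here is one way (my suggestion, SQL Server specific) to have duplicates silently discarded with a warning instead.

```sql
-- Sketch only: types are assumptions; foreign keys to the dim tables omitted.
CREATE TABLE factClick (
    DateKey      int          NOT NULL,
    [Hour]       tinyint      NOT NULL CHECK ([Hour] BETWEEN 0 AND 23),
    [TimeStamp]  datetime2(0) NOT NULL,   -- second resolution
    ButtonKey    int          NOT NULL,
    UserKey      int          NOT NULL,
    GeographyKey int          NOT NULL,
    -- At most one click per user, per button, per second.
    CONSTRAINT PK_factClick
        PRIMARY KEY ([TimeStamp], ButtonKey, UserKey)
        WITH (IGNORE_DUP_KEY = ON)        -- drop duplicates instead of failing
);
```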
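So, now we have to consider:
- How to load the table? Periodically (maybe every hour or every few minutes) from the weblog using an ETL tool, or with a low-latency solution using some kind of event-streaming process.
- How long to keep the information in the table?
Regardless of whether the table keeps information for a day only or for a few years, it should be partitioned; ConcernedOfTunbridgeW has explained partitioning in his answer, so I'll skip it here.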
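For the periodic-load option, one batch could look roughly like the sketch below. The staging table stgClick and its columns (ClickTime, ButtonKey, UserKey, GeographyKey) are hypothetical; I assume the ETL tool has already parsed the weblog, truncated ClickTime to the second, and resolved the dimension keys.

```sql
-- Sketch only: stgClick and its columns are hypothetical staging objects.
INSERT INTO factClick (DateKey, [Hour], [TimeStamp], ButtonKey, UserKey, GeographyKey)
SELECT  d.DateKey
       ,DATEPART(hour, s.ClickTime)        -- 0-23
       ,s.ClickTime
       ,s.ButtonKey
       ,s.UserKey
       ,MIN(s.GeographyKey)                -- collapse repeats within the batch
  FROM stgClick AS s
  JOIN dimDate  AS d ON d.FullDate = CAST(s.ClickTime AS date)
 WHERE NOT EXISTS (                        -- skip clicks loaded by an earlier batch
         SELECT 1
           FROM factClick AS f
          WHERE f.[TimeStamp] = s.ClickTime
            AND f.ButtonKey   = s.ButtonKey
            AND f.UserKey     = s.UserKey)
 GROUP BY d.DateKey, s.ClickTime, s.ButtonKey, s.UserKey;
```

The GROUP BY and NOT EXISTS keep the batch within the one-click-per-second rule of the primary key, so the load does not depend on how the key handles duplicates.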
Now, a few examples of slicing and dicing by different attributes (including day and hour).
To simplify queries, I'll add a view to flatten the model:
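/* To simplify queries, flatten the model.
   SELECT * over all joined tables would repeat the key columns,
   which a view does not allow, so the dimension attributes are
   listed explicitly; add further attributes as needed. */
CREATE VIEW vClicks
AS
SELECT f.*
      ,d.FullDate, d.[Year], d.[DayOfWeek], d.DaysAgo
      ,b.ButtonName, b.PageURL
      ,u.Email
      ,g.Continent
FROM factClick AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimButton AS b ON b.ButtonKey = f.ButtonKey
JOIN dimUser AS u ON u.UserKey = f.UserKey
JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey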
Query example
/*
Count number of times specific users clicked any button
today between 7 and 9 AM (7:00 - 9:59)
*/
SELECT [Email]
,COUNT(*) AS [Counter]
FROM vClicks
WHERE [DaysAgo] = 0
AND [Hour] BETWEEN 7 AND 9
AND [Email] IN ('dude45@somemail.com', 'bob46@bobmail.com')
GROUP BY [Email]
ORDER BY [Email]
Suppose that I am interested in data for User = ALL. dimUser is a large table, so I'll create a view without it to speed up queries.
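/*
Because dimUser can be a large table, it is good
to have a view without it, to speed up queries
when user info is not required.
Dimension attributes are listed explicitly because SELECT *
would repeat the key columns, which a view does not allow.
*/
CREATE VIEW vClicksNoUsr
AS
SELECT f.*
      ,d.FullDate, d.[Year], d.[DayOfWeek], d.DaysAgo
      ,b.ButtonName, b.PageURL
      ,g.Continent
FROM factClick AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimButton AS b ON b.ButtonKey = f.ButtonKey
JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey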
Query example
/*
Count number of times a button was clicked on a specific page
today and yesterday, for each hour.
*/
SELECT [FullDate]
,[Hour]
,COUNT(*) AS [Counter]
FROM vClicksNoUsr
WHERE [DaysAgo] IN ( 0, 1 )
AND PageURL = 'http://...MyPage'
GROUP BY [FullDate], [Hour]
ORDER BY [FullDate] DESC, [Hour] DESC
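Suppose that for aggregations we do not need to keep specific user info, but are only interested in date, hour, button and geography. Each row in the factClickAgg table has a counter for each hour in which a specific button was clicked from a specific geographic area.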
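A sketch of how this aggregate table might be declared; the column types are assumptions, and the primary key simply restates the grain described above (one row per date, hour, button and geography).

```sql
-- Sketch only: types are assumptions; foreign keys to the dim tables omitted.
CREATE TABLE factClickAgg (
    DateKey      int     NOT NULL,
    [Hour]       tinyint NOT NULL CHECK ([Hour] BETWEEN 0 AND 23),
    ButtonKey    int     NOT NULL,
    GeographyKey int     NOT NULL,
    ClickCount   int     NOT NULL,
    CONSTRAINT PK_factClickAgg
        PRIMARY KEY (DateKey, [Hour], ButtonKey, GeographyKey)
);
```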
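The factClickAgg table can be loaded hourly, or even at the end of each day, depending on the requirements for reporting and analytics. For example, let's say that the table is loaded at the end of each day (after midnight); I can use something like: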
/* At the end of each day (after midnight) aggregate data. */
INSERT INTO factClickAgg (DateKey, [Hour], ButtonKey, GeographyKey, ClickCount)
SELECT DateKey
,[Hour]
,ButtonKey
,GeographyKey
,COUNT(*) AS [ClickCount]
FROM vClicksNoUsr
WHERE [DaysAgo] = 1
GROUP BY DateKey
,[Hour]
,ButtonKey
,GeographyKey
To simplify queries, I'll create a view to flatten the model:
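/* To simplify queries for aggregated data.
   Dimension attributes are listed explicitly because SELECT * would
   repeat the key columns, which a view does not allow. */
CREATE VIEW vClicksAggregate
AS
SELECT f.*
      ,d.FullDate, d.[Year], d.[DayOfWeek], d.DaysAgo
      ,b.ButtonName, b.PageURL
      ,g.Continent
FROM factClickAgg AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimButton AS b ON b.ButtonKey = f.ButtonKey
JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey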
Now I can query the aggregated data, for example by day:
/*
Number of times a specific button was clicked
in year 2009, by day
*/
SELECT FullDate
,SUM(ClickCount) AS [Counter]
FROM vClicksAggregate
WHERE ButtonName = 'MyBtn_1'
AND [Year] = 2009
GROUP BY FullDate
ORDER BY FullDate
Or with a few more options
/*
Number of times specific buttons were clicked
in year 2008, on Saturdays, between 9:00 and 11:59 AM
by users from Africa
*/
SELECT SUM(ClickCount) AS [Counter]
FROM vClicksAggregate
WHERE [Year] = 2008
AND [DayOfWeek] = 'Saturday'
AND [Hour] BETWEEN 9 AND 11
AND Continent = 'Africa'
AND ButtonName IN ( 'MyBtn_1', 'MyBtn_2', 'MyBtn_3' )