2

We have a table that contains website page views, for example:

time      | page_id
----------|-----------------------------
1256645862| pageA
1256645889| pageB
1256647199| pageA
1256647198| pageA
1256647300| pageB
1257863235| pageA
1257863236| pageC

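Our production table currently has about 40K rows. For each day, we would like to generate a count of the unique pages viewed in the preceding 30, 60 and 90 days. In the result set we can then look up a day and see how many unique pages were visited in, say, the 60 days leading up to it.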

We were able to get the query working in MSSQL:

SELECT DISTINCT
 CONVERT(VARCHAR,P.NDATE,101) AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-29,P.NDATE) AND P.NDATE) AS SUB) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-59,P.NDATE) AND P.NDATE) AS SUB) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-89,P.NDATE) AND P.NDATE) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

Note: because MSSQL has no FROM_UNIXTIME function, we added an NDATE column for testing, which is simply the converted time. NDATE does not exist in the production table.

Converting this query to MySQL gives us an "Unknown column P.time" error:

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

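I understand this is because we cannot have a correlated subquery that references a table in the outer FROM clause. Unfortunately, we don't know how to convert this query so that it works in MySQL. For now, we simply return all DISTINCT rows from the table and post-process them in PHP. That takes about 2-3 seconds for 40K rows. I'm worried about performance once we have hundreds of thousands of rows.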

Can this be done in MySQL? If so, can we expect it to perform better than our PHP post-processing solution?

UPDATE: Here is the query that creates the table:

CREATE TABLE  `perflog` (
    `user_id` VARBINARY( 40 ) NOT NULL ,
    `elapsed` float UNSIGNED NOT NULL ,
    `page_id` VARCHAR( 255 ) NOT NULL ,
    `time` INT( 10 ) UNSIGNED NOT NULL ,
    `ip` VARBINARY( 40 ) NOT NULL ,
    `agent` VARCHAR( 255 ) NOT NULL ,
    PRIMARY KEY (  `user_id` ,  `page_id` ,  `time` ,  `ip`,  `agent` )
) ENGINE MyISAM

So far our production table has about 40K rows!

4 Answers

1

Note: I am writing this after reading solutions by @astander, @Donnie, @longneck.

I understand that performance is important, but why don't you store aggregates? Ten years at one row per day is only 3650 rows, with just a few columns each.

TABLE dimDate (DateKey int (PK), Year int, Day int, DayOfWeek varchar(10), DayInEpoch....)
TABLE AggVisits (DateKey int (PK,FK), Today int, Last30 int, Last60 int, Last90 int)

This way you would run the query only once at the end of the day, for one day only. Pre-calculated aggregates are at the root of any high-performance analytic solution (cubes).
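
As a rough sketch of that nightly job, assuming MySQL and the `perflog` table from the question: the `AggVisits` layout below follows the outline above but stores `DateKey` as a DATE (an assumption made for readability), and the windows are taken as 30, 60 and 90 calendar days ending with the day being summarised, as in the MSSQL query.

```sql
-- Hypothetical aggregate table; column names follow the AggVisits outline.
CREATE TABLE AggVisits (
    DateKey DATE         NOT NULL PRIMARY KEY,
    Today   INT UNSIGNED NOT NULL,
    Last30  INT UNSIGNED NOT NULL,
    Last60  INT UNSIGNED NOT NULL,
    Last90  INT UNSIGNED NOT NULL
);

-- Run once shortly after midnight to summarise the day that just ended.
-- CASE without ELSE yields NULL, which COUNT(DISTINCT ...) ignores.
INSERT INTO AggVisits (DateKey, Today, Last30, Last60, Last90)
SELECT
    CURDATE() - INTERVAL 1 DAY,
    COUNT(DISTINCT CASE WHEN `time` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
                        THEN page_id END),
    COUNT(DISTINCT CASE WHEN `time` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 30 DAY)
                        THEN page_id END),
    COUNT(DISTINCT CASE WHEN `time` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 60 DAY)
                        THEN page_id END),
    COUNT(DISTINCT page_id)
FROM perflog
WHERE `time` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 90 DAY)  -- oldest day of the 90-day window
  AND `time` <  UNIX_TIMESTAMP(CURDATE());                   -- exclude today's partial data
```

Each run scans at most 90 days of rows and writes a single row, so reporting queries only ever read `AggVisits`. Running the job twice for the same day would hit the primary key, so a scheduler would need to guard against that.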

UPDATE:
You could speed up those queries by introducing another column DayInEpoch int (day number since say 1990-01-01). Then you can remove all those date/time conversion functions.
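
One possible shape for that column, sketched under a few assumptions: the day number is counted from the Unix epoch (1970-01-01 UTC) rather than 1990-01-01 so that it can be derived directly from `time`, the index name `idx_day_page` is made up for illustration, and the application is assumed to fill in `DayInEpoch` for new rows.

```sql
-- Add the day-number column and backfill it from the existing timestamps.
ALTER TABLE perflog ADD COLUMN DayInEpoch INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE perflog SET DayInEpoch = `time` DIV 86400;   -- DIV is integer division

-- Hypothetical index so that range scans by day can also read page_id.
ALTER TABLE perflog ADD INDEX idx_day_page (DayInEpoch, page_id);

-- Unique pages in the 30 days ending today, with no date conversions
-- in the WHERE clause.
SET @day := UNIX_TIMESTAMP() DIV 86400;

SELECT COUNT(DISTINCT page_id) AS Last30
FROM perflog
WHERE DayInEpoch BETWEEN @day - 29 AND @day;
```

Note that `time DIV 86400` buckets rows by UTC day, which may differ from the session time zone used by FROM_UNIXTIME.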

answered 2009-11-20T21:00:01.643
0

Change the subselects to joins, like so:

select
  FROM_UNIXTIME(p.time,'%Y-%m-%d') AS 'DATE',
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perflog p
  join perflog p30 on FROM_UNIXTIME(p30.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(p.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(p.time,'%Y-%m-%d')
  join perflog p60 on FROM_UNIXTIME(p60.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(p.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(p.time,'%Y-%m-%d')
  join perflog p90 on FROM_UNIXTIME(p90.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(p.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(p.time,'%Y-%m-%d')
-- one result row per calendar day
group by
  FROM_UNIXTIME(p.time,'%Y-%m-%d')

However, because the heavy use of functions defeats any index on the date column, this is likely to run slowly. A better solution might be:

-- A regular table is used here because MySQL cannot refer to a TEMPORARY
-- table more than once in the same query; drop perf_tmp when finished.
create table perf_tmp as
select
  FROM_UNIXTIME(time,'%Y-%m-%d') AS 'VIEWDATE',
  page_id
from
  perflog;

create index perf_dt on perf_tmp (VIEWDATE);

select
  p.VIEWDATE,
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perf_tmp p
  join perf_tmp p30 on p30.VIEWDATE BETWEEN DATE_SUB(p.VIEWDATE, INTERVAL 30 DAY) AND p.VIEWDATE
  join perf_tmp p60 on p60.VIEWDATE BETWEEN DATE_SUB(p.VIEWDATE, INTERVAL 60 DAY) AND p.VIEWDATE
  join perf_tmp p90 on p90.VIEWDATE BETWEEN DATE_SUB(p.VIEWDATE, INTERVAL 90 DAY) AND p.VIEWDATE
group by
  p.VIEWDATE;
answered 2009-11-20T16:09:36.783
0

Why are you burying the subquery in a second level like that? Try this:

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 (SELECT COUNT(DISTINCT PAGE_ID) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '30D',
 (SELECT COUNT(DISTINCT PAGE_ID) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '60D',
 (SELECT COUNT(DISTINCT PAGE_ID) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '90D'
FROM perflog P
-- backticks: a single-quoted 'DATE' here would be a string constant, not the alias
ORDER BY `DATE`
answered 2009-11-20T15:55:30.437
0

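You could try using a single select.

Select only the rows that fall between the date and 90 days before it.

Then use a CASE expression in each field to check whether the date falls within the 30-, 60- or 90-day window. For each field, return 1 if the case is true and 0 otherwise, and sum the results.

Something like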

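-- Windows end today; as written this counts page views (rows), not unique pages.
SELECT  SUM(CASE WHEN p.time >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 29 DAY) THEN 1 ELSE 0 END) AS Cnt30,
        SUM(CASE WHEN p.time >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 59 DAY) THEN 1 ELSE 0 END) AS Cnt60,
        SUM(CASE WHEN p.time >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 89 DAY) THEN 1 ELSE 0 END) AS Cnt90
FROM    perflog p
WHERE   p.time >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 89 DAY)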
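
The CASE idea can also be extended to count unique pages for every day in one pass. The following is only a sketch, assuming MySQL and the `perflog` table from the question. The aliases `d`, `v`, `view_date`, `d30`, `d60` and `d90` are made up for illustration, and the windows cover 30, 60 and 90 calendar days including the day itself.

```sql
SELECT
    d.view_date,
    -- CASE without ELSE yields NULL, which COUNT(DISTINCT ...) ignores.
    COUNT(DISTINCT CASE WHEN v.view_date >= d.view_date - INTERVAL 29 DAY
                        THEN v.page_id END) AS d30,
    COUNT(DISTINCT CASE WHEN v.view_date >= d.view_date - INTERVAL 59 DAY
                        THEN v.page_id END) AS d60,
    COUNT(DISTINCT v.page_id)               AS d90
FROM
    -- one row per day that has at least one page view
    (SELECT DISTINCT DATE(FROM_UNIXTIME(`time`)) AS view_date
     FROM perflog) AS d
    JOIN
    -- one row per (day, page) pair, so repeat views do not inflate the join
    (SELECT DISTINCT DATE(FROM_UNIXTIME(`time`)) AS view_date, page_id
     FROM perflog) AS v
      ON v.view_date BETWEEN d.view_date - INTERVAL 89 DAY AND d.view_date
GROUP BY d.view_date
ORDER BY d.view_date;
```

Neither derived table refers to the outer query, so this avoids the "Unknown column" error, and a single range join replaces three separate joins or subqueries. Days with no page views at all do not appear in the output.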
answered 2009-11-20T16:01:18.970