sql - SQL to determine minimum sequential days of access?

Question

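The following User History table contains one record for every day a given user has accessed a website (in a 24-hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website on a given day, no record is generated.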

Id UserId CreationDate
------ ------ ------------
750997 12 2009-07-07 18:42:20.723
750998 15 2009-07-07 18:42:20.927
751000 19 2009-07-07 18:42:22.283

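What I'm looking for is a SQL query on this table, with good performance, that tells me which user IDs have accessed the website for (n) continuous days without missing a day.

In other words, how many users have (n) records in this table with sequential (day-before or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart at 1; we're looking for users who have achieved a continuous number of days here with no gaps.

Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. :)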

score 149 · Accepted Answer

How about (and please make sure the previous statement ended with a semicolon):

WITH numberedrows
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID 
                                       ORDER BY CreationDate)
                - DATEDIFF(day,'19000101',CreationDate) AS TheOffset,
                CreationDate,
                UserID
         FROM   tablename)
SELECT MIN(CreationDate),
       MAX(CreationDate),
       COUNT(*) AS NumConsecutiveDays,
       UserID
FROM   numberedrows
GROUP  BY UserID,
          TheOffset

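The idea is that if we have the list of days (as numbers) and a row_number, then each missed day shifts the offset between these two lists. So we're looking for a range that has a consistent offset.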
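
The offset idea can also be illustrated outside the database. What follows is a minimal Python sketch, not part of the original answer: the function name `streaks` and the sample dates are made up for illustration, and it assumes the visit days of a single user.

```python
from datetime import date
from itertools import groupby

def streaks(days):
    """Group visit days into runs of consecutive dates.

    Mirrors ROW_NUMBER() - DATEDIFF(day, ...): within one run the
    difference between row number and day number stays constant.
    """
    ordered = sorted(set(days))
    # Pair each day with (row number - day number); toordinal() plays the
    # role of DATEDIFF(day, '19000101', CreationDate).
    keyed = [(i - d.toordinal(), d) for i, d in enumerate(ordered, start=1)]
    runs = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        run = [d for _, d in group]
        runs.append((run[0], run[-1], len(run)))  # (first day, last day, length)
    return runs

# Hypothetical sample: 10 July is missing, so there are two runs.
visits = [date(2009, 7, 7), date(2009, 7, 8), date(2009, 7, 9),
          date(2009, 7, 11), date(2009, 7, 12)]
result = streaks(visits)
```

Each tuple corresponds to one output row of the query (MIN, MAX and COUNT per offset group).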

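You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...

I haven't tested this though - I'm just writing it off the top of my head. Hopefully it works in SQL2005 and later.

...and it would be helped a great deal by an index on tablename(UserID, CreationDate)

Edited: Turns out Offset is a reserved word, so I used TheOffset instead.

Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.

Rob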

score 70 · Accepted Answer

The answer is obviously:

SELECT DISTINCT UserId
FROM UserHistory uh1
WHERE (
       SELECT COUNT(*) 
       FROM UserHistory uh2 
       WHERE uh2.CreationDate 
       BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
      ) = @days OR UserId = 52551

EDIT:

Okay, here's my serious answer:

DECLARE @days int
DECLARE @seconds bigint
SET @days = 30
SET @seconds = (@days * 24 * 60 * 60) - 1
SELECT DISTINCT UserId
FROM (
    SELECT uh1.UserId, Count(uh1.Id) as Conseq
    FROM UserHistory uh1
    INNER JOIN UserHistory uh2 ON uh2.CreationDate 
        BETWEEN uh1.CreationDate AND 
            DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
        AND uh1.UserId = uh2.UserId
    GROUP BY uh1.Id, uh1.UserId
    ) as Tbl
WHERE Conseq >= @days

EDIT:

[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!

score 18 · Accepted Answer

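If you can change the table schema, I'd suggest adding a LongestStreak column to the table, set to the number of consecutive days ending on CreationDate. It's easy to update the table at login time (similar to what you are doing already: if no row exists for the current day, you check whether a row exists for the previous day. If so, you store the previous row's LongestStreak plus one in the new row; otherwise, you set it to 1.)

The query becomes obvious after adding this column: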

if exists(select * from table
          where LongestStreak >= 30 and UserId = @UserId)
   -- award the Woot badge.
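
The login-time update described above could look roughly like this. It is a hedged Python sketch, not the answerer's code: the dictionary `history` stands in for the table, and `record_login` is a hypothetical helper name.

```python
from datetime import date, timedelta

def record_login(history, user_id, today):
    """Insert today's row for user_id and return its LongestStreak value.

    `history` maps (user_id, day) -> streak length ending on that day; it
    stands in for the table with the extra column (assumed layout).
    """
    key = (user_id, today)
    if key in history:            # already logged in today: nothing to do
        return history[key]
    previous = history.get((user_id, today - timedelta(days=1)))
    # Continue yesterday's streak if there is one, otherwise start at 1.
    history[key] = previous + 1 if previous is not None else 1
    return history[key]

history = {}
a = record_login(history, 12, date(2009, 7, 7))
b = record_login(history, 12, date(2009, 7, 8))
c = record_login(history, 12, date(2009, 7, 8))   # second visit, same day
d = record_login(history, 12, date(2009, 7, 10))  # 9 July was missed
```

The work happens once per login, so the badge check itself never has to scan the history.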

score 7 · Accepted Answer

Some nicely expressive SQL along the lines of:

select
        userId,
    dbo.MaxConsecutiveDates(CreationDate) as blah
from
    dbo.Logins
group by
    userId

Assuming you have a user-defined aggregate function something like this (beware, this is buggy):

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Runtime.InteropServices;

namespace SqlServerProject1
{
    [StructLayout(LayoutKind.Sequential)]
    [Serializable]
    internal struct MaxConsecutiveState
    {
        public int CurrentSequentialDays;
        public int MaxSequentialDays;
        public SqlDateTime LastDate;
    }

    [Serializable]
    [SqlUserDefinedAggregate(
        Format.Native,
        IsInvariantToNulls = true, //optimizer property
        IsInvariantToDuplicates = false, //optimizer property
        IsInvariantToOrder = false) //optimizer property
    ]
    [StructLayout(LayoutKind.Sequential)]
    public class MaxConsecutiveDates
    {
        /// <summary>
        /// The variable that holds the intermediate result of the concatenation
        /// </summary>
        private MaxConsecutiveState _intermediateResult;

        /// <summary>
        /// Initialize the internal data structures
        /// </summary>
        public void Init()
        {
            _intermediateResult = new MaxConsecutiveState { LastDate = SqlDateTime.MinValue, CurrentSequentialDays = 0, MaxSequentialDays = 0 };
        }

        /// <summary>
        /// Accumulate the next value, not if the value is null
        /// </summary>
        /// <param name="value"></param>
        public void Accumulate(SqlDateTime value)
        {
            if (value.IsNull)
            {
                return;
            }
            int sequentialDays = _intermediateResult.CurrentSequentialDays;
            int maxSequentialDays = _intermediateResult.MaxSequentialDays;
            DateTime currentDate = value.Value.Date;
            if (currentDate.AddDays(-1).Equals(_intermediateResult.LastDate.Value.Date))
                sequentialDays++;
            else
            {
                maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);
                sequentialDays = 1;
            }
            _intermediateResult = new MaxConsecutiveState
                                      {
                                          CurrentSequentialDays = sequentialDays,
                                          LastDate = currentDate,
                                          MaxSequentialDays = maxSequentialDays
                                      };
        }

        /// <summary>
        /// Merge the partially computed aggregate with this aggregate.
        /// </summary>
        /// <param name="other"></param>
        public void Merge(MaxConsecutiveDates other)
        {
            // add stuff for two separate calculations
        }

        /// <summary>
        /// Called at the end of aggregation, to return the results of the aggregation.
        /// </summary>
        /// <returns></returns>
        public SqlInt32 Terminate()
        {
            int max = Math.Max(_intermediateResult.CurrentSequentialDays, _intermediateResult.MaxSequentialDays);
            return new SqlInt32(max);
        }
    }
}

score 5 · Accepted Answer

Seems like you could take advantage of the fact that being continuous over n days requires there to be n rows.

So something like:

SELECT users.UserId, count(1) as cnt
FROM users
WHERE users.CreationDate > now() - INTERVAL 30 DAY
GROUP BY UserId
HAVING cnt = 30

score 4 · Accepted Answer

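Doing this with a single SQL query seems overly complicated to me. Let me break this answer down into two parts.

What you should have done until now and should start doing now:
Run a daily cron job that checks for every user whether he has logged in today, and then increments a counter if he has or sets it to 0 if he hasn't.
What you should do now:
- Export this table to a server that doesn't run your website and won't be needed for a while. ;)
- Sort it by user, then by date.
- Go through it sequentially, keeping a counter...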
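
The sort-and-scan steps can be sketched as follows. This is an illustrative Python version under the question's assumption of at most one row per user per day; `users_with_streak` and the sample rows are made up.

```python
from datetime import date, timedelta

def users_with_streak(rows, n):
    """Return the set of user ids having at least n consecutive visit days.

    rows: iterable of (user_id, day) pairs, at most one per user per day.
    """
    winners = set()
    prev_user, prev_day, counter = None, None, 0
    for user, day in sorted(rows):          # by user, then by date
        if user == prev_user and day == prev_day + timedelta(days=1):
            counter += 1                    # streak continues
        else:
            counter = 1                     # new user or a gap: restart
        if counter >= n:
            winners.add(user)
        prev_user, prev_day = user, day
    return winners

# Hypothetical sample: user 12 has three days in a row, user 15 only two.
rows = [(12, date(2009, 7, 7)), (12, date(2009, 7, 8)), (12, date(2009, 7, 9)),
        (15, date(2009, 7, 7)), (15, date(2009, 7, 9)), (15, date(2009, 7, 10))]
three = users_with_streak(rows, 3)
two = users_with_streak(rows, 2)
```

After the sort, a single pass with one counter is enough, which is why this approach suits an offline export.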

score 3 · Accepted Answer

You could use a recursive CTE (SQL Server 2005+):

WITH recur_date AS (
        SELECT t.userid,
               t.creationDate,
               DATEADD(day, 1, t.creationDate) 'nextDay',
               1 'level' 
          FROM TABLE t
         UNION ALL
        SELECT t.userid,
               t.creationDate,
               DATEADD(day, 1, t.creationDate) 'nextDay',
               rd.level + 1 'level'
          FROM TABLE t
          JOIN recur_date rd on t.creationDate = rd.nextDay AND t.userid = rd.userid)
   SELECT t.*
    FROM recur_date t
   WHERE t.level = @numDays
ORDER BY t.userid

score 3 · Accepted Answer

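Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (Assuming the history table is called dbo.UserHistory and the number of days is @Days.)

Another lead is from SQL Team's blog on runs.

The other idea I've had, though I don't have a SQL server handy to try it on, is to use a CTE with a partitioned ROW_NUMBER like this: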

WITH Runs
AS
  (SELECT UserID
         , CreationDate
         , ROW_NUMBER() OVER(PARTITION BY UserId
                             ORDER BY CreationDate)
           - ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak
                               ORDER BY CreationDate) AS RunNumber
  FROM
     (SELECT UH.UserID
           , UH.CreationDate
           , ISNULL((SELECT TOP 1 1 
              FROM dbo.UserHistory AS Prior 
              WHERE Prior.UserId = UH.UserId 
              AND Prior.CreationDate
                  BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
                  AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
      FROM dbo.UserHistory AS UH) AS Consecutive
)
SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
FROM Runs
GROUP BY UserID, RunNumber
HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days

The above is likely much harder than it has to be, but it may be useful when you have some other definition of "a run" than just dates.

score 3 · Accepted Answer

A couple of SQL Server 2012 options (assuming N=100 below).

;WITH T(UserID, NRowsPrevious)
     AS (SELECT UserID,
                DATEDIFF(DAY, 
                        LAG(CreationDate, 100) 
                            OVER 
                                (PARTITION BY UserID 
                                     ORDER BY CreationDate), 
                         CreationDate)
         FROM   UserHistory)
SELECT DISTINCT UserID
FROM   T
WHERE  NRowsPrevious = 100

Although with my sample data the following worked out more efficient:

;WITH U
         AS (SELECT DISTINCT UserId
             FROM   UserHistory) /*Ideally replace with Users table*/
    SELECT UserId
    FROM   U
           CROSS APPLY (SELECT TOP 1 *
                        FROM   (SELECT 
                                       DATEDIFF(DAY, 
                                                LAG(CreationDate, 100) 
                                                  OVER 
                                                   (ORDER BY CreationDate), 
                                                 CreationDate)
                                FROM   UserHistory UH
                                WHERE  U.UserId = UH.UserID) T(NRowsPrevious)
                        WHERE  NRowsPrevious = 100) O

Both rely on the constraint stated in the question that there is at most one record per day per user.
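
The LAG-based check rests on a small observation that is easy to reproduce in plain Python. The sketch below is only an illustration, with a made-up function name, and it assumes distinct calendar days for a single user.

```python
from datetime import date

def has_run_via_lag(days, n):
    """True if some date is exactly n days after the date n rows earlier.

    days: one user's visit dates, at most one per calendar day. Because the
    dates are distinct, a difference of exactly n days across n rows means
    no day in between was skipped, so those n + 1 rows are consecutive days.
    """
    ordered = sorted(days)
    # ordered[i - n] plays the role of LAG(CreationDate, n).
    return any((ordered[i] - ordered[i - n]).days == n
               for i in range(n, len(ordered)))

# Hypothetical sample: 7-9 July are consecutive, 10 July is missing.
sample = [date(2009, 7, 7), date(2009, 7, 8), date(2009, 7, 9), date(2009, 7, 11)]
lag2 = has_run_via_lag(sample, 2)
lag3 = has_run_via_lag(sample, 3)
```

With duplicate dates for the same day the observation no longer holds, which is why the one-record-per-day constraint matters.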

score 2 · Accepted Answer

If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.

score 1 · Accepted Answer

I used a simple math property to identify who accessed the site consecutively: the number of days spanned from the first access to the last (counting both endpoints) should equal the number of records for that user in the access log.

Here is the SQL script for Oracle DB (it should also work in other DBs with minor changes):

-- show the two quantities being compared, per user 
  select   trunc(max (creation_date)) - trunc(min (creation_date)) + 1
              max_min_days_span,
           count ( * ) real_day_count
    from   user_access_log
group by   user_id;


-- select all users whose visits are all consecutive 
  select   user_id
    from   user_access_log
group by   user_id
  having   trunc(max (creation_date)) - trunc(min (creation_date)) + 1
           = count ( * );



-- get the count of all users whose visits are all consecutive 
  select   count ( * ) user_count
    from   (  select   user_id
                from   user_access_log
            group by   user_id
              having   trunc(max (creation_date)) - trunc(min (creation_date)) + 1
                       = count ( * ));
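
The underlying property can be checked in a few lines: when every visit falls on a distinct day, the visits are all consecutive exactly when the number of days from the first to the last visit, counting both endpoints, equals the number of records. A minimal Python sketch, with a made-up function name and sample data shaped like the seed rows of this answer:

```python
from datetime import date, timedelta

def all_visits_consecutive(days):
    """True if the distinct visit days form one unbroken run."""
    distinct = set(days)
    span = (max(distinct) - min(distinct)).days + 1   # inclusive day span
    return span == len(distinct)

start = date(2009, 7, 7)                               # arbitrary start date
user_12 = [start, start + timedelta(days=1), start + timedelta(days=2)]
user_16 = [start, start + timedelta(days=1), start + timedelta(days=5)]
ok_12 = all_visits_consecutive(user_12)
ok_16 = all_visits_consecutive(user_16)
```

Note that this tests a user's whole history as a single run; it does not find an n-day streak inside a longer history that contains gaps.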

Table prep script:

-- create table 
create table user_access_log (id           number, user_id      number, creation_date date);


-- insert seed data 
insert into user_access_log (id, user_id, creation_date)
  values   (1, 12, sysdate);

insert into user_access_log (id, user_id, creation_date)
  values   (2, 12, sysdate + 1);

insert into user_access_log (id, user_id, creation_date)
  values   (3, 12, sysdate + 2);

insert into user_access_log (id, user_id, creation_date)
  values   (4, 16, sysdate);

insert into user_access_log (id, user_id, creation_date)
  values   (5, 16, sysdate + 1);

insert into user_access_log (id, user_id, creation_date)
  values   (6, 16, sysdate + 5);

score 1 · Accepted Answer

Something like this?

select distinct userid
from table t1, table t2
where t1.UserId = t2.UserId 
  AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
  AND (
    select count(*)
    from table t3
    where t1.UserId  = t3.UserId
      and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n
   ) = n

score 1 · Accepted Answer

declare @startdate as datetime, @days as int
set @startdate = cast('11 Jan 2009' as datetime) -- The startdate
set @days = 5 -- The number of consecutive days

SELECT userid
      ,count(1) as [Number of Consecutive Days]
FROM UserHistory
WHERE creationdate >= @startdate
AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113)  as datetime))
GROUP BY userid
HAVING count(1) >= @days

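The expression cast(convert(char(11), @startdate, 113) as datetime) removes the time part of the date so that we start at midnight.

I would also assume that the creationdate and userid columns are indexed.

I just realized that this won't tell you all the users and their total consecutive days. It will only tell you which users visited on every day of a fixed number of days starting from a date of your choosing.

Revised solution: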

declare @days as int
set @days = 30
select t1.userid
from UserHistory t1
where (select count(1) 
       from UserHistory t3 
       where t3.userid = t1.userid
       and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0) 
       and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0) 
       group by t3.userid
) >= @days
group by t1.userid

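I've checked this and it will query for all users and all dates. It is based on Spencer's first (joke?) solution, but mine works.

Update: improved the date handling in the second solution.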

score 0 · Accepted Answer

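This should do what you want, but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is there to strip the time portion off the datetime field. If you're using SQL Server 2008, then you could use CAST(x.CreationDate AS DATE).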

DECLARE @Range as INT
SET @Range = 10

SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
  FROM tblUserLogin a
WHERE EXISTS
   (SELECT 1
      FROM tblUserLogin b
     WHERE a.userId = b.userId
       AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))
              FROM tblUserLogin c
             WHERE c.userid = b.userid
               AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)) )+@Range-1) = @Range)

Creation script

CREATE TABLE [dbo].[tblUserLogin](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [UserId] [int] NULL,
    [CreationDate] [datetime] NULL
) ON [PRIMARY]

score 0 · Accepted Answer

Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...

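-- @n: required number of days (declare and set it beforehand)
SELECT UserId
FROM (SELECT DISTINCT UserId,
             DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) AS TruncatedCreationDate
      FROM History
      WHERE CreationDate > DATEADD(dd, -@n, GETDATE())) AS d
GROUP BY UserId
HAVING COUNT(TruncatedCreationDate) >= @n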

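Edited to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).

@IDisposable I was looking to use datepart earlier, but I was too lazy to look up the syntax, so I figured I'd use convert instead. I didn't know it had a significant impact. Thanks! Now I know.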

score 0 · Accepted Answer

Spencer almost did it, but this should be the working code:

SELECT DISTINCT UserId
FROM History h1
WHERE (
    SELECT COUNT(*) 
    FROM History
    WHERE UserId = h1.UserId AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
) >= @n

score 0 · Accepted Answer

Off the top of my head, MySQLish:

SELECT start.UserId
FROM UserHistory AS start
  LEFT OUTER JOIN UserHistory AS pre_start ON pre_start.UserId=start.UserId
    AND DATE(pre_start.CreationDate)=DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)
  LEFT OUTER JOIN UserHistory AS subsequent ON subsequent.UserId=start.UserId
    AND DATE(subsequent.CreationDate)<=DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)
WHERE pre_start.Id IS NULL
GROUP BY start.Id
HAVING COUNT(subsequent.Id)=30

Untested, and it almost certainly needs some conversion for MSSQL, but I think it gives some ideas.

score 0 · Accepted Answer

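How about one using Tally tables? It follows a more algorithmic approach, and the execution plan is a breeze. Populate the tally table with the numbers from 1 to 'MaxDaysBehind', i.e. how far back you want to scan the table (for example, 90 will look 3 months back).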

declare @ContinousDays int
set @ContinousDays = 30  -- select those that have 30 consecutive days

create table #tallyTable (Tally int)
insert into #tallyTable values (1)
...
insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan

select [UserId],count(*),t.Tally from HistoryTable 
join #tallyTable as t on t.Tally>0
where [CreationDate]> getdate()-@ContinousDays-t.Tally and 
      [CreationDate]<getdate()-t.Tally 
group by [UserId],t.Tally 
having count(*)>=@ContinousDays

delete #tallyTable

score 0 · Accepted Answer

Assuming a schema that goes like this:

create table dba.visits
(
    id  integer not null,
    user_id integer not null,
    creation_date date not null
);

This will extract contiguous ranges from a date sequence with gaps.

select l.creation_date  as start_d, -- Get first date in contiguous range
    (
        select min(a.creation_date ) as creation_date 
        from "DBA"."visits" a 
            left outer join "DBA"."visits" b on 
                   a.creation_date = dateadd(day, -1, b.creation_date ) and 
                   a.user_id  = b.user_id 
            where b.creation_date  is null and
                  a.creation_date  >= l.creation_date  and
                  a.user_id  = l.user_id 
    ) as end_d -- Get last date in contiguous range
from  "DBA"."visits" l
    left outer join "DBA"."visits" r on 
        r.creation_date  = dateadd(day, -1, l.creation_date ) and 
        r.user_id  = l.user_id 
    where r.creation_date  is null
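
The two anti-joins above amount to: a range starts on a day whose previous day is absent, and ends on the first day at or after the start whose next day is absent. A hedged Python sketch of the same logic for a single user (the function name and sample dates are made up):

```python
from datetime import date, timedelta

def contiguous_ranges(days):
    """Return (start, end) pairs for each run of consecutive dates."""
    present = set(days)
    one_day = timedelta(days=1)
    ranges = []
    for start in sorted(present):
        if start - one_day in present:
            continue                 # not a run start: the previous day exists
        end = start
        while end + one_day in present:
            end += one_day           # walk forward until the next day is missing
        ranges.append((start, end))
    return ranges

# Hypothetical sample with a gap on 10 July and a lone visit on 20 July.
sample_days = [date(2009, 7, 7), date(2009, 7, 8), date(2009, 7, 9),
               date(2009, 7, 11), date(2009, 7, 12), date(2009, 7, 20)]
ranges = contiguous_ranges(sample_days)
```

The length of each range is the end date minus the start date plus one, which can then be compared against the required number of days.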
