17

Suppose I have a table with 3 columns:

  • id (PK, int)
  • timestamp (datetime)
  • title (text)

I have the following records:

1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title

I need to GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.

There is a similar question here: Mysql DateTime group by 15 mins

I also found this: http://www.artfulsoftware.com/infotree/queries.php#106

I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
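To make the fixed-bin problem concrete, here is a minimal Python sketch. The helper name `fixed_bin` and the representation of times as integer seconds past 15:00:00 are assumptions made for illustration only.

```python
# Fixed-width binning: every value lands in bin floor(t / width),
# so bin edges sit at known points (0, 5, 10, ...) regardless of the data.
def fixed_bin(seconds: int, width: int = 5) -> int:
    return seconds // width

# Seconds past 15:00:00 for 15:00:01, 15:00:04 and 15:00:06.
times = [1, 4, 6]
bins = {t: fixed_bin(t) for t in times}
print(bins)  # 1 and 4 share a bin; 6 falls into the next one
```

Here 15:00:04 and 15:00:06 are only 2 seconds apart, yet they end up in different bins because a bin edge falls between them.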

Does this make sense? Please let me know if further clarification is needed.

EDIT: For the set of numbers, {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each grouping row depends on if the row is within 3 seconds of the previous. I know this changes things a bit, I'm sorry for being wishywashy on this.
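The grouping rule in this edit can be sketched in a few lines of Python. This is only an illustration of the intended result, not a SQL solution; the function name `group_by_gap` and the use of plain integers for times are assumptions.

```python
def group_by_gap(values, max_gap=3):
    """Split sorted values into groups; a new group starts whenever the
    gap to the previous value is larger than max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)   # close enough to the previous value
        else:
            groups.append([v])     # gap too large: start a new group
    return groups

groups = group_by_gap([1, 2, 3, 4, 5, 6, 7, 50, 51, 60])
print(groups)
```

Note that 1 and 7 end up together even though they are 6 apart, because each value is compared only with its immediate predecessor.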

I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.

4

5 Answers

18

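I'm using Tom H.'s excellent idea, but doing it a little differently here:

Rather than finding all the rows that are the beginnings of chains, we can find all the times that are the beginnings of chains, then go back and find the rows that match those times.

Query #1 here should tell you which times are the beginnings of chains, by finding the times that have no other time below them within 3 seconds: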

SELECT DISTINCT a.Timestamp
FROM Table a
LEFT JOIN Table b
ON (b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
    AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL

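Then, for each row, we can find the largest chain-starting timestamp that is less than or equal to the row's own timestamp with Query #2: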

SELECT Table.id, MAX(StartOfChains.TimeStamp) AS ChainStartTime
FROM Table
JOIN ([query #1]) StartOfChains
ON Table.Timestamp >= StartOfChains.TimeStamp
GROUP BY Table.id

Once we have that, we can GROUP BY it as desired.

SELECT COUNT(*) -- or whatever
FROM Table
JOIN ([query #2]) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
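The three queries can be mirrored step by step in Python to check the logic on the sample rows from the question. This is a rough sketch, not a replacement for the SQL: the function names are made up, and timestamps are represented as seconds since 2010-01-01 15:00:00 (so row 3 is at 86400).

```python
from collections import Counter

def chain_starts(times, window=3):
    """Query #1: distinct times with no earlier time within `window` seconds."""
    distinct = sorted(set(times))
    return [ts for ts in distinct
            if not any(ts - window <= other < ts for other in distinct)]

def assign_chain_start(rows, starts):
    """Query #2: for each row id, the largest chain start <= the row's time."""
    return {rid: max(s for s in starts if s <= ts) for rid, ts in rows.items()}

# id -> seconds since 2010-01-01 15:00:00 (rows 1, 2, 3 from the question)
rows = {1: 0, 2: 2, 3: 86400}

starts = chain_starts(rows.values())
assignment = assign_chain_start(rows, starts)
counts = Counter(assignment.values())   # Query #3: COUNT(*) per chain start
print(starts, assignment, dict(counts))
```

Rows 1 and 2 share the chain that starts at 0, while row 3 starts its own chain.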

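I'm not entirely sure this is different enough from Tom H.'s answer to be posted separately, but it sounded like you were having trouble with the implementation, and I was thinking about it, so I thought I'd post again. Good luck!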

answered 2011-07-02T09:42:33.913
6

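Now that I think I understand your problem, based on your comment in reply to OMG Ponies, I think I have a set-based solution. The idea is to first find the start of every chain, based on the title. The start of a chain is defined as any row that has no matching row within the three seconds before it: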

SELECT
    MT1.my_id,
    MT1.title,
    MT1.my_time
FROM
    My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
    MT2.title = MT1.title AND
    (
        MT2.my_time < MT1.my_time OR
        (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
    ) AND
    MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
    MT2.my_id IS NULL
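The join condition above can be restated as a small Python check, which may make the tie-breaking on `my_id` easier to see. This is only a sketch: the function name `chain_starters`, the integer-second times, and the extra row 4 (added to show two rows with identical timestamps) are assumptions, not data from the question.

```python
def chain_starters(rows, window=3):
    """rows: list of (my_id, title, my_time) with my_time in seconds.
    A row starts a chain if no row with the same title precedes it
    (earlier time, or same time and smaller id) within `window` seconds."""
    starters = []
    for my_id, title, my_time in rows:
        has_predecessor = any(
            o_title == title
            and (o_time < my_time or (o_time == my_time and o_id < my_id))
            and o_time >= my_time - window
            for o_id, o_title, o_time in rows
        )
        if not has_predecessor:
            starters.append((my_id, title, my_time))
    return starters

rows = [
    (1, "Some Title", 0),
    (2, "Some Title", 2),
    (3, "Some Title", 86400),
    (4, "Some Title", 86400),  # hypothetical row: same timestamp as row 3
]
starter_ids = [r[0] for r in chain_starters(rows)]
print(starter_ids)
```

With the id tie-break, only the lower-id row of an exact-timestamp pair counts as a chain start.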

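Now we can assume that any row that is not a chain starter belongs to the chain starter that appears before it. Since MySQL doesn't support CTEs (this changed in version 8.0), you might want to put the results above into a temporary table, as that would save you the multiple joins to the same subquery below.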

SELECT
    SQ1.my_id,
    COUNT(*)  -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
    MT3.title = SQ1.title AND
    MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ2 ON
    SQ2.title = SQ1.title AND
    SQ2.my_time > SQ1.my_time AND
    SQ2.my_time <= MT3.my_time
WHERE
    SQ2.my_id IS NULL
GROUP BY
    SQ1.my_id

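This would look much simpler if you could use CTEs or a temporary table. Using a temporary table might also help performance.

Also, there will be issues with this if you can have timestamps that match exactly. If that's the case, you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.

EDIT: Changed the queries to handle exact matches by timestamp.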

answered 2011-07-01T19:59:36.940
2

I like @Chris Cunningham's answer, but here's another take on it.

First, my understanding of your problem statement (correct me if I'm wrong):

You want to look at your event log as a sequence, ordered by the time of the event, and partition it into groups, defining the boundary as being an interval of more than 3 seconds between two adjacent rows in the sequence.

I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.

So, first our event log table:

--
-- our event log table
--
create table dbo.eventLog
(
  id       int          not null ,
  dtLogged datetime     not null ,
  title    varchar(200) not null ,

  primary key nonclustered ( id ) ,
  unique clustered ( dtLogged , id ) ,

)

Given the above understanding of the problem statement, the following query should give you the upper and lower bounds of your groups. It's a simple, nested select statement with 2 group by clauses to collapse things:

  • The innermost select defines the upper bound of each group. That upper boundary defines a group.
  • The outer select defines the lower bound of each group.

Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.

[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]

select dtFrom = min( t.dtFrom ) ,
       dtThru =      t.dtThru
from ( select dtFrom = t1.dtLogged ,
              dtThru = min( t2.dtLogged )
       from      dbo.EventLog t1
       left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                and datediff(second,t1.dtLogged,t2.dtLogged) > 3
       group by t1.dtLogged
     ) t
group by t.dtThru

You could then pull rows from the event log and tag them with the group to which they belong thus:

select *
from ( select dtFrom = min( t.dtFrom ) ,
              dtThru =      t.dtThru
       from ( select dtFrom = t1.dtLogged ,
                     dtThru = min( t2.dtLogged )
              from      dbo.EventLog t1
              left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                       and datediff(second,t1.dtLogged,t2.dtLogged) > 3
              group by t1.dtLogged
            ) t
       group by t.dtThru
     ) period
join dbo.EventLog t on t.dtLogged >=           period.dtFrom
                   and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by period.dtFrom , period.dtThru , t.dtLogged

Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.

answered 2011-07-01T19:27:30.637
2

Simple query:

SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);
answered 2013-03-12T15:31:33.230
2

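Warning: long answer. This should work, and it is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it stops doing anything, since we can't do recursive CTEs in MySQL.

I'm going to use this data as the example instead of yours: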

id    Timestamp
1     1:00:00
2     1:00:03
3     1:00:06
4     1:00:10

Here is the first query to write:

SELECT a.id as aid, b.id as bid
FROM Table a
JOIN Table b
ON (ABS(TIMESTAMPDIFF(SECOND, a.Timestamp, b.Timestamp)) <= 3)

It returns:

aid     bid
1       1
1       2
2       1
2       2
2       3
3       2
3       3
4       4
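As a quick check of the pairs above, the same self-join can be sketched in Python. The dictionary of timestamps (in seconds after 1:00:00) and the variable names are illustrative assumptions.

```python
from itertools import product

# id -> seconds after 1:00:00, matching the example data
timestamps = {1: 0, 2: 3, 3: 6, 4: 10}

# Self-join: keep every (aid, bid) whose timestamps differ by at most 3 seconds.
pairs = sorted(
    (a, b)
    for a, b in product(timestamps, repeat=2)
    if abs(timestamps[a] - timestamps[b]) <= 3
)
print(pairs)
```

Rows 3 and 4 are 4 seconds apart, so row 4 is paired only with itself.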

Let's create a nice table to hold those pairs that won't allow duplicates:

CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) -- important for later
)

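Now the challenge is to find something like the transitive closure of that relation.

To do so, let's find the next level of links. By that I mean: since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3: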

INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)

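This is the inelegant part: you'll need to run the INSERT statement above over and over until it no longer adds any rows to the table. I don't know if there is a neat way to do that.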
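The "repeat until nothing changes" loop is easier to see outside SQL. Below is a minimal Python sketch of the same fixed-point iteration; the function name `transitive_closure` is an assumption, and the set of pairs plays the role of the Adjacency table with its primary key preventing duplicates.

```python
def transitive_closure(pairs):
    """Repeat the self-join step until it adds no new pairs."""
    closure = set(pairs)
    while True:
        # One round of the INSERT: join (a, b) with (b, d) to get (a, d).
        new_pairs = {
            (a, d)
            for a, b in closure
            for c, d in closure
            if b == c
        } - closure                  # like INSERT IGNORE: skip existing pairs
        if not new_pairs:            # nothing added, so we are done
            return closure
        closure |= new_pairs

adjacency = {(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2), (3, 3), (4, 4)}
closed = transitive_closure(adjacency)
added = sorted(closed - adjacency)
print(added)
```

For the example data, a single round adds both missing pairs and the second round confirms that nothing else is needed.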

Once this is over, you will have a transitively closed relation like this:

aid     bid
1       1
1       2
1       3     --added
2       1
2       2
2       3
3       1     --added
3       2
3       3
4       4

Now for the punchline:

SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid

returns:

aid     Neighbors
1       1,2,3
2       1,2,3
3       1,2,3
4       4

So

SELECT DISTINCT Neighbors
FROM (
     SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
     FROM Adjacency
     GROUP BY aid
     ) Groupings

returns

Neighbors
1,2,3
4

Wham!

answered 2011-07-01T18:49:22.297