sql - Improving the performance of SQL that uses multiple tables

Question

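I have two tables: Log(id,user,action,date) and ActionTypes(action,type). Given an action A0 and a type T0, I want to count, for each user, how many times she used each other action Ai right after A0, skipping the Log actions that are not of type T0. For example: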

Log:

id   user   action        date
----------------------------------------
1    mary   start   2012-07-16 08:00:00
2    mary   open    2012-07-16 09:00:00
3    john   start   2012-07-16 09:00:00
4    mary   play    2012-07-16 10:00:00
5    john   open    2012-07-16 10:30:00
6    mary   start   2012-07-16 11:00:00
7    mary   jump    2012-07-16 12:00:00
8    mary   close   2012-07-16 13:00:00
9    mary   delete  2012-07-16 14:00:00
10   mary   start   2012-07-16 15:00:00
11   mary   open    2012-07-16 16:00:00

ActionTypes:

action  type
--------------
start   0
open    1
play    1
jump    2
close   1
delete  1

So, given the action 'start' and the type '1', the answer would be:

user   action    ntimes
------------------------
mary   open      2
mary   close     1
john   open      1

My attempt is

SELECT b.user,b.action, count(*)
FROM log a, log b
WHERE a.action='start' AND b.date>a.date AND a.user=b.user AND
      1=(select type from ActionTypes where action=b.action) AND
      not exists (SELECT c.action FROM log c where c.user=a.user AND                  
                  c.date>a.date and c.date<b.date and                            
                  1=(select type from ActionTypes where action=c.action))
GROUP BY b.user,b.action

Our Log table has about 1 million tuples, and the query works but is too slow. We are using SQL Server. Any hints on how to make it faster? Thanks

score 3 · Accepted Answer

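Can you try this query? It uses exists to test whether the chronologically previous record is of the requested type. I believe it will be faster than a self-join. I have put a demo @ Sql Fiddle.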

select log.[user], log.action, count(*) ntimes
  from log
 inner join actiontype t
    on log.action = t.action
 where t.type = 1
   and exists (select *
                 from 
                   (select top 1 t1.type
                      from log l1
                     inner join actiontype t1
                        on l1.action = t1.action
                     where l1.[user] = log.[user]
                       and l1.date < log.date
                       and t1.type in (0, 1)
                     order by l1.date desc
                   ) prevEntry
                where prevEntry.type = 0
               )
 group by log.[user], log.action

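I don't understand why mary\close is in the result list. The previous record is jump, which is of type 2, and it should not be skipped over to get to start.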
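The "look at the chronologically previous record" idea can also be sketched with the LAG window function. This is only a sketch under assumptions: it needs SQL Server 2012 or later, it uses the table names Log and ActionTypes from the question, and the variable names and their types are made up for illustration. Rows whose type is not the requested one are dropped before LAG runs, so a type-2 action such as jump is skipped.

```sql
-- Assumed parameters; names and types are illustrative only.
DECLARE @Action varchar(20) = 'start';
DECLARE @Type   int         = 1;

WITH relevant AS (
  SELECT
    l.[user],
    l.action,
    t.type,
    -- previous action of the same user among the rows that survive the filter
    prev_action = LAG(l.action) OVER (PARTITION BY l.[user] ORDER BY l.[date])
  FROM Log l
    INNER JOIN ActionTypes t ON t.action = l.action
  -- keep only the anchor action and actions of the requested type
  WHERE l.action = @Action
     OR t.type   = @Type
)
SELECT
  [user],
  action,
  ntimes = COUNT(*)
FROM relevant
WHERE prev_action = @Action
  AND type = @Type
GROUP BY
  [user],
  action;
```

One design difference to be aware of: each qualifying row is counted once here, whereas a self-join counts it once per preceding @Action row when several of them occur back to back. Which behaviour is wanted depends on the requirement.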

score 3 · Accepted Answer

Borrowing @Nikola Markovinović's setup, I came up with the following solution:

WITH ranked AS (
  SELECT
    L1.[user],
    L2.action,
    rnk = ROW_NUMBER() OVER (PARTITION BY L1.id ORDER BY L2.date)
  FROM Log L1
    INNER JOIN Log L2 ON L2.[user] = L1.[user] AND L2.date > L1.date
    INNER JOIN ActionType at ON L2.action = at.action
  WHERE L1.action = @Action
    AND at.type   = @Type
)
SELECT
  [user],
  action,
  ntimes = COUNT(*)
FROM ranked
WHERE rnk = 1
GROUP BY
  [user],
  action
;

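Basically, this query selects from the Log table all user records with the specified action, then joins that subset back to Log to retrieve all the actions of the specified type that follow the records in the first subset, ranking them in ascending order of date (using the ROW_NUMBER() function). The query then retrieves only the rows with a ranking of 1, groups them by user and action, and counts the rows in each group.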

You can view (and play with) a working example at SQL Fiddle.

score 2 · Accepted Answer

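Your query would be much faster if action and all the relational fields were integers instead of strings.

The only way to speed up the query is to change the structure of the database. The relations must be indexed and must be integers rather than strings. For example, something like this: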

id   user   action        date
----------------------------------------
1    mary   1   2012-07-16 08:00:00
2    mary   2   2012-07-16 09:00:00
3    john   1   2012-07-16 09:00:00
4    mary   3   2012-07-16 10:00:00
5    john   2   2012-07-16 10:30:00
6    mary   1   2012-07-16 11:00:00
7    mary   4   2012-07-16 12:00:00
8    mary   5   2012-07-16 13:00:00
9    mary   6   2012-07-16 14:00:00
10   mary   1   2012-07-16 15:00:00
11   mary   2   2012-07-16 16:00:00

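would solve your problem.

Also, if you have only a handful of action types (1-9), you can store the action as a tinyint, and if you add an id column as a tinyint primary key, your queries will certainly be easier (using simple joins) and your database will be more flexible for future changes. For example, you could have: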

id action  type
--------------
1  start   0
2  open    1
3  play    1
4  jump    2
5  close   1
6  delete  1

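where id is the primary key, and "action" in the "Log" table is a foreign key to that id.

I think the main problem is that you have no indexes and no foreign key relationships.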
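The structure described above could be sketched as follows. The column types, lengths, constraint name, and index name are assumptions chosen for illustration and do not come from the original schema; the covering index on ([user], [date]) is one plausible choice for "previous or next action of the same user" lookups, not a measured recommendation.

```sql
-- Hypothetical DDL: types, lengths, and object names are assumptions.
CREATE TABLE ActionTypes (
    id     tinyint     NOT NULL PRIMARY KEY,
    action varchar(20) NOT NULL UNIQUE,
    type   tinyint     NOT NULL
);

CREATE TABLE Log (
    id     int         NOT NULL PRIMARY KEY,
    [user] varchar(50) NOT NULL,
    -- integer action id instead of the action name
    action tinyint     NOT NULL
        CONSTRAINT FK_Log_ActionTypes REFERENCES ActionTypes (id),
    [date] datetime    NOT NULL
);

-- Supports per-user lookups ordered by date without touching the base table.
CREATE INDEX IX_Log_user_date
    ON Log ([user], [date])
    INCLUDE (action);
```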

score 0 · Accepted Answer

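I completely disagree with the following statements:

...much faster as integers instead of strings

This is not entirely true; once the action column is indexed, there is hardly any difference between integers and strings.

...the only way to speed up the query is to change the structure of the database

In this case, the query can be optimized in several ways:
- Avoid filtering the joined data set (Log x ActionTypes) and try to filter earlier (in the example below, the filtering happens in the inner sub-select).
- Avoid repeating filter conditions (where). Even though SQL Server will optimize this internally, repetition in a query usually indicates that you are performing the same calculation several times, and most of the time you can find a solution where the condition is stated only once (in the example below, the where condition is placed before the group by).
- Your best friend is the "SQL query analyzer (optimizer)". It is a built-in tool in SQL Server Management Studio that shows you the execution cost of a SQL query, taking the data volume into account. It really is a great tool for finding the bottlenecks in a query.
Here is a simplified query that will produce the result you need (it was written and tested on Oracle, because it has been a while since I used MS SQL Server):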

select
  "user",
  action,
  count(*)
from action_log
where action not in ( --exclusion criteria
    select action_type."action" from action_type where action_type."type" = 1
)
group by "user", action
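To follow the tip about inspecting execution cost, one option in SQL Server is to switch on I/O and timing statistics around the query under test. A minimal sketch, assuming the Log table from the question; the query in the middle is only a placeholder for whichever candidate you want to measure.

```sql
-- Report logical reads and CPU/elapsed time in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Placeholder: substitute the query you want to measure.
SELECT [user], action, COUNT(*) AS ntimes
FROM Log
GROUP BY [user], action;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

In SQL Server Management Studio, enabling "Include Actual Execution Plan" before running the batch additionally shows the plan with the relative cost of each operator.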
