
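I am trying to build infrastructure for quickly running regressions on demand, pulling apache requests from a database that contains all historical activity on our web servers. To improve coverage by making sure we still regress requests from our smaller customers, I want to ensure a distribution of requests by retrieving at most n (for the sake of this question, say 10) requests per customer.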

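I have found a number of similar questions answered here, the closest of which seems to be "SQL query to return top N rows per ID across a range of IDs", but the answers are mostly performance-agnostic solutions that I have already tried. For example, the row_number() analytic function gives us exactly the data we are looking for: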

SELECT
    *
FROM
    (
    SELECT
        dailylogdata.*,
        row_number() over (partition by dailylogdata.contextid order by occurrencedate) rn
    FROM
        dailylogdata
    WHERE
        shorturl in (?)
    )
WHERE
    rn <= 10;

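However, given that this table contains millions of entries for a given day, and that this approach has to read every row matching our selection criteria from the index before it can apply the row_number analytic function, performance is terrible. We end up selecting nearly a million rows, only to discard the vast majority of them because their row_number exceeds 10. Statistics from executing the above query: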

|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation                            | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
||   0 | SELECT STATEMENT                     |                         |      1 |        |  12222 |00:09:08.94 |     895K|    584K|    301 |       |       |          |         ||
||*  1 |  VIEW                                |                         |      1 |   4427K|  12222 |00:09:08.94 |     895K|    584K|    301 |       |       |          |         ||
||*  2 |   WINDOW SORT PUSHED RANK            |                         |      1 |   4427K|  13536 |00:09:08.94 |     895K|    584K|    301 |  2709K|   743K|   97M (1)|    4096 ||
||   3 |    PARTITION RANGE SINGLE            |                         |      1 |   4427K|    932K|00:22:27.90 |     895K|    584K|      0 |       |       |          |         ||
||   4 |     TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA            |      1 |   4427K|    932K|00:22:27.61 |     895K|    584K|      0 |       |       |          |         ||
||*  5 |      INDEX RANGE SCAN                | DAILYLOGDATA_URLCONTEXT |      1 |  17345 |    932K|00:00:00.75 |    1448 |      0 |      0 |       |       |          |         ||
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                                                                 |
|Predicate Information (identified by operation id):                                                                                                                              |
|---------------------------------------------------                                                                                                                              |
|                                                                                                                                                                                 |
|   1 - filter("RN"<=:SYS_B_2)                                                                                                                                                    |
|   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "DAILYLOGDATA"."CONTEXTID" ORDER BY "OCCURRENCEDATE")<=:SYS_B_2)                                                                  |
|   5 - access("SHORTURL"=:P1)                                                                                                                                                    |
|                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
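
To make the cost concrete, here is a toy reproduction. It assumes Python's bundled sqlite3 module (built against SQLite 3.25 or later, for window functions) as a stand-in for Oracle, and the table contents are made up. It only shows the shape of the problem: every row matching shorturl is numbered, and most are thrown away afterward.

```python
import sqlite3


def top_n_per_context(conn, shorturl, n=10):
    """Return at most n rows per contextid for one shorturl (row_number approach)."""
    sql = """
        SELECT contextid, occurrencedate
        FROM (
            SELECT contextid, occurrencedate,
                   row_number() OVER (PARTITION BY contextid
                                      ORDER BY occurrencedate) AS rn
            FROM dailylogdata
            WHERE shorturl = ?
        )
        WHERE rn <= ?
    """
    return conn.execute(sql, (shorturl, n)).fetchall()


# Made-up data: one large customer (500 requests) and one small one (3 requests).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dailylogdata (shorturl TEXT, contextid INTEGER, occurrencedate INTEGER)"
)
rows = [("/a", 1, i) for i in range(500)] + [("/a", 2, i) for i in range(3)]
conn.executemany("INSERT INTO dailylogdata VALUES (?, ?, ?)", rows)

kept = top_n_per_context(conn, "/a")
matched = conn.execute(
    "SELECT count(*) FROM dailylogdata WHERE shorturl = ?", ("/a",)
).fetchone()[0]
print(f"matched {matched} rows, kept {len(kept)}")
```

The ratio of matched to kept rows is what hurts in the real plan: the window step sees all 932K rows even though only about 12K survive the rn filter.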

However, if we query only the first 10 results for one specific contextid, we can do this dramatically faster:

SELECT
    *
FROM
    (
    SELECT
        dailylogdata.*
    FROM
        dailylogdata
    WHERE
        shorturl in (?)
        and contextid = ?
    )
WHERE
    rownum <= 10;

Statistics from running this query:

|-------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation                           | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers ||
|-------------------------------------------------------------------------------------------------------------------------|
||   0 | SELECT STATEMENT                    |                         |      1 |        |     10 |00:00:00.01 |      14 ||
||*  1 |  COUNT STOPKEY                      |                         |      1 |        |     10 |00:00:00.01 |      14 ||
||   2 |   PARTITION RANGE SINGLE            |                         |      1 |     10 |     10 |00:00:00.01 |      14 ||
||   3 |    TABLE ACCESS BY LOCAL INDEX ROWID| DAILYLOGDATA            |      1 |     10 |     10 |00:00:00.01 |      14 ||
||*  4 |     INDEX RANGE SCAN                | DAILYLOGDATA_URLCONTEXT |      1 |      1 |     10 |00:00:00.01 |       5 ||
|-------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                         |
|Predicate Information (identified by operation id):                                                                      |
|---------------------------------------------------                                                                      |
|                                                                                                                         |
|   1 - filter(ROWNUM<=10)                                                                                                |
|   4 - access("SHORTURL"=:P1 AND "CONTEXTID"=TO_NUMBER(:P2))                                                             |
|                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------+
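
The COUNT STOPKEY step in the plan above behaves much like a lazy iterator that is abandoned once enough rows have arrived. As a loose analogy only (this is plain Python, not how Oracle is implemented, and the names are made up), the effect can be sketched like this:

```python
from itertools import islice


def range_scan(matching_entries, visited):
    """Lazily yield index entries, counting how many were actually visited."""
    for entry in matching_entries:
        visited[0] += 1
        yield entry


visited = [0]  # mutable counter shared with the generator
# Pretend one million index entries match, but the consumer stops after 10.
first_ten = list(islice(range_scan(range(1_000_000), visited), 10))
print(len(first_ten), visited[0])
```

Only the entries that were actually requested are visited, which matches the 10 A-Rows and 14 buffers in the plan.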

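In this case, Oracle is smart enough to stop retrieving data once it has 10 results. I could gather the complete set of contextids and programmatically generate a query consisting of one instance of this query per contextid, with the whole mess glued together by union all. But given the sheer number of contextids we might run into an internal Oracle limit, and even if we don't, the approach reeks of kludge.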
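
For reference, the generated-query idea could look roughly like the following sketch. The function name and the bind names (:shorturl, :ctx0, ...) are hypothetical, and nothing here runs against a database; it only builds the SQL text and a dictionary of bind values.

```python
def build_union_all(shorturl, contextids, n=10):
    """Build one 'first n rows' branch per contextid, joined by UNION ALL.

    Returns (sql, binds). Values are passed as named binds rather than
    being concatenated into the SQL text; only the row limit is inlined.
    """
    if not contextids:
        raise ValueError("need at least one contextid")
    binds = {"shorturl": shorturl}
    branches = []
    for i, contextid in enumerate(contextids):
        name = f"ctx{i}"  # hypothetical bind name, one per branch
        binds[name] = contextid
        branches.append(
            "SELECT * FROM (SELECT d.* FROM dailylogdata d"
            f" WHERE d.shorturl = :shorturl AND d.contextid = :{name})"
            f" WHERE rownum <= {int(n)}"
        )
    return "\nUNION ALL\n".join(branches), binds


sql, binds = build_union_all("/a", [101, 102, 103])
print(sql)
```

The statement text and the number of binds both grow linearly with the number of contextids, which is exactly the concern raised above.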

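Does anyone know of an approach that keeps the simplicity of the first query while retaining performance commensurate with the second? Also note that I don't actually care about retrieving a stable set of rows; as long as they satisfy my criteria, they are fine for regression purposes.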

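Edit: Adam Musch's suggestion did the trick. I am appending the performance results with his changes here because I can't fit them into a comment reply to his answer. I am also testing against a larger data set this time; here are the (cached) statistics from my original row_number approach for comparison: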

|-------------------------------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation                     | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
||   0 | SELECT STATEMENT              |                   |      1 |        |  12624 |00:00:22.34 |    1186K|    931K|       |       |          ||
||*  1 |  VIEW                         |                   |      1 |   1163K|  12624 |00:00:22.34 |    1186K|    931K|       |       |          ||
||*  2 |   WINDOW NOSORT               |                   |      1 |   1163K|   1213K|00:00:21.82 |    1186K|    931K|  3036M|    17M|          ||
||   3 |    TABLE ACCESS BY INDEX ROWID| TWTEST            |      1 |   1163K|   1213K|00:00:20.41 |    1186K|    931K|       |       |          ||
||*  4 |     INDEX RANGE SCAN          | TWTEST_URLCONTEXT |      1 |   1163K|   1213K|00:00:00.81 |    8568 |      0 |       |       |          ||
|-------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                                 |
|Predicate Information (identified by operation id):                                                                                              |
|---------------------------------------------------                                                                                              |
|                                                                                                                                                 |
|   1 - filter("RN"<=10)                                                                                                                          |
|   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "CONTEXTID" ORDER BY  NULL )<=10)                                                                 |
|   4 - access("SHORTURL"=:P1)                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

I took the liberty of simplifying Adam's suggestion a bit. Here is the modified query...

select
    *
from
    twtest
where
    rowid in (
    select
            rowid
    from (
            select
                    rowid,
                    shorturl,
                    row_number() over (partition by shorturl, contextid
                                                      order by null) rn
            from
                    twtest
    )
    where rn <= 10
    and shorturl in (?)
);

...and the statistics from its (cached) evaluation:

|--------------------------------------------------------------------------------------------------------------------------------------|
|| Id  | Operation                   | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem ||
|--------------------------------------------------------------------------------------------------------------------------------------|
||   0 | SELECT STATEMENT            |                   |      1 |        |  12624 |00:00:01.33 |   19391 |       |       |          ||
||   1 |  NESTED LOOPS               |                   |      1 |      1 |  12624 |00:00:01.33 |   19391 |       |       |          ||
||   2 |   VIEW                      | VW_NSO_1          |      1 |   1163K|  12624 |00:00:01.27 |    6770 |       |       |          ||
||   3 |    HASH UNIQUE              |                   |      1 |      1 |  12624 |00:00:01.27 |    6770 |  1377K|  1377K| 5065K (0)||
||*  4 |     VIEW                    |                   |      1 |   1163K|  12624 |00:00:01.25 |    6770 |       |       |          ||
||*  5 |      WINDOW NOSORT          |                   |      1 |   1163K|   1213K|00:00:01.09 |    6770 |   283M|  5598K|          ||
||*  6 |       INDEX RANGE SCAN      | TWTEST_URLCONTEXT |      1 |   1163K|   1213K|00:00:00.40 |    6770 |       |       |          ||
||   7 |   TABLE ACCESS BY USER ROWID| TWTEST            |  12624 |      1 |  12624 |00:00:00.04 |   12621 |       |       |          ||
|--------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                      |
|Predicate Information (identified by operation id):                                                                                   |
|---------------------------------------------------                                                                                   |
|                                                                                                                                      |
|   4 - filter("RN"<=10)                                                                                                               |
|   5 - filter(ROW_NUMBER() OVER ( PARTITION BY "SHORTURL","CONTEXTID" ORDER BY NULL NULL )<=10)                                       |
|   6 - access("SHORTURL"=:P1)                                                                                                         |
|                                                                                                                                      |
|Note                                                                                                                                  |
|-----                                                                                                                                 |
|   - dynamic sampling used for this statement (level=2)                                                                               |
|                                                                                                                                      |
+--------------------------------------------------------------------------------------------------------------------------------------+

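As advertised, we only access the dailylogdata table for the fully filtered rows. I am concerned that it still appears to be doing a full scan of the urlcontext index, judging by the number of rows it claims to select (1213K), but given that it only uses 6770 buffers (even when I increase the number of per-context results), this may be misleading.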

4 Answers

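This is kind of a janky solution, but it seems to do what you want: short-circuit the index scan as early as possible, and don't read the table data until a row has qualified under both the filter criteria and the top-n criteria.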

Note that it was tested with a shorturl = condition rather than a shorturl IN condition.

with rowid_list as
(select rowid
   from (select *
           from (select rowid,
                        row_number() over (partition by shorturl, contextid
                                           order by null) rn
                   from dailylogdata
                )
          where rn <= 10
        )
  where shorturl = ? 
)
select * 
  from dailylogdata
 where rowid in (select rowid from rowid_list)

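The with clause grabs the first 10 rowids, filtering with a WINDOW NOSORT, for each unique combination of shorturl and contextid that meets your criteria. Then it loops over that set of rowids, fetching each row by rowid.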

----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                      |     1 |   286 |  1536   (1)| 00:00:19 |
|   1 |  NESTED LOOPS               |                      |     1 |   286 |  1536   (1)| 00:00:19 |
|   2 |   VIEW                      | VW_NSO_1             |   136K|  1596K|   910   (1)| 00:00:11 |
|   3 |    HASH UNIQUE              |                      |     1 |  3326K|            |          |
|*  4 |     VIEW                    |                      |   136K|  3326K|   910   (1)| 00:00:11 |
|*  5 |      WINDOW NOSORT          |                      |   136K|  2794K|   910   (1)| 00:00:11 |
|*  6 |       INDEX RANGE SCAN      | TABLE_REDACTED_INDEX |   136K|  2794K|   910   (1)| 00:00:11 |
|   7 |   TABLE ACCESS BY USER ROWID| TABLE_REDACTED       |     1 |   274 |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("RN"<=10)
   5 - filter(ROW_NUMBER() OVER ( PARTITION BY "CLIENT_ID","SCE_ID" ORDER BY NULL NULL
              )<=10)
   6 - access("TABLE_REDACTED"."SHORTURL"=:b1)
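
The two-step shape described in this answer (collect rowids from the index first, then fetch only those rows) can be tried out on toy data. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later) as a stand-in for Oracle, with made-up rows. SQLite lets row_number() omit ORDER BY, which plays the role of the "order by null" in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twtest (shorturl TEXT, contextid INTEGER, payload TEXT)")
conn.execute("CREATE INDEX twtest_urlcontext ON twtest (shorturl, contextid)")
data = (
    [("/a", 1, f"big-{i}") for i in range(500)]      # large customer
    + [("/a", 2, f"small-{i}") for i in range(3)]    # small customer
    + [("/b", 1, "other-url")]                       # different shorturl
)
conn.executemany("INSERT INTO twtest VALUES (?, ?, ?)", data)

# Step 1: number rows per (shorturl, contextid) using only indexed columns
# plus rowid, and keep the rowids of the first n rows in each group.
ROWID_QUERY = """
    SELECT rid FROM (
        SELECT rowid AS rid, shorturl,
               row_number() OVER (PARTITION BY shorturl, contextid) AS rn
        FROM twtest
    )
    WHERE rn <= ? AND shorturl = ?
"""


def fetch_sample(conn, shorturl, n=10):
    """Fetch full rows only for the rowids that survived the top-n filter."""
    rowids = [r[0] for r in conn.execute(ROWID_QUERY, (n, shorturl))]
    # Step 2: one lookup by rowid per surviving row.
    return [
        conn.execute(
            "SELECT contextid, payload FROM twtest WHERE rowid = ?", (rid,)
        ).fetchone()
        for rid in rowids
    ]


sample = fetch_sample(conn, "/a")
print(len(sample))
```

Whether SQLite actually answers step 1 from the index alone is not checked here; the sketch only demonstrates that the result set is the same "at most n per contextid" sample.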
answered 2012-03-02T21:52:39.077

I think you should also check other ways/queries that achieve the same result set.


Self-JOIN / GROUP BY

SELECT
    d.*
  , COUNT(*) AS rn

FROM 
        dailylogdata d              -- Oracle does not accept AS before a table alias
    LEFT OUTER JOIN
        dailylogdata d2 
            ON  d.contextid = d2.contextid 
            AND d.occurrencedate >= d2.occurrencedate
            AND d2.shorturl IN (?)

WHERE
    d.shorturl IN (?)

GROUP BY 
    d.*                             -- shorthand: list every column of d explicitly

HAVING 
    COUNT(*) <= 10

And another one, which I'm not sure works correctly:

SELECT
    d.*
  , COUNT(*) AS rn

FROM 
        ( SELECT DISTINCT
              contextid
          FROM 
              dailylogdata 
          WHERE
              shorturl IN (?)
        ) AS dd 
    JOIN
        dailylogdata AS d
            ON  d.PK IN 
                ( SELECT
                      d10.PK
                  FROM
                      dailylogdata AS d10  
                  WHERE
                      d10.contextid = dd.contextid 
                    AND
                      d10.shorturl IN (?)
                    AND
                      rownum <= 10
                  ORDER BY 
                      d10.occurrencedate
                )
answered 2012-02-28T19:15:49.113

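It seems to be the sort that is taking all the time. Is occurrenceDate your clustered index? If not, is it any faster if you change the order by to use the clustered index? That is, if the table is clustered by a sequential id, order by that instead.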

answered 2012-02-28T16:04:30.787

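Last time I simply cached the last, most interesting rows in a smaller table. With my data distribution, it was cheaper to update the cache table on every insert than to query the bulk table.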
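
As a rough illustration of that idea, here is an in-memory sketch rather than an actual cache table. The class and method names are hypothetical; in a real setup the same bookkeeping would live in a small table maintained alongside each insert.

```python
from collections import defaultdict, deque


class RecentRequestCache:
    """Keep only the most recent `per_key` rows for each (shorturl, contextid)."""

    def __init__(self, per_key=10):
        # A bounded deque drops its oldest entry automatically when full.
        self._rows = defaultdict(lambda: deque(maxlen=per_key))

    def record(self, shorturl, contextid, row):
        """Call on every insert into the bulk table."""
        self._rows[(shorturl, contextid)].append(row)

    def sample(self, shorturl):
        """Return the cached rows for one shorturl across all contextids."""
        return [
            row
            for (url, _contextid), rows in self._rows.items()
            if url == shorturl
            for row in rows
        ]


cache = RecentRequestCache(per_key=10)
for i in range(500):
    cache.record("/a", 1, f"big-{i}")
for i in range(3):
    cache.record("/a", 2, f"small-{i}")

result = cache.sample("/a")
print(len(result))
```

The trade-off is the one the answer names: a little extra work on every insert in exchange for never scanning the bulk table at sampling time.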

answered 2012-02-28T16:44:14.880