I'm trying to build some infrastructure for quickly running regressions on demand, pulling apache requests out of a database that contains all the historic activity on our webservers. To improve coverage by making sure we still regress requests from our smaller customers, I'd like to ensure a distribution of requests by retrieving at most n (for the sake of this question, say 10) requests for each customer.
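For context, here's a rough sketch of the schema these queries run against. The column and index names come from the queries and execution plans below; the column types, the payload column, and the exact partitioning scheme are assumptions on my part rather than our real DDL:

-- Hypothetical approximation of the real table. The plans below show it is
-- range-partitioned (PARTITION RANGE SINGLE) with a local index on
-- (shorturl, contextid); everything else here is illustrative.
create table dailylogdata (
  contextid      number        not null,  -- one value per customer context
  shorturl       varchar2(512) not null,  -- the request's URL key
  occurrencedate date          not null,  -- when the request was logged
  requestdata    clob                     -- stand-in for the request payload
)
partition by range (occurrencedate) (
  partition p_all values less than (maxvalue)
);

create index dailylogdata_urlcontext
  on dailylogdata (shorturl, contextid) local;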
I've found a number of similar questions answered here, the closest of which seems to be SQL query to return top N rows per ID across a range of IDs, but the answers were mostly performance-agnostic variations on solutions I'd already tried. For instance, the row_number() analytic function gets us exactly the data we're looking for:
SELECT
  *
FROM
  (
    SELECT
      dailylogdata.*,
      row_number() over (partition by dailylogdata.contextid
                         order by occurrencedate) rn
    FROM
      dailylogdata
    WHERE
      shorturl in (?)
  )
WHERE
  rn <= 10;
However, given that this table contains millions of entries for any given day, and this approach has to read every row matching our selection criteria from the index just so it can apply the row_number analytic, performance is horrible. We end up selecting nearly a million rows, only to throw the vast majority of them away because their row_number exceeded 10. Stats from executing the above query:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                         |      1 |        |  12222 |00:09:08.94 |    895K |  584K |    301 |       |       |          |          |
|*  1 |  VIEW                                 |                         |      1 |  4427K |  12222 |00:09:08.94 |    895K |  584K |    301 |       |       |          |          |
|*  2 |   WINDOW SORT PUSHED RANK             |                         |      1 |  4427K |  13536 |00:09:08.94 |    895K |  584K |    301 | 2709K |  743K |  97M (1) |     4096 |
|   3 |    PARTITION RANGE SINGLE             |                         |      1 |  4427K |   932K |00:22:27.90 |    895K |  584K |      0 |       |       |          |          |
|   4 |     TABLE ACCESS BY LOCAL INDEX ROWID | DAILYLOGDATA            |      1 |  4427K |   932K |00:22:27.61 |    895K |  584K |      0 |       |       |          |          |
|*  5 |      INDEX RANGE SCAN                 | DAILYLOGDATA_URLCONTEXT |      1 |  17345 |   932K |00:00:00.75 |    1448 |     0 |      0 |       |       |          |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"<=:SYS_B_2)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "DAILYLOGDATA"."CONTEXTID" ORDER BY "OCCURRENCEDATE")<=:SYS_B_2)
   5 - access("SHORTURL"=:P1)
However, if we instead query only the first 10 results for one specific contextid, we can execute this dramatically faster:
SELECT
  *
FROM
  (
    SELECT
      dailylogdata.*
    FROM
      dailylogdata
    WHERE
      shorturl in (?)
      and contextid = ?
  )
WHERE
  rownum <= 10;
Stats from running this query:
-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                         |      1 |        |     10 |00:00:00.01 |      14 |
|*  1 |  COUNT STOPKEY                       |                         |      1 |        |     10 |00:00:00.01 |      14 |
|   2 |   PARTITION RANGE SINGLE             |                         |      1 |     10 |     10 |00:00:00.01 |      14 |
|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID | DAILYLOGDATA            |      1 |     10 |     10 |00:00:00.01 |      14 |
|*  4 |     INDEX RANGE SCAN                 | DAILYLOGDATA_URLCONTEXT |      1 |      1 |     10 |00:00:00.01 |       5 |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=10)
   4 - access("SHORTURL"=:P1 AND "CONTEXTID"=TO_NUMBER(:P2))
In this instance, Oracle is smart enough (note the COUNT STOPKEY step) to stop retrieving data once it has found 10 results. I could gather the complete set of contextids and programmatically generate one instance of this query per contextid, union all-ing the whole mess together, but given the sheer number of contextids we might run into an internal Oracle limit, and even if we didn't, the approach reeks of kludge.
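For illustration, the generated statement would be shaped like the following, with one branch per contextid (the contextid literals here are made up):

select * from (
  select dailylogdata.* from dailylogdata
  where shorturl in (?) and contextid = 1001
) where rownum <= 10
union all
select * from (
  select dailylogdata.* from dailylogdata
  where shorturl in (?) and contextid = 1002
) where rownum <= 10;
-- ...and so on, repeated once for every contextid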
Does anyone know of an approach that keeps the simplicity of the first query while retaining performance commensurate with the second? Note also that I don't actually care about retrieving a stable set of rows; as long as they satisfy my criteria, they're fine for the purposes of a regression.
Edit: Adam Musch's suggestion did the trick. I'm appending the performance results with his changes here, since I can't fit them into a comment reply on his answer. This time around I'm also testing against a larger data set; here are the (cached) stats for my original row_number approach, for comparison:
---------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                   |      1 |        |  12624 |00:00:22.34 |   1186K |  931K |       |       |          |
|*  1 |  VIEW                          |                   |      1 |  1163K |  12624 |00:00:22.34 |   1186K |  931K |       |       |          |
|*  2 |   WINDOW NOSORT                |                   |      1 |  1163K |  1213K |00:00:21.82 |   1186K |  931K | 3036M |   17M |          |
|   3 |    TABLE ACCESS BY INDEX ROWID | TWTEST            |      1 |  1163K |  1213K |00:00:20.41 |   1186K |  931K |       |       |          |
|*  4 |     INDEX RANGE SCAN           | TWTEST_URLCONTEXT |      1 |  1163K |  1213K |00:00:00.81 |    8568 |     0 |       |       |          |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"<=10)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "CONTEXTID" ORDER BY NULL )<=10)
   4 - access("SHORTURL"=:P1)
I've taken the liberty of simplifying Adam's suggestion a bit. Here's the modified query...
select
  *
from
  twtest
where
  rowid in (
    select
      rowid
    from
      (
        select
          rowid,
          shorturl,
          row_number() over (partition by shorturl, contextid
                             order by null) rn
        from
          twtest
      )
    where
      rn <= 10
      and shorturl in (?)
  );
...and the stats from its (cached) evaluation:
--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |      1 |        |  12624 |00:00:01.33 |   19391 |       |       |          |
|   1 |  NESTED LOOPS                |                   |      1 |      1 |  12624 |00:00:01.33 |   19391 |       |       |          |
|   2 |   VIEW                       | VW_NSO_1          |      1 |  1163K |  12624 |00:00:01.27 |    6770 |       |       |          |
|   3 |    HASH UNIQUE               |                   |      1 |      1 |  12624 |00:00:01.27 |    6770 | 1377K | 1377K | 5065K (0)|
|*  4 |     VIEW                     |                   |      1 |  1163K |  12624 |00:00:01.25 |    6770 |       |       |          |
|*  5 |      WINDOW NOSORT           |                   |      1 |  1163K |  1213K |00:00:01.09 |    6770 |  283M | 5598K |          |
|*  6 |       INDEX RANGE SCAN       | TWTEST_URLCONTEXT |      1 |  1163K |  1213K |00:00:00.40 |    6770 |       |       |          |
|   7 |   TABLE ACCESS BY USER ROWID | TWTEST            |  12624 |      1 |  12624 |00:00:00.04 |   12621 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("RN"<=10)
   5 - filter(ROW_NUMBER() OVER ( PARTITION BY "SHORTURL","CONTEXTID" ORDER BY NULL )<=10)
   6 - access("SHORTURL"=:P1)

Note
-----
   - dynamic sampling used for this statement (level=2)
As advertised, we only hit the dailylogdata table (twtest in this test setup) for the fully filtered rows. I'm concerned that it still appears to scan the entire urlcontext index, judging by the number of rows it claims to be selecting (1213K), but given that it uses only 6770 buffers (a number that stays constant even when I increase the count of context-specific results), this may be misleading.
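In case it's useful for reproducing these numbers: row-source statistics in this shape can be captured by running the statement with the gather_plan_statistics hint and then dumping the cursor's actual plan with dbms_xplan.display_cursor. A minimal example (the literal stands in for the bind variable):

-- run the statement of interest with the hint...
select /*+ gather_plan_statistics */ count(*)
from twtest
where shorturl in ('example');

-- ...then, from the same session, fetch its actual row-source statistics
-- (in SQL*Plus, keep serveroutput off so this targets the right cursor)
select *
from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));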