I am writing a report that will use imported data to produce a list of missing sequences:

 CREATE  TABLE `client_trans` 
 (
   `id` INT NOT NULL AUTO_INCREMENT,
   `client_id` INT NULL,
   `sequence` INT NULL,
   `other_data` INT NULL,
   PRIMARY KEY (`id`),
   INDEX `client_id_seq` (`client_id` ASC, `sequence` ASC) 
 );

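Apart from the id field, there are no truly unique values, not even a unique combination of values.

The data in this table looks like this (ignoring the other_data field):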

id  client_id sequence
1   1000      1
2   1000      2
3   1000      2
4   1000      3
5   1001      1
6   1001      5
7   1001      6
8   1002      4
9   1002      6

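As the example above shows, a client_id/sequence combination may occur more than once, and a sequence need not start at 1 (or at 0).

Although it is possible to run a query to find the missing sequences, as in the various answers to this question, that can take a long time.

An alternative is to run some insert/update queries before or while the data is inserted into the table (using the Pentaho Data Integration tool), maintaining an additional table that holds the missing client_id/sequence values. In the example above, this means that when the (client_id, sequence) value (1001, 5) is inserted, the missing sequences 2-4 are found with a query like the one I worked out below: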
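For reference, the list the report needs can be sketched outside the database. This is only an illustration in Python, not the query-based approach itself; the function name `find_missing_ranges` is hypothetical, and it assumes a gap means any run of integers absent between the smallest and largest sequence seen for a client.

```python
from collections import defaultdict


def find_missing_ranges(rows):
    """Return {client_id: [(miss_start, miss_end), ...]} for the gaps between
    the smallest and largest sequence seen for each client (inclusive ranges)."""
    seen = defaultdict(set)
    for client_id, sequence in rows:
        seen[client_id].add(sequence)  # the set absorbs duplicate rows

    gaps = {}
    for client_id, sequences in seen.items():
        ordered = sorted(sequences)
        # Compare each sequence with its successor; a difference above 1 is a gap.
        client_gaps = [
            (low + 1, high - 1)
            for low, high in zip(ordered, ordered[1:])
            if high - low > 1
        ]
        if client_gaps:
            gaps[client_id] = client_gaps
    return gaps


# (client_id, sequence) pairs from the sample data in the question
rows = [
    (1000, 1), (1000, 2), (1000, 2), (1000, 3),
    (1001, 1), (1001, 5), (1001, 6),
    (1002, 4), (1002, 6),
]
print(find_missing_ranges(rows))  # {1001: [(2, 4)], 1002: [(5, 5)]}
```

Client 1000 has no gaps, so it does not appear in the result.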

CREATE TABLE `missing_sequences` (
  `client_id` int(11),
  `miss_start` int(11),
  `miss_end` int(11)
);

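(Note that, to make the query easier to test directly in SQL rather than in Pentaho's "Execute SQL statement" step, the INSERT is commented out, so it is just a SELECT.)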

SET @temp_id = 1001;    
SET @temp_seq = 5;
/* Replace temp_id, temp_seq references with ? in Pentaho */
/* INSERT INTO missing_sequences (client_id, miss_start, miss_end) */
SELECT @temp_id id, max(t1.sequence) + 1 missing_start, @temp_seq - 1 missing_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_id
  AND t1.sequence < @temp_seq
  AND t2.client_id = @temp_id
  AND t2.sequence >= @temp_seq - 1
HAVING missing_end >= missing_start;

Result:

id       missing_start        missing_end
1001     2                    4
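The calculation that query performs can be restated as a small Python sketch. The function name `gap_below` is hypothetical, and it assumes the only inputs are the sequences already stored for one client plus the sequence being inserted. It mirrors the query's logic: take the largest existing sequence below the new one and report everything in between.

```python
def gap_below(existing_sequences, new_sequence):
    """Return the inclusive (missing_start, missing_end) range directly below
    new_sequence, or None when nothing is missing there."""
    lower = [s for s in existing_sequences if s < new_sequence]
    if not lower:
        return None  # no smaller sequence exists, so no gap can be measured
    missing_start = max(lower) + 1
    missing_end = new_sequence - 1
    # Same filter as the HAVING clause: keep only non-empty ranges.
    if missing_end >= missing_start:
        return (missing_start, missing_end)
    return None


print(gap_below([1], 5))     # (2, 4), the 1001 case from the question
print(gap_below([4], 6))     # (5, 5), the 1002 case
print(gap_below([1, 2], 3))  # None, because 3 directly follows 2
```

Like the query, this only looks below the inserted value, so it does not by itself cover rows that arrive out of order.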

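This populates the missing_sequences table successfully up to a point, but the problem arises when a row is added that contains one of the previously missing sequences.
(Originally I also had a primary key on client_id and miss_start, which would also take care of duplicate values being added, but I am not entirely sure that is correct.)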

Depending on the sequence number being inserted, one of four cases applies, for example:

@temp_seq = missing_start : (@temp_seq = 2) 
    update missing_start += 1
missing_start < @temp_seq < missing_end : (@temp_seq = 3)
    split into two records
@temp_seq = missing_end : (@temp_seq = 4)
    update missing_end -= 1
@temp_seq = missing_start = missing_end : (@temp_id = 1002, @temp_seq = 5)
    delete record from missing_sequences table
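The four cases above can be sketched as a single function. This is a minimal illustration in Python, not a Pentaho transformation; the name `fill_sequence` is hypothetical, and it assumes the gaps for one client are held as inclusive, non-overlapping (miss_start, miss_end) pairs.

```python
def fill_sequence(gaps, sequence):
    """Return the gap list for one client after `sequence` has arrived."""
    updated = []
    for start, end in gaps:
        if not (start <= sequence <= end):
            updated.append((start, end))           # gap not affected
        elif start == end:
            continue                               # one-value gap is filled: delete it
        elif sequence == start:
            updated.append((start + 1, end))       # trim the lower edge
        elif sequence == end:
            updated.append((start, end - 1))       # trim the upper edge
        else:
            updated.append((start, sequence - 1))  # split into two records
            updated.append((sequence + 1, end))
    return updated


print(fill_sequence([(2, 4)], 2))  # [(3, 4)]
print(fill_sequence([(2, 4)], 3))  # [(2, 2), (4, 4)]
print(fill_sequence([(2, 4)], 4))  # [(2, 3)]
print(fill_sequence([(5, 5)], 5))  # []
```

A sequence that is not inside any recorded gap, such as a duplicate of an existing row, leaves the list unchanged.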

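This is where my problem lies (or earlier, if you consider that the imported data may not be sorted):
How do I cater for each of these possibilities, as well as the initial insert and duplicates, in a Pentaho Data Integration transformation?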

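Edit: After some brainstorming, I came up with the following script, which appears to work when run in MySQL but not when run as an "Execute SQL statement" step. It assumes a primary key on (client_id, miss_start) in the missing_sequences table: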

SET @orig_start = 0;
SET @orig_end = 0;

SET @temp_client_id = ?;
SET @temp_sequence = ?;

/* Find closest matching record and save start/end values*/
SELECT client_id, @orig_start:=miss_start miss_start, @orig_end:=miss_end miss_end
FROM missing_sequences 
WHERE client_id = @temp_client_id
  AND miss_start <= @temp_sequence
  AND miss_end >= @temp_sequence
LIMIT 1; /* Just in case, delete all matches later anyway */

/* Delete the above record if exists */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id AND miss_start = @orig_start AND miss_end = @orig_end;

/* Insert new value. This will insert the FIRST value in the table
   eg. if 1-10 is missing and 5 inserted, this will insert 1-4 as missing */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_start := max(t1.sequence) + 1 miss_start, @curr_end := @temp_sequence - 1 miss_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_client_id
  AND t1.sequence < @temp_sequence
  AND t2.client_id = @temp_client_id
  AND t2.sequence >= @temp_sequence - 1
HAVING miss_end >= miss_start
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;

/* Insert upper missing value if it is different */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_end + 2 missing_start, @orig_end missing_end
FROM dual
WHERE @curr_end + 2 <= @orig_end
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;

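The step is set to execute for each row, with the variable substitution box checked, but execution seems inconsistent, or the missing_sequences table is not updated at all.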
