I am writing a report that will use imported data to provide a list of missing sequences:
CREATE TABLE `client_trans`
(
`id` INT NOT NULL AUTO_INCREMENT,
`client_id` INT NULL,
`sequence` INT NULL,
`other_data` INT NULL,
PRIMARY KEY (`id`),
INDEX `client_id_seq` (`client_id` ASC, `sequence` ASC)
);
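Apart from the id field, there are no truly unique values, and not even a unique combination of values.
The data for this table looks like the following (ignoring the other_data field):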
id  client_id  sequence
1   1000       1
2   1000       2
3   1000       2
4   1000       3
5   1001       1
6   1001       5
7   1001       6
8   1002       4
9   1002       6
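As the example above shows, there can be multiple rows with the same client_id/sequence combination, and a client's sequences do not necessarily start at 1 (or at 0).
Although it is possible to run a query that finds the missing sequences, such as the various answers to this question, doing so can take a long time.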
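For reference, an on-the-fly gap query of the kind mentioned above might look like the following sketch. It is not taken from any of those answers. It uses an in-memory SQLite database through Python's sqlite3 module purely so that it is runnable, and the table is a cut-down copy of client_trans. The self-join on a range condition is the part that can become slow on a large table.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client_trans ("
    "id INTEGER PRIMARY KEY, client_id INT, sequence INT)"
)
rows = [
    (1, 1000, 1), (2, 1000, 2), (3, 1000, 2), (4, 1000, 3),
    (5, 1001, 1), (6, 1001, 5), (7, 1001, 6),
    (8, 1002, 4), (9, 1002, 6),
]
conn.executemany("INSERT INTO client_trans VALUES (?, ?, ?)", rows)

# For every (client_id, sequence), find the next higher sequence of the same
# client; a gap exists when that next value is more than one step away.
GAP_QUERY = """
SELECT t1.client_id,
       t1.sequence + 1      AS miss_start,
       MIN(t2.sequence) - 1 AS miss_end
FROM client_trans t1
JOIN client_trans t2
  ON t2.client_id = t1.client_id
 AND t2.sequence > t1.sequence
GROUP BY t1.client_id, t1.sequence
HAVING MIN(t2.sequence) > t1.sequence + 1
ORDER BY t1.client_id, miss_start
"""

gaps = conn.execute(GAP_QUERY).fetchall()
print(gaps)  # [(1001, 2, 4), (1002, 5, 5)]
```

Duplicate client_id/sequence rows collapse in the GROUP BY, and nothing is reported below a client's lowest sequence, so a client that starts at 4 has no gap 1-3.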
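An alternative to this approach is to run some insert/update queries before or while the data is inserted into the table (using the Pentaho Data Integration tool) and to keep an additional table holding the missing client_id/sequence values. In the example above, this means that when the (client_id, sequence) values (1001, 5) are inserted, sequences 2-4 are found to be missing, using a query similar to the one I worked out below: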
CREATE TABLE `missing_sequences` (
`client_id` int(11),
`miss_start` int(11),
`miss_end` int(11)
);
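(Note that this is written so the query can be tested more easily in plain SQL rather than in a Pentaho Execute SQL statement, and the INSERT is commented out, so it is just a SELECT.)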
SET @temp_id = 1001;
SET @temp_seq = 5;
/* Replace the @temp_id and @temp_seq references with ? in Pentaho */
/* INSERT INTO missing_sequences (client_id, miss_start, miss_end) */
SELECT @temp_id id, max(t1.sequence) + 1 missing_start, @temp_seq - 1 missing_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_id
AND t1.sequence < @temp_seq
AND t2.client_id = @temp_id
AND t2.sequence >= @temp_seq - 1
HAVING missing_end >= missing_start;
Result:
id    missing_start  missing_end
1001  2              4
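The same lower-neighbour lookup can also be pictured without the cross join. The sketch below is a simplification, not the query above: gap_below is a hypothetical helper name, and SQLite (via Python's sqlite3) stands in for MySQL only so that the example is runnable. It assumes the table holds the rows that arrived before (1001, 5).

```python
import sqlite3

# In-memory SQLite stands in for MySQL (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client_trans ("
    "id INTEGER PRIMARY KEY, client_id INT, sequence INT)"
)
conn.executemany(
    "INSERT INTO client_trans (client_id, sequence) VALUES (?, ?)",
    [(1000, 1), (1000, 2), (1000, 2), (1000, 3), (1001, 1)],
)


def gap_below(conn, client_id, seq):
    """Hypothetical helper: return (miss_start, miss_end) between the closest
    lower sequence of this client and seq, or None if there is no gap."""
    (prev,) = conn.execute(
        "SELECT MAX(sequence) FROM client_trans "
        "WHERE client_id = ? AND sequence < ?",
        (client_id, seq),
    ).fetchone()
    # No lower row at all, or the lower row is directly adjacent: no gap.
    if prev is None or prev + 1 > seq - 1:
        return None
    return (prev + 1, seq - 1)


new_gap = gap_below(conn, 1001, 5)
print(new_gap)  # (2, 4)
```

Like the query above, this only looks downwards from the incoming sequence, so it says nothing about rows that arrive out of order.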
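This populates the missing_sequences table successfully up to a point, but the problem arises when a row is added that contains one of the previously missing sequences.
(Originally I also had a primary index on client_id and miss_start, which would also take care of duplicate values being added, but I am not entirely sure whether that is correct.)
Depending on the sequence number being inserted, one of four possibilities applies, for example: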
@temp_seq = missing_start : (@temp_seq = 2)
update missing_start += 1
missing_start < @temp_seq < missing_end : (@temp_seq = 3)
split into two records
@temp_seq = missing_end : (@temp_seq = 4)
update missing_end -= 1
@temp_seq = missing_start = missing_end : (@temp_id = 1002, @temp_seq = 5)
delete record from missing_sequences table
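The four cases above can be sketched as a small pure function, independent of MySQL and Pentaho. The name apply_sequence and the tuple representation of a gap are assumptions made for illustration only; a sequence that falls outside the gap (for example a duplicate of an existing row) leaves the gap unchanged.

```python
def apply_sequence(gap, seq):
    """Return the gaps that remain after seq arrives.

    gap is a (miss_start, miss_end) tuple; the result is a list holding
    zero, one or two such tuples.
    """
    start, end = gap
    if seq < start or seq > end:
        # Not a previously missing value: nothing to change.
        return [gap]
    if start == end:
        # seq = miss_start = miss_end: the record is deleted.
        return []
    if seq == start:
        # seq = miss_start: miss_start += 1
        return [(start + 1, end)]
    if seq == end:
        # seq = miss_end: miss_end -= 1
        return [(start, end - 1)]
    # miss_start < seq < miss_end: split into two records.
    return [(start, seq - 1), (seq + 1, end)]


print(apply_sequence((2, 4), 2))  # [(3, 4)]
print(apply_sequence((2, 4), 3))  # [(2, 2), (4, 4)]
print(apply_sequence((2, 4), 4))  # [(2, 3)]
print(apply_sequence((5, 5), 5))  # []
```

All four cases give the same result as removing the matching record and re-adding whichever of (miss_start, seq - 1) and (seq + 1, miss_end) is non-empty, which may be easier to express as a DELETE followed by conditional INSERTs.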
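This is where my problem lies (or even earlier, if you take into account that the imported data may not be sorted):
How can I cater for each of these possibilities, as well as the initial insert and duplicates, in a Pentaho Data Integration transformation?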
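Edit: After some brainstorming I came up with the following script, which seems to work correctly when run in MySQL but not when run through "Execute SQL statement". This is with a primary index on the missing_sequences table of (client_id, miss_start):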
SET @orig_start = 0;
SET @orig_end = 0;
SET @temp_client_id = ?;
SET @temp_sequence = ?;
/* Find closest matching record and save start/end values*/
SELECT client_id, @orig_start:=miss_start miss_start, @orig_end:=miss_end miss_end
FROM missing_sequences
WHERE client_id = @temp_client_id
AND miss_start <= @temp_sequence
AND miss_end >= @temp_sequence
LIMIT 1; /* Just in case, delete all matches later anyway */
/* Delete the above record if exists */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id AND miss_start = @orig_start AND miss_end = @orig_end;
/* Insert new value. This will insert the FIRST value in the table
eg. if 1-10 is missing and 5 inserted, this will insert 1-4 as missing */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_start := max(t1.sequence) + 1 miss_start, @curr_end := @temp_sequence - 1 miss_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_client_id
AND t1.sequence < @temp_sequence
AND t2.client_id = @temp_client_id
AND t2.sequence >= @temp_sequence - 1
HAVING miss_end >= miss_start
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;
/* Insert upper missing value if it is different */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_end + 2 miss_start, @orig_end miss_end
FROM dual
WHERE @curr_end + 2 <= @orig_end
ON DUPLICATE KEY UPDATE client_id = @temp_client_id, miss_start = @curr_start;
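The "Execute for each row" and "Variable substitution" boxes are checked, but the execution seems to be inconsistent, or the missing_sequences table is not updated at all.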