您指定
如何编写可以选择与特定订单至少 85% 相似的所有订单的查询?
与“至少 85% 相似的所有订单对”相比,这是一个重要的简化。
我们将使用一些 TDQD(测试驱动查询设计)和一些分析来帮助我们。
预赛
要远程相似,两个订单必须至少有一个共同点。此查询可用于确定哪些订单与指定订单至少有一个共同点:
SELECT DISTINCT I1.OrderID AS ID
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
这会大量删除要检查的其他订单列表,但如果指定的订单包含您最受欢迎的商品之一,则很可能很多其他订单也这样做了。
代替 DISTINCT,您可以使用:
SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
这为您提供了与指定订单相同的订单中的项目数。我们还需要每个订单中的商品数量:
SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
GROUP BY OrderID;
相同的订单
对于 100% 的相似性,两个订单的共同项目与每个订单的项目数量一样多。不过,这可能不会找到很多订单对。我们可以很容易地找到与指定订单具有完全相同项目的订单:
SELECT L1.ID
FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
GROUP BY OrderID
) AS L1
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;
编辑:事实证明这不够严格;为了使订单相同,指定订单中的商品数量也必须与公共数量相同:
SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total
FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
GROUP BY OrderID
) AS L1
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common
JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS L3 ON L2.Num_Common = L3.Num_Total;
相似的顺序——分析公式
将
Wikipedia 中定义的Jaccard 相似性应用于两个订单 A 和 B,其中 |A| 作为订单 A 中项目数的计数,Jaccard 相似度J(A,B) = |A∩B| ÷ |A∪B| , 其中 |A∩B| 是两个订单共有的项目数,|A∪B| 是订购的不同项目的总数。
为了满足 85% 的 Jaccard 相似性标准,如果任一订单中的项目数小于某个阈值,则订单必须相同。例如,如果订单 A 和 B 都有 5 件商品,但两者之间有一件商品不同,那么它会给您 4 件共同的商品(|A∩B|)和总共 6 件商品(|A∪B|) ,所以 Jaccard 相似度 J(A,B) 只有 66⅔%。
对于 85% 的相似性,当两个订单中的每个订单中有 N 个项目且 1 个项目不同时,(N-1) ÷ (N+1) ≥ 0.85,这意味着 N > 12(准确地说是 12⅓)。对于分数 F = J(A,B),一项不同的意思是(N-1) ÷ (N+1) ≥ F可以解决 N 给出N ≥ (1 + F) ÷ (1 - F)。随着相似性要求的提高,对于越来越大的 N 值,阶数必须相同。
进一步概括,假设我们有 N 和 M 个项目的不同大小订单(不失一般性,N < M)。|A∩B|的最大值 现在是 N 和 |A∪B| 的最小值 是M(意思是小顺序的所有项目都以大的顺序出现)。让我们定义 M = N + Δ,并且有 ∂ 项以较小的顺序出现,而在较大的顺序中不存在。由此可见,有 ∆+∂ 项以较大的顺序出现,而没有以较小的顺序出现。
根据定义,|A∩B| = N-∂和|A∪B| = (N-∂) + ∂ + (N+Δ-(N-∂)),其中三个相加的项表示(1)两个订单共有的项目数,(2)仅在较小的顺序,以及 (3) 仅在较大顺序中的项目数。这简化为:|A∪B| = N+Δ+∂。
关键方程
对于相似性分数 F,我们对 J(A,B) ≥ F 的顺序对感兴趣,因此:
(N-∂) ÷ (N+Δ+∂) ≥ F
F ≤ (N-∂) ÷ (N+Δ+∂)
我们可以使用电子表格来绘制这些之间的关系。对于给定数量的较小顺序(x 轴)的项目,并且对于给定的相似度,我们可以绘制 ∂ 的最大值,它给出了 F 的相似度。公式是:
∂ = (N(1-F) - FΔ) ÷ (1+F)
这是常数 F 的 N 和 Δ 的线性方程;对于不同的 F 值,它是非线性的。显然,∂ 必须是非负整数。
给定 F = 0.85,对于相同大小的订单 (Δ=0),对于 1 ≤ N < 13,∂ = 0;对于 13 ≤ N < 25,∂ ≤ 1;对于 25 ≤ N < 37,∂ ≤ 2,对于 37 ≤ N < 50,∂ ≤ 3。
对于相差 1 (Δ=1) 的阶数,对于 1 ≤ N < 18,∂ = 0;对于 18 ≤ N < 31,∂ ≤ 1;对于 31 ≤ N < 43,∂ ≤ 2;等等。如果 ∆=6,您需要 N=47 才能使订单与 ∂=1 有 85% 的相似度。这意味着小订单有 47 件商品,其中 46 件与 53 件商品的大订单相同。
类似订单——应用分析
到目前为止,一切都很好。我们如何应用该理论来选择与指定订单相似的订单?
首先,我们观察到指定的订单可能与类似订单的大小相同,或者更大或更小。这使事情变得有点复杂。
上式的参数为:
- N - 较小顺序的项目数
- ∆ - 大订单项数与 N 之间的差异
- F——固定
- ∂ — 较小顺序中未匹配较大顺序的项目数
使用顶部开发的查询的微小变化可用的值:
- NC——共同项目的数量
- NA — 指定顺序的项目数
- NB - 比较顺序中的项目数
对应查询:
SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID;
SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID;
SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
为方便起见,我们希望值 N 和 N+Δ(因此也有 Δ)可用,因此我们可以使用 UNION 来适当地安排事物,其中:
- NS = N - 较小顺序的项目数
- NL = N + Δ - 较大顺序的项目数
在 UNION 查询的第二个版本中,使用:
这两个查询都保留了两个订单 ID 号,以便您以后可以追踪到其余的订单信息。
SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA <= v2.NB
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA > v2.NB
这为我们提供了一个包含 OrderID_1、NS、OrderID_2、NL 列的表表达式,其中 NS 是小订单中的项目数,NL 是大订单中的项目数。由于 v1 和 v2 表表达式生成的订单号没有重叠,因此无需担心 OrderID 值相同的“反身”条目。在 UNION 查询中添加 NC 也是最容易处理的:
SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA <= v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA > v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v1.ID
这为我们提供了一个包含 OrderID_1、NS、OrderID_2、NL、NC 列的表表达式,其中 NS 是小订单中的项目数,NL 是大订单中的项目数,NC 是小订单中的项目数常见的。
给定 NS、NL、NC,我们正在寻找满足:
(N-∂) ÷ (N+Δ+∂) ≥ F。
- N - 较小顺序的项目数
- ∆ - 大订单项数与 N 之间的差异
- F——固定
∂ — 较小顺序中未匹配较大顺序的项目数
NS = N - 较小顺序的项目数
- NL = N + Δ - 较大顺序的项目数
- NC = N - ∂ — 共有项目数
因此,条件必须是:
NC / (NL + (NS - NC)) ≥ F
LHS 上的项必须作为浮点数计算,而不是整数表达式。将其应用于上面的 UNION 查询,产生:
SELECT OrderID_1, NS, OrderID_2, NL, NC,
CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA <= v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM (SELECT OrderID AS ID, COUNT(*) AS NA
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
) AS v1
JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
) AS v2
ON v1.NA > v2.NB
JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
) AS v3
ON v3.ID = v1.ID
) AS u
WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
您可能会观察到此查询仅使用 OrderItem 表;不需要 Order 和 Item 表。
警告:部分测试的 SQL(警告讲师)。上面的 SQL 现在似乎可以在微小的数据集上产生似是而非的答案。我调整了相似性要求(0.25,然后是 0.55)并得到了合理的值和适当的选择性。但是,我的测试数据最大的顺序只有 8 个项目,而且肯定没有涵盖所描述数据的全部范围。由于我最常使用的 DBMS 不支持 CTE,因此下面的 SQL 未经测试。但是,我有一定的信心,除非我犯了一个大错误,否则版本 1 中的 CTE 代码(指定订单 ID 有很多重复)应该是干净的。我认为第 2 版也可能没问题,但是……它未经测试。
可能有更紧凑的方式来表达查询,可能使用 OLAP 函数。
如果我要对此进行测试,我将创建一个包含几个具有代表性的订单项集的表,检查返回的相似性度量是否合理。我会或多或少地处理查询,逐渐建立复杂的查询。如果其中一个表达式被证明有缺陷,那么我会对该查询进行适当的调整,直到缺陷得到修复。
显然,性能将是一个问题。最里面的查询不是非常复杂,但也不是微不足道的。然而,测量将显示这是一个严重的问题还是只是一个麻烦。研究查询计划可能会有所帮助。OrderItem.OrderID 上应该有一个索引似乎很可能;如果没有这样的索引,查询不太可能执行良好。这不太可能成为问题,因为它是一个外键列。
您可能会从使用“WITH 子句”(公用表表达式)中获得一些好处。他们将明确表示 UNION 子查询的两半中隐含的重复。
使用公用表表达式
当表达式相同时,使用公用表表达式可以向优化器澄清,并可能有助于它更好地执行。它们还帮助人们阅读您的查询。上面的查询确实要求使用 CTE。
版本一:重复指定订单号
WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA -- Specified Order (SO)
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
),
OO AS (SELECT OrderID AS ID, COUNT(*) AS NB -- Other orders (OO)
FROM OrderItem
WHERE OrderID != <specified order ID>
GROUP BY OrderID
),
CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC -- Common Items (CI)
FROM OrderItem AS I1
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
WHERE I1.OrderID != <specified order ID>
GROUP BY I1.OrderID
)
SELECT OrderID_1, NS, OrderID_2, NL, NC,
CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA <= v2.NB
JOIN CI AS v3 ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA > v2.NB
JOIN CI AS v3 ON v3.ID = v1.ID
) AS u
WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
版本 2:避免重复指定的订单号
WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA -- Specified Order (SO)
FROM OrderItem
WHERE OrderID = <specified order ID>
GROUP BY OrderID
),
OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB -- Other orders (OO)
FROM OrderItem AS OI
JOIN SO ON OI.OrderID != SO.ID
GROUP BY OI.OrderID
),
CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC -- Common Items (CI)
FROM OrderItem AS I1
JOIN SO AS S1 ON I1.OrderID != S1.ID
JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID
JOIN SO AS S2 ON I2.OrderID = S2.ID
GROUP BY I1.OrderID
)
SELECT OrderID_1, NS, OrderID_2, NL, NC,
CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA <= v2.NB
JOIN CI AS v3 ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
FROM SO AS v1
JOIN OO AS v2 ON v1.NA > v2.NB
JOIN CI AS v3 ON v3.ID = v1.ID
) AS u
WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
这些都不是易读的。两者都比完整写出 CTE 的大 SELECT 更容易。
最少的测试数据
这不足以进行良好的测试。它提供了一点点信心(它确实显示了“相同顺序”查询的问题。
CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE Item (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE OrderItem
(
OrderID INTEGER NOT NULL REFERENCES Order,
ItemID INTEGER NOT NULL REFERENCES Item,
Quantity DECIMAL(8,2) NOT NULL
);
INSERT INTO Order VALUES(1);
INSERT INTO Order VALUES(2);
INSERT INTO Order VALUES(3);
INSERT INTO Order VALUES(4);
INSERT INTO Order VALUES(5);
INSERT INTO Order VALUES(6);
INSERT INTO Order VALUES(7);
INSERT INTO Item VALUES(111);
INSERT INTO Item VALUES(222);
INSERT INTO Item VALUES(333);
INSERT INTO Item VALUES(444);
INSERT INTO Item VALUES(555);
INSERT INTO Item VALUES(666);
INSERT INTO Item VALUES(777);
INSERT INTO Item VALUES(888);
INSERT INTO Item VALUES(999);
INSERT INTO OrderItem VALUES(1, 111, 1);
INSERT INTO OrderItem VALUES(1, 222, 1);
INSERT INTO OrderItem VALUES(1, 333, 1);
INSERT INTO OrderItem VALUES(1, 555, 1);
INSERT INTO OrderItem VALUES(2, 111, 1);
INSERT INTO OrderItem VALUES(2, 222, 1);
INSERT INTO OrderItem VALUES(2, 333, 1);
INSERT INTO OrderItem VALUES(2, 555, 1);
INSERT INTO OrderItem VALUES(3, 111, 1);
INSERT INTO OrderItem VALUES(3, 222, 1);
INSERT INTO OrderItem VALUES(3, 333, 1);
INSERT INTO OrderItem VALUES(3, 444, 1);
INSERT INTO OrderItem VALUES(3, 555, 1);
INSERT INTO OrderItem VALUES(3, 666, 1);
INSERT INTO OrderItem VALUES(4, 111, 1);
INSERT INTO OrderItem VALUES(4, 222, 1);
INSERT INTO OrderItem VALUES(4, 333, 1);
INSERT INTO OrderItem VALUES(4, 444, 1);
INSERT INTO OrderItem VALUES(4, 555, 1);
INSERT INTO OrderItem VALUES(4, 777, 1);
INSERT INTO OrderItem VALUES(5, 111, 1);
INSERT INTO OrderItem VALUES(5, 222, 1);
INSERT INTO OrderItem VALUES(5, 333, 1);
INSERT INTO OrderItem VALUES(5, 444, 1);
INSERT INTO OrderItem VALUES(5, 555, 1);
INSERT INTO OrderItem VALUES(5, 777, 1);
INSERT INTO OrderItem VALUES(5, 999, 1);
INSERT INTO OrderItem VALUES(6, 111, 1);
INSERT INTO OrderItem VALUES(6, 222, 1);
INSERT INTO OrderItem VALUES(6, 333, 1);
INSERT INTO OrderItem VALUES(6, 444, 1);
INSERT INTO OrderItem VALUES(6, 555, 1);
INSERT INTO OrderItem VALUES(6, 777, 1);
INSERT INTO OrderItem VALUES(6, 888, 1);
INSERT INTO OrderItem VALUES(6, 999, 1);
INSERT INTO OrderItem VALUES(7, 111, 1);
INSERT INTO OrderItem VALUES(7, 222, 1);
INSERT INTO OrderItem VALUES(7, 333, 1);
INSERT INTO OrderItem VALUES(7, 444, 1);
INSERT INTO OrderItem VALUES(7, 555, 1);
INSERT INTO OrderItem VALUES(7, 777, 1);
INSERT INTO OrderItem VALUES(7, 888, 1);
INSERT INTO OrderItem VALUES(7, 999, 1);
INSERT INTO OrderItem VALUES(7, 666, 1);