sql - Group all related records in a many-to-many relationship, SQL graph connected components

Question

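Hopefully I'm just missing a simple solution.

I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between the two is many-to-many. What I would like to do is bundle or group all of the companies in table A that have any relationship to a publisher in table B, and vice versa.

The final result would look something like this (GROUPID is the key field). Rows 1 and 2 are in the same group because they share the same company. Row 3 is in the same group because publisher Y was already mapped to company A. Row 4 is in the group because company B was already mapped to group 1 through publisher Y.

Put simply, whenever there is any kind of shared relationship between a company and a publisher, that pair should be assigned to the same group.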

ROW   GROUPID     Company     Publisher
1     1           A           Y
2     1           A           X
3     1           B           Y
4     1           B           Z
5     2           C           W
6     2           C           P
7     2           D           W

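SQL Fiddle

Update:
My bounty version: given the table in the fiddle above of simple Company and Publisher pairs, populate the GROUPID field shown above. Think of it as creating a Family ID that encompasses all related parents/children.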

SQL Server 2012

score 12 · Accepted Answer

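I thought about using a recursive CTE, but, as far as I know, SQL Server does not allow UNION to connect the anchor member and the recursive member of a recursive CTE (I think it can be done in PostgreSQL), so it is not possible to eliminate duplicates.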

declare @i int;  -- the statement before WITH must end with a semicolon

-- seed: give every row its own GroupID
with cte as (
     select
         GroupID,
         row_number() over(order by Company) as rn
     from Table1
)
update cte set GroupID = rn

select @i = @@rowcount

-- while some rows updated
while @i > 0
begin
    update T1 set
        GroupID = T2.GroupID
    from Table1 as T1
        inner join (
            select T2.Company, min(T2.GroupID) as GroupID
            from Table1 as T2
            group by T2.Company
        ) as T2 on T2.Company = T1.Company
    where T1.GroupID > T2.GroupID

    select @i = @@rowcount

    update T1 set
        GroupID = T2.GroupID
    from Table1 as T1
        inner join (
            select T2.Publisher, min(T2.GroupID) as GroupID
            from Table1 as T2
            group by T2.Publisher
        ) as T2 on T2.Publisher = T1.Publisher
    where T1.GroupID > T2.GroupID

    -- will be > 0 if any rows updated
    select @i = @i + @@rowcount
end

;with cte as (
     select
         GroupID,
         dense_rank() over(order by GroupID) as rn
     from Table1
)
update cte set GroupID = rn

sql fiddle demo
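
For readers who want to trace the loop above without a SQL Server instance, here is a minimal Python sketch of the same idea: give every row its own label, then keep replacing each label with the smallest one found among rows that share a company or a publisher, until nothing changes. The function name propagate_min_labels and the in-memory list of pairs are assumptions made for illustration, and rows are numbered in input order rather than by row_number() over Company.

```python
def propagate_min_labels(pairs):
    """Label each (company, publisher) row with a dense group number."""
    # Start with a unique label per row, like the row_number() seed in the SQL.
    labels = list(range(1, len(pairs) + 1))
    changed = True
    while changed:
        changed = False
        # Pass 0 spreads the minimum label per company, pass 1 per publisher.
        for key_index in (0, 1):
            lowest = {}
            for row, label in zip(pairs, labels):
                key = row[key_index]
                lowest[key] = min(label, lowest.get(key, label))
            for i, row in enumerate(pairs):
                if labels[i] > lowest[row[key_index]]:
                    labels[i] = lowest[row[key_index]]
                    changed = True
    # Renumber the surviving labels 1, 2, 3, ... like dense_rank().
    rank = {label: n for n, label in enumerate(sorted(set(labels)), start=1)}
    return [rank[label] for label in labels]


# Sample data from the question, in the same row order.
rows = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
        ("C", "W"), ("C", "P"), ("D", "W")]
print(propagate_min_labels(rows))  # [1, 1, 1, 1, 2, 2, 2]
```

On the question's sample data this reproduces the GROUPID column. The number of passes grows roughly with the longest chain of company/publisher links, so run time depends heavily on the shape of the data.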

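I've also tried a breadth-first search algorithm. I thought it could be faster (it is better in terms of complexity), so I'll provide that solution here. I've found that it's not faster than the SQL approach, though: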

declare @Company nvarchar(2), @Publisher nvarchar(2), @GroupID int

declare @Queue table (
    Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
    primary key(Company, Publisher)
)

select @GroupID = 0

while 1 = 1
begin
    select top 1 @Company = Company, @Publisher = Publisher
    from Table1
    where GroupID is null

    if @@rowcount = 0 break

    select @GroupID = @GroupID + 1

    insert into @Queue(Company, Publisher)
    select @Company, @Publisher

    while 1 = 1
    begin
        select top 1 @Company = Company, @Publisher = Publisher
        from @Queue
        order by ID asc

        if @@rowcount = 0 break

        update Table1 set
            GroupID = @GroupID
        where Company = @Company and Publisher = @Publisher

        delete from @Queue where Company = @Company and Publisher = @Publisher

        ;with cte as (
            select Company, Publisher from Table1 where Company = @Company and GroupID is null
            union all
            select Company, Publisher from Table1 where Publisher = @Publisher and GroupID is null
        )
        insert into @Queue(Company, Publisher)
        select distinct c.Company, c.Publisher
        from cte as c
        where not exists (select * from @Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
    end
end

sql fiddle demo
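
The queue-driven loop above can be sketched in Python as follows. This is an illustration under assumptions, not a port: the name bfs_groups is hypothetical, the two dictionaries stand in for indexes on Company and Publisher, and a row is marked when it is enqueued rather than when it is dequeued, which removes the need for the NOT EXISTS check against the queue.

```python
from collections import deque


def bfs_groups(pairs):
    """Return {(company, publisher): group_id} using breadth-first search."""
    # Lookup tables: which rows mention a given company or publisher.
    by_company, by_publisher = {}, {}
    for company, publisher in pairs:
        by_company.setdefault(company, []).append((company, publisher))
        by_publisher.setdefault(publisher, []).append((company, publisher))

    group_of = {}
    group_id = 0
    for start in pairs:
        if start in group_of:
            continue  # already reached from an earlier start row
        group_id += 1
        group_of[start] = group_id
        queue = deque([start])
        while queue:
            company, publisher = queue.popleft()
            # Neighbours are rows sharing this row's company or publisher.
            for row in by_company[company] + by_publisher[publisher]:
                if row not in group_of:
                    group_of[row] = group_id
                    queue.append(row)
    return group_of


rows = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
        ("C", "W"), ("C", "P"), ("D", "W")]
groups = bfs_groups(rows)
print([groups[row] for row in rows])  # [1, 1, 1, 1, 2, 2, 2]
```

Each row is enqueued once, so the work is proportional to the number of rows plus the sizes of the neighbour lists scanned.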

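I've tested my version and Gordon Linoff's to check how they perform. The CTE looks much worse; I couldn't wait for it to complete on more than 1000 rows.

Here's an sql fiddle demo with random data. My results were:
128 rows:
    my RBAR (row-by-row) solution: 190ms
    my SQL solution: 27ms
    Gordon Linoff's solution: 958ms
256 rows:
    my RBAR (row-by-row) solution: 560ms
    my SQL solution: 1226ms
    Gordon Linoff's solution: 45371ms

It's random data, so the results may not be very consistent. I think the timings could be changed by indexes, but I don't think that would change the overall picture.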

Old version - using a temporary table, just calculating GroupID without touching the initial table:

declare @i int

-- creating table to gather all possible GroupID for each row
create table #Temp
(
    Company varchar(1), Publisher varchar(1), GroupID varchar(1),
    primary key (Company, Publisher, GroupID)
)

-- initializing it with data
insert into #Temp (Company, Publisher, GroupID)
select Company, Publisher, Company
from Table1

select @i = @@rowcount

-- while some rows inserted into #Temp
while @i > 0
begin
    -- expand #Temp in both directions
    ;with cte as (
        select
            T2.Company, T1.Publisher,
            T1.GroupID as GroupID1, T2.GroupID as GroupID2
        from #Temp as T1
            inner join #Temp as T2 on T2.Company = T1.Company
        union
        select
            T1.Company, T2.Publisher,
            T1.GroupID as GroupID1, T2.GroupID as GroupID2
        from #Temp as T1
            inner join #Temp as T2 on T2.Publisher = T1.Publisher        
    ), cte2 as (
        select
            Company, Publisher,
            case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
        from cte
    )
    insert into #Temp
    select Company, Publisher, GroupID
    from cte2
    -- don't insert duplicates
    except
    select Company, Publisher, GroupID
    from #Temp

    -- will be > 0 if any row inserted
    select @i = @@rowcount
end

select
    Company, Publisher,
    dense_rank() over(order by min(GroupID)) as GroupID
from #Temp
group by Company, Publisher

=> sql fiddle example

score 6 · Answer

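Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("Companies" and "Publishers") rather than one.

You can solve this with a single recursive CTE. The logic is as follows.

First, convert the problem into a graph with only one type of node. I do this by making the companies the nodes and using the publisher information to create the edges between companies. This is just a join: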

      select t1.company as node1, t2.company as node2
      from table1 t1 join
           table1 t2
           on t1.publisher = t2.publisher

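(For efficiency's sake, you could also add t1.company <> t2.company, but that is not strictly necessary.)

Now, this is a "simple" graph-walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks through the graph using join. Along the way, it keeps a list of all nodes visited. In SQL Server, this needs to be stored in a string.

The code needs to ensure that it doesn't visit a node twice for a given path, because that can result in infinite recursion (and an error). If the join above is called edges, the CTE that generates all pairs of connected nodes looks like: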

     cte as (
      select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
             1 as level
      from edges e
      union all
      select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
      from cte c join
           edges e
           on c.node2 = e.node1 and
              -- skip nodes already on this path; match the whole delimited name
              c.nodes not like '%|'+e.node2+'|%'
     )

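Now, using this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier for the connected subgraph. That is, all companies connected to each other through publishers will have the same minimum.

The final two steps are to enumerate this minimum (as GroupId) and to join the GroupId back to the original data.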
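
The steps described so far (project onto a company-only graph, find the smallest company reachable from each node, then enumerate the minima) can be traced with a small Python sketch. It is only an illustration: company_groups is a hypothetical name, and instead of enumerating every path as the recursive CTE does, it keeps one visited set per starting company.

```python
def company_groups(pairs):
    """Map each company to a dense group id via a company-only graph."""
    by_publisher = {}
    for company, publisher in pairs:
        by_publisher.setdefault(publisher, set()).add(company)

    # Edges: two companies are adjacent when they share a publisher.
    # Self-edges are kept, as in the join without t1.company <> t2.company.
    adjacent = {}
    for companies in by_publisher.values():
        for company in companies:
            adjacent.setdefault(company, set()).update(companies)

    # Walk from every company and remember the smallest company reached.
    smallest = {}
    for start in adjacent:
        seen, stack = {start}, [start]
        while stack:
            for nxt in adjacent[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        smallest[start] = min(seen)

    # Enumerate the minima (the dense_rank step) to get GroupId values.
    rank = {m: n for n, m in enumerate(sorted(set(smallest.values())), start=1)}
    return {company: rank[m] for company, m in smallest.items()}


rows = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
        ("C", "W"), ("C", "P"), ("D", "W")]
print(company_groups(rows))  # {'A': 1, 'B': 1, 'C': 2, 'D': 2}
```

Keeping the self-edges means a company that shares no publisher with any other company still appears as a node and gets its own group.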

The full (and, I might add, tested) query looks like:

with edges as (
      select t1.company as node1, t2.company as node2
      from table1 t1 join
           table1 t2
           on t1.publisher = t2.publisher
     ),
     cte as (
      select e.node1, e.node2,
             cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
             1 as level
      from edges e
      union all
      select c.node1, e.node2,
             c.nodes+e.node2+'|',
             1+c.level
      from cte c join
           edges e
           on c.node2 = e.node1 and
              -- skip nodes already on this path; match the whole delimited name
              c.nodes not like '%|'+e.node2+'|%'
     ),
     nodes as (
       select node1,
              (case when min(node2) < node1 then min(node2) else node1 end
              ) as grp
       from cte
       group by node1
      )
select t.company, t.publisher, grp.GroupId
from table1 t join
     (select n.node1, dense_rank() over (order by grp) as GroupId
      from nodes n
     ) grp
     on t.company = grp.node1;

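Note that this works for finding any connected subgraphs. It does not assume any particular number of levels.

EDIT:

The question of performance here is vexing. At a minimum, the query above will run better with an index on Publisher. Better yet is to take @MikaelEriksson's suggestion and put the edges in a separate table.

Another question is whether you are looking for equivalence classes among companies or among publishers. I took the approach of using Companies, because I think that has better "explainability" (my inclination to respond was based on numerous comments that this could not be done with CTEs).

I'm guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than is provided in the OP. It is quite likely, though, that the best performance will come from a multiple-query approach.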

score 2 · Answer

Here's my solution: SQL Fiddle

The nature of the relationships requires looping, as I figured.

Here's the SQL:

--drop TABLE Table1

CREATE TABLE Table1
    ([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
;

INSERT INTO Table1
    (Company, Publisher)
select
    left(newid(), 2), left(newid(), 2)

declare @i int = 1

while @i < 8
begin
    ;with cte(Company, Publisher) as (
        select
            left(newid(), 2), left(newid(), 2)
        from Table1
    )
    insert into Table1(Company, Publisher)
    select distinct c.Company, c.Publisher
    from cte as c
    where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)

    set @i = @i + 1
end;


CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)

declare @counter int=0
declare @row int=0
declare @lastnullcount int=0
declare @currentnullcount int=0

WHILE EXISTS (
  SELECT *
  FROM Table1
  where GroupID is null
  )
BEGIN
    SET @counter=@counter+1
    SET @lastnullcount =0

    SELECT TOP 1
        @row=[row]
    FROM Table1
    where GroupID is null
    order by [row] asc

    SELECT @currentnullcount=count(*) from table1 where groupid is null
    WHILE @lastnullcount <> @currentnullcount
    BEGIN
        SELECT @lastnullcount=count(*)
        from table1
        where groupid is null 

        UPDATE Table1
        SET GroupID=@counter
        WHERE [row]=@row

        UPDATE t2
        SET t2.GroupID=@counter
        FROM Table1 t1
        INNER JOIN Table1 t2 on t1.Company=t2.Company
        WHERE t1.GroupID=@counter
        AND t2.GroupID IS NULL

        UPDATE t2
        SET t2.GroupID=@counter
        FROM Table1 t1
        INNER JOIN Table1 t2 on t1.publisher=t2.publisher
        WHERE t1.GroupID=@counter
        AND t2.GroupID IS NULL

        SELECT @currentnullcount=count(*)
        from table1
        where groupid is null
    END
END

SELECT * FROM Table1

EDIT: Added the indexes I would expect on a real table, and brought the test data more in line with the other data set Roman is using.

score 0 · Answer

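You are trying to find all of the connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you have to follow to get from one company/publisher to another), you could in principle do it like this: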

SELECT
    MIN(x2.groupID) AS groupID,
    x1.Company,
    x1.Publisher
FROM Table1 AS x1
    INNER JOIN (
        SELECT
            MIN(x2.Company) AS groupID,
            x1.Company,
            x1.Publisher
        FROM Table1 AS x1
            INNER JOIN Table1 AS x2
            ON x1.Publisher = x2.Publisher
        GROUP BY
            x1.Publisher,
            x1.Company
    ) AS x2
    ON x1.Company = x2.Company
GROUP BY
    x1.Publisher,
    x1.Company;

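You would have to nest the subquery (with alternating joins on Company and Publisher, and with the deepest subquery saying MIN(Company) rather than MIN(groupID)) to the maximum iteration depth.

I don't really recommend this, though; it would be cleaner to do this outside of SQL.

Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting capability that lets you do this iteration dynamically.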
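
As one possible reading of "outside of SQL", here is a minimal union-find (disjoint-set) sketch in Python. The answer does not name this technique; it is swapped in here as a common way to compute connected components without knowing the depth in advance. The names find and group_pairs, and the ("company", ...) / ("publisher", ...) tagging, are assumptions for illustration.

```python
def find(parent, node):
    """Return the representative of node's set, compressing the path."""
    root = node
    while parent[root] != root:
        root = parent[root]
    while parent[node] != root:
        parent[node], node = root, parent[node]
    return root


def group_pairs(pairs):
    """Return a group number per row, numbered in order of first appearance."""
    parent = {}
    for company, publisher in pairs:
        # Tag the values so a company and a publisher with the same name differ.
        c, p = ("company", company), ("publisher", publisher)
        parent.setdefault(c, c)
        parent.setdefault(p, p)
        parent[find(parent, c)] = find(parent, p)  # union the two sets

    numbering = {}
    result = []
    for company, _ in pairs:
        root = find(parent, ("company", company))
        result.append(numbering.setdefault(root, len(numbering) + 1))
    return result


rows = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
        ("C", "W"), ("C", "P"), ("D", "W")]
print(group_pairs(rows))  # [1, 1, 1, 1, 2, 2, 2]
```

A single pass over the pairs is enough here, so no maximum nesting depth has to be guessed.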

score 0 · Answer

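A bit late to the challenge, and since SQLFiddle seems to be down at the moment I had to guess your data structures. Nevertheless, it seemed like a fun challenge (and it was =) so here's what I made of it:

Setup: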

IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
GO

CREATE TABLE t_company ( company_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                         company_name   varchar(100) NOT NULL)

GO 

CREATE TABLE t_publisher (publisher_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                          publisher_name   varchar(100) NOT NULL)

CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
                     publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
                                PRIMARY KEY (company_id, publisher_id),
                     group_id int NULL
                             )
GO

-- example content


-- ROW   GROUPID     Company     Publisher
--1     1           A           Y
--2     1           A           X
--3     1           B           Y
--4     1           B           Z
--5     2           C           W
--6     2           C           P
--7     2           D           W


INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')

INSERT t_link (company_id, publisher_id)
SELECT company_id, publisher_id
  FROM t_company, t_publisher
 WHERE (company_name = 'A' AND publisher_name = 'Y')
    OR (company_name = 'A' AND publisher_name = 'X')
    OR (company_name = 'B' AND publisher_name = 'Y')
    OR (company_name = 'B' AND publisher_name = 'Z')
    OR (company_name = 'C' AND publisher_name = 'W')
    OR (company_name = 'C' AND publisher_name = 'P')
    OR (company_name = 'D' AND publisher_name = 'W')




GO

/*
-- volume testing

TRUNCATE TABLE t_link
DELETE t_company
DELETE t_publisher


DECLARE @company_count   int = 1000,
        @publisher_count int = 450,
        @links_count     int = 800


INSERT t_company (company_name)
SELECT company_name    = Convert(varchar(100), NewID())
  FROM master.dbo.fn_int_list(1, @company_count) 

UPDATE STATISTICS t_company

INSERT t_publisher (publisher_name)
SELECT publisher_name  = Convert(varchar(100), NewID())
  FROM master.dbo.fn_int_list(1, @publisher_count) 

UPDATE STATISTICS t_publisher

-- Random links between the companies & publishers

DECLARE @count int
SELECT @count = 0

WHILE @count < @links_count
    BEGIN

        SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
          INTO #link_A
          FROM t_company
         ORDER BY NewID()

        SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
          INTO #link_B
          FROM t_publisher
         ORDER BY NewID()

        INSERT TOP (@links_count - @count) t_link (company_id, publisher_id)
        SELECT A.company_id,
               B.publisher_id
          FROM #link_A A
          JOIN #link_B B
            ON A.row_id = B.row_id
         WHERE NOT EXISTS ( SELECT *
                              FROM t_link old
                             WHERE old.company_id   = A.company_id
                               AND old.publisher_id = B.publisher_id)

        SELECT @count = @count + @@ROWCOUNT

        DROP TABLE #link_A
        DROP TABLE #link_B    
    END

*/

The actual grouping:

IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
GO

-- apply grouping

-- init
SELECT row_id = IDENTITY(int, 1, 1), 
       company_id,
       publisher_id,
       group_id = 0
  INTO #links
  FROM t_link

-- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
CREATE CLUSTERED INDEX idx0 ON #links (row_id)
--CREATE INDEX idx1 ON #links (company_id)   
--CREATE INDEX idx2 ON #links (publisher_id)

UPDATE #links
   SET group_id = row_id


-- start grouping
WHILE @@ROWCOUNT > 0
    BEGIN  
        UPDATE #links
           SET group_id = new_group_id
          FROM #links upd
          CROSS APPLY (SELECT new_group_id = Min(group_id)
                         FROM #links new
                        WHERE new.company_id   = upd.company_id
                           OR new.publisher_id = upd.publisher_id 
                                     ) x
        WHERE upd.group_id > new_group_id

        -- select * from #links
    END


-- remove 'holes'
UPDATE #links
   SET group_id = (SELECT COUNT(DISTINCT o.group_id) 
                          FROM #links o
                         WHERE o.group_id <= upd.group_id)
  FROM #links upd

GO

UPDATE t_link
   SET group_id = new.group_id
  FROM t_link upd
  LEFT OUTER JOIN #links new
               ON new.company_id = upd.company_id
              AND new.publisher_id = upd.publisher_id

GO    
SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
       l.group_id,
       c.company_name, -- c.company_id,
       p.publisher_name -- , p.publisher_id
 from t_link l
 JOIN t_company c
   ON l.company_id = c.company_id
 JOIN t_publisher p 
   ON p.publisher_id = l.publisher_id
 ORDER BY 1

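At first sight this approach hasn't been tried yet by anyone else; it's interesting to see how this can be done in a variety of ways... (I preferred not to read the other answers up front, as that would have spoiled the puzzle =)

The results look as expected (as far as I understand the requirements and the example), and performance isn't too shabby either, although there is no real indication of the number of records this should work on. I'm not sure how it will scale, but I don't expect too many problems either...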

score 0 · Answer

Here is a recursive solution, using XML:

with a as ( -- recursive result, containing shorter subsets and duplicates
    select cast('<c>' + company + '</c>' as xml) as companies
          ,cast('<p>' + publisher + '</p>' as xml) as publishers
      from Table1

    union all

    select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
                                                        sql:column("t.company")))
                          order by $c
                          return <c>{$c}</c>')
          ,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
                                                         sql:column("t.publisher")))
                          order by $p
                          return <p>{$p}</p>')
    from a join Table1 t
      on (   a.companies.exist('/c[text() = sql:column("t.company")]') = 0 
          or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
     and (   a.companies.exist('/c[text() = sql:column("t.company")]') = 1
          or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
    select distinct -- distinct cannot work on xml types, hence cast to nvarchar
           cast(companies as nvarchar) as companies
          ,cast(publishers as nvarchar) as publishers
          ,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
     from a
    where not exists (select 1 from a as s -- s is a proper subset of a
                       where (cast('<s>' + cast(s.companies as varchar)
                                 + '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
                             ).value('if((count(/s/c) > count(/a/c))
                                         and (some $s in /s/c/text() satisfies
                                             (some $a in /a/c/text() satisfies $s = $a))
                                      ) then 1 else 0', 'int') = 1
                     )
      and not exists (select 1 from a as s -- s is a proper subset of a
                       where (cast('<s>' + cast(s.publishers as nvarchar)
                                 + '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
                             ).value('if((count(/s/p) > count(/a/p))
                                         and (some $s in /s/p/text() satisfies
                                             (some $a in /a/p/text() satisfies $s = $a))
                                      ) then 1 else 0', 'int') = 1
                     )
), c as (  -- cast back to xml
    select cast(companies as xml) as companies
          ,cast(publishers as xml) as publishers
          ,groupid
      from b
)
select Co.company.value('(./text())[1]', 'varchar') as company
      ,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
      ,c.groupid
  from c
       cross apply companies.nodes('/c') as Co(company)
       cross apply publishers.nodes('/p') as Pu(publisher)
 where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
               where t.company = Co.company.value('(./text())[1]', 'varchar')
                 and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
             )

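The sets of companies and the sets of publishers are kept in XML fields during the intermediate steps, and some casting between xml and nvarchar is necessary because of limitations in SQL Server (such as not being able to group by, or use distinct on, XML columns).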
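
Stripped of the XML, the underlying idea is to grow a set of companies and a set of publishers side by side. A rough Python analogy, under the assumption that merging incrementally is an acceptable stand-in for the recursion plus subset filtering used above (merge_sets is a hypothetical name):

```python
def merge_sets(pairs):
    """Merge (companies, publishers) set pairs until none overlap."""
    groups = []
    for company, publisher in pairs:
        companies, publishers = {company}, {publisher}
        remaining = []
        for other_c, other_p in groups:
            # Absorb any existing group that shares a company or a publisher.
            if companies & other_c or publishers & other_p:
                companies |= other_c
                publishers |= other_p
            else:
                remaining.append((other_c, other_p))
        groups = remaining + [(companies, publishers)]
    return groups


rows = [("A", "Y"), ("A", "X"), ("B", "Y"), ("B", "Z"),
        ("C", "W"), ("C", "P"), ("D", "W")]
for group_id, (companies, publishers) in enumerate(merge_sets(rows), start=1):
    print(group_id, sorted(companies), sorted(publishers))
# 1 ['A', 'B'] ['X', 'Y', 'Z']
# 2 ['C', 'D'] ['P', 'W']
```

Because the groups already collected never overlap one another, a new pair can only connect groups through its own company or publisher, so one pass over the existing groups per pair is sufficient.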
