sql - Partitioning a table, but grouping rows together based on multiple columns

Question

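I have an interesting problem of splitting a table into groups. I have a set of tourists, each of whom speaks one language and/or belongs to a family. I need to split the table into groups, but I want to keep families and people who speak the same language together.

Say I want to split the tourists into groups of at most 3 (it is acceptable for a group to be larger if it has to be). The solution does not have to be clever enough to fill every group completely, but I am aiming for a best effort.

Input: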

TouristID | LanguageID | FamilyID
---------------------------------
    1     |     1      |    1
    2     |     1      |    1
    3     |     1      |    1
    4     |     2      |    1
    5     |     3      |    2
    6     |     4      |    2
    7     |     5      |    3
    8     |     5      |    4
    9     |     7      |    5

Desired result:

TouristID | GroupID
-------------------
    1     |    1
    2     |    1
    3     |    1
    4     |    1
    5     |    2
    6     |    2
    7     |    3
    8     |    3
    9     |    2

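Group 1 consists of everyone who speaks language 1, plus one family member who cannot be left out.

Group 2 consists of two family members (5, 6) and one random member (9), making a group of 3.

Group 3 consists of two people who speak the same language (7, 8).

What I have done: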

INSERT TouristGroup
SELECT
  t.TouristID,
  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
FROM Tourists t
CROSS APPLY (
  SELECT MIN(TouristID) AS [GroupID]
  FROM Tourists t2
  WHERE
    ( t2.LanguageID = t.LanguageID
    OR t2.FamilyID = t.FamilyID )
) x;

INSERT Groups
SELECT GroupID, COUNT(*)
FROM TouristGroup
GROUP BY GroupID;

declare 
  @matchID int = 0,
  @currentCount int,
  @desiredCount int = 0,
  @candidateGroupID int = null,
  @chunk int = 1

while exists (
  select null
  from Groups g
  left join Matches m
    on m.GroupID = g.GroupID
  where m.GroupID is null
)
begin
  set @currentCount = null
  set @candidateGroupID = null

  select
    @currentCount = isnull(SUM([Count]), 0)
  from Matches m
  join Groups g
    on g.GroupID = m.GroupID
  where m.MatchID = @matchID

  if @CurrentCount is not null
  begin
    set @desiredCount = @chunk - @desiredCount

    select top 1
      @candidateGroupID = g.GroupID
    from Groups g
    left join Matches m
      on m.GroupID = g.GroupID
    where g.[Count] <= @desiredCount
      and m.GroupID is null
    order by [Count] DESC

    if @candidateGroupID is not null
    begin
      insert Matches
      select @matchID, @candidateGroupID
    end
    else begin
      set @matchID = @matchID + 1
    end
  end
  else begin
    set @matchid = @matchID + 1
  end
end

Question

Is there a better approach to partitioning a table while keeping rows grouped together based on multiple columns?

score 1 · Accepted Answer

This will produce your "step 1". It may be better than what you have now (no loop).

SELECT t.TouristID, DENSE_RANK() OVER (ORDER BY x.GroupNum) AS GroupID
FROM Tourists t
CROSS APPLY (SELECT MIN(t2.TouristID) AS GroupNum
             FROM Tourists t2
             WHERE t2.LanguageID = t.LanguageID OR t2.FamilyID = t.FamilyID
            ) x
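
To try the query above against the sample data from the question, here is a minimal setup sketch. The column types and keys are assumptions (the question shows only column names and values), and the table names Tourists and TouristGroup are taken from the question's own code.

```sql
-- Assumed schema: the question does not show the actual table definitions.
CREATE TABLE Tourists (
    TouristID  int NOT NULL PRIMARY KEY,
    LanguageID int NOT NULL,
    FamilyID   int NOT NULL
);
CREATE TABLE TouristGroup (
    TouristID int NOT NULL PRIMARY KEY,
    GroupID   int NOT NULL
);

-- Sample rows copied from the question's input table.
INSERT INTO Tourists (TouristID, LanguageID, FamilyID)
VALUES (1,1,1),(2,1,1),(3,1,1),(4,2,1),(5,3,2),(6,4,2),(7,5,3),(8,5,4),(9,7,5);

-- Step 1: each tourist is keyed by the smallest TouristID that shares
-- a language or a family with them; DENSE_RANK turns the keys into 1..n.
INSERT INTO TouristGroup (TouristID, GroupID)
SELECT t.TouristID, DENSE_RANK() OVER (ORDER BY x.GroupNum)
FROM Tourists t
CROSS APPLY (SELECT MIN(t2.TouristID) AS GroupNum
             FROM Tourists t2
             WHERE t2.LanguageID = t.LanguageID OR t2.FamilyID = t.FamilyID) x;

SELECT TouristID, GroupID FROM TouristGroup ORDER BY TouristID;
-- Expected on the sample data: tourists 1-4 -> 1, 5-6 -> 2, 7-8 -> 3, 9 -> 4
```

Note that MIN only looks at direct matches. A chain, such as tourist A sharing a language with B while B shares a family with C, is not guaranteed to end up in a single group. The sample data happens not to contain such a chain.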

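As for your other requirement of getting at least three members per group where possible, you will probably have to use a loop similar to the one you are already running (I'm not sure whether it can be improved, since you haven't shared it).

[Update] Here is my suggestion for "step 2":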

DECLARE @MinGroupSize int = 3, @rc int = 1
WHILE @rc>0
BEGIN
    WITH GroupCount AS (
    SELECT GroupID, COUNT(*) AS GroupCount
    FROM TouristGroup
    GROUP BY GroupID
    ), CandidateGroups AS (
    SELECT TOP 1 gc1.GroupID AS ShortGroupId, singleton.GroupID as SingletonGroupID
    FROM GroupCount gc1
    CROSS APPLY (SELECT TOP 1 GroupID
                 FROM GroupCount AS gc2
                 WHERE gc2.GroupCount = 1 AND gc2.GroupID != gc1.GroupID
                 ORDER BY gc2.GroupID
                 ) AS singleton
    WHERE gc1.GroupCount < @MinGroupSize
    ORDER BY GroupCount DESC, gc1.GroupID ASC
    )
    UPDATE tg
    SET GroupID = cg.ShortGroupID
    FROM TouristGroup tg
    JOIN CandidateGroups cg ON cg.SingletonGroupID = tg.GroupID;
    SET @rc = @@ROWCOUNT;
END
--
-- If you're anal like me and want to eliminate gaps in GroupID values
--
UPDATE tg
SET GroupID = tg2.GroupID
FROM TouristGroup tg
JOIN (SELECT TouristID,  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
      FROM TouristGroup) AS tg2 ON tg2.TouristID = tg.TouristID
WHERE tg.GroupID != tg2.GroupID;

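This finds a group smaller than the desired minimum group size, finds a singleton group (one with only 1 member), and updates the singleton with the other group's GroupID, doing this one pair at a time until there are no more candidates. Short groups are picked in order (GroupCount descending, then GroupID ascending) so that the larger groups are filled first. Only singletons are selected for the update, so natural groups are never broken up.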
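
To see which pair a single pass of the loop would pick, the candidate selection can be run on its own, without TOP 1 or the UPDATE. This is only a sketch: the table variable @TouristGroup is a hypothetical stand-in, filled with what step 1 yields for the question's sample data.

```sql
-- Hypothetical stand-in for TouristGroup after step 1 on the sample data.
DECLARE @TouristGroup TABLE (TouristID int PRIMARY KEY, GroupID int NOT NULL);
INSERT INTO @TouristGroup (TouristID, GroupID)
VALUES (1,1),(2,1),(3,1),(4,1),(5,2),(6,2),(7,3),(8,3),(9,4);

DECLARE @MinGroupSize int = 3;

-- List every (short group, singleton) pair in the order the loop considers them.
WITH GroupCount AS (
    SELECT GroupID, COUNT(*) AS GroupCount
    FROM @TouristGroup
    GROUP BY GroupID
)
SELECT gc1.GroupID       AS ShortGroupId,
       gc1.GroupCount,
       singleton.GroupID AS SingletonGroupID
FROM GroupCount gc1
CROSS APPLY (SELECT TOP 1 gc2.GroupID
             FROM GroupCount AS gc2
             WHERE gc2.GroupCount = 1 AND gc2.GroupID != gc1.GroupID
             ORDER BY gc2.GroupID) AS singleton
WHERE gc1.GroupCount < @MinGroupSize
ORDER BY gc1.GroupCount DESC, gc1.GroupID ASC;
-- Expected rows: (2, 2, 4) then (3, 2, 4); the loop's TOP 1 takes the first,
-- so singleton group 4 (tourist 9) is merged into group 2.
```

After that single merge no singleton remains, so the next pass updates zero rows and the loop exits. The result is groups {1, 2, 3, 4}, {5, 6, 9} and {7, 8}, which matches the desired result in the question.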
