sql - How do I convert old data constraints in-order to de-duplicate data while keeping referential integrity

Question

PreNote: aka washing of hands; this is work being done on a Brownfield Project

I have a "ProductLine" table as follows

| ProductLineID (pk) | ProductID (fk) | ResellerID (fk) | Other stuff |
|--------------------|----------------|-----------------|-------------|
| 1                  | 28             | 298818          |    --       |

The current system has a table of product template lines, from which a set of product lines is created and linked to each new reseller when that reseller is created. The idea was that if a reseller wished to edit a product for their organisation, the edited version would be displayed based on their account.

These product lines are used on the sale line table, which is linked to a sale table (which is linked to the cart table).

There are a couple of tables hooked up to the product lines for various reasons.

What I was looking at doing was making a de-duped copy of the product lines and dropping some of the data, so a new line would only be created IF the reseller made a change; thus reducing the table from > 124,000 rows down to 69 (no one has used the functionality in 5 years).

Then, using the old ProductLine table as a reference, I would alter the existing data (the ProductLineIds in the sale line table) to point to the new ProductLineID, by reading the original line's ProductID and finding the new matching line ID (one per product, funnily enough).

I was wondering what the best way to do this would be; a cursor springs to mind but tends to bring out DBAs from far and wide with pitchforks, and I'll probably need to do a similar query on several tables, so the less painful the SQL the better.

Just to make the visualisation a little easier, the sale line looks like this:

| SaleLineId (pk) | SaleID (fk) | ProductLineId (fk) | Price |
|-----------------|-------------|--------------------|-------|
| 1992            | 29          | 10283              | 9.00  |

Extra

I plan to rename the old ProductLine table to LegacyProductLine, then dedupe + insert the product lines from there into a clean ProductLine table.

I then need to replace the ProductLineIds in the SaleLine table (and others) with the new ProductLineId.

The LegacyProductLine table won't know what the ProductLineID is in the ProductLine table; hence I was looking at the ProductID as a way of matching them up, as there are no other matching parameters.


    +-----------------+     +-----------------+          +------------------+
    |LegacyProductLine|     | ProductLine     |          |  SaleLine        |
    |-----------------|     |-----------------|          |------------------|
    |ProductLineId PK |     | ProductLineID PK|          | SaleLineId    PK |
    |ProductName      |     | ProductName     |          | ProductLineId FK |
    |... some stuff   |     | ... Some stuff  |          | Charge           |
    |ResellerID  FK   |     |                 |          |                  |
    |ProductID FK     |     | ProductId       |          |                  |
    |                 |     |                 |          |                  |
    |                 |     |                 |          |                  |
    |                 |     |                 |          |                  |
    |                 |     |                 |          |                  |
    |                 |     |                 |          +------------------+
    |                 |     |                 |
    |                 |     |                 |
    |                 |     |                 |
    |                 |     |                 |
    +-----------------+     +-----------------+
     200K rows               26 Rows
     Mostly Duplicates       Deduped Data

The legacy table is temporary only, for reference, and will be deleted. I need to change the ProductLineID in the SaleLine Table.

The SaleLine table currently contains the ProductLineIds from the legacy table; these need updating to use the ProductLineIds in the ProductLine table.

score 1 · Accepted Answer

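From the sound of it, I'm not sure you even need a loop*. Here is my proposed solution, based on the following assumption:

When you create the new ProductLine (PL) table with the de-duplicated data, you also create a mapping table from NewPL to OldPL (Map_OldPL_NewPL). That makes the problem trivial: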

    UPDATE SalesLine
    SET PLId = Map.NewPLId
    FROM SalesLine
        JOIN Map_OldPL_NewPL AS Map
            ON SalesLine.PLId = Map.OldPLId
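For the setup described in the question, the mapping table could probably be filled by matching on ProductID. This is only a sketch under a few assumptions: the old table has already been renamed to LegacyProductLine, the new ProductLine table holds exactly one row per ProductID, and both IDs are INT. The names Map_OldPL_NewPL, OldPLId and NewPLId follow the update above; everything else is taken from the question's diagram.

```sql
-- Sketch only: the INT column types are assumptions.
CREATE TABLE Map_OldPL_NewPL
(
    OldPLId INT NOT NULL PRIMARY KEY,  -- LegacyProductLine.ProductLineId
    NewPLId INT NOT NULL               -- ProductLine.ProductLineID
);

-- Pair every legacy line with the de-duplicated line for the same product.
-- If ProductLine has more than one row for a ProductID, the primary key
-- on OldPLId makes this insert fail instead of producing an ambiguous map.
INSERT INTO Map_OldPL_NewPL (OldPLId, NewPLId)
SELECT Legacy.ProductLineId, Deduped.ProductLineID
FROM LegacyProductLine AS Legacy
    JOIN ProductLine AS Deduped
        ON Deduped.ProductID = Legacy.ProductID;

-- Legacy lines that found no match; ideally this returns no rows.
SELECT Legacy.ProductLineId, Legacy.ProductID
FROM LegacyProductLine AS Legacy
    LEFT JOIN Map_OldPL_NewPL AS Map
        ON Map.OldPLId = Legacy.ProductLineId
WHERE Map.OldPLId IS NULL;
```

Because the mapping lives in its own table, the same join can be reused for every other table that references the old ProductLineIds.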

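However, please clarify my assumption below, because I'm guessing you are really asking more about how to de-duplicate ProductLine, given how simple this solution is.

*This assumes you already have a mechanism for creating the de-duplicated product line table. However, maybe that is what you are asking about, in which case could you clarify, to keep others from making the same assumption :). In that case I will have to expand my answer :)

Update:

Here is the full answer. You could probably do all of this in one or two queries, but this way you can inspect the mapping table as you go. I am assuming a row is a duplicate if everything except the PK (ProductLineId) is the same. If not, you will need to modify the ROW_NUMBER partition and the update that follows.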

    CREATE TABLE DuplicateMapping
    (
        OldProductLineId INT,
        ProductName VARCHAR(MAX),
        ... ,
        ResellerId INT,
        ProductId INT,
        DuplicateHierarchy INT,
        NewProductLineId INT
    )

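    --DuplicateHierarchy = 1 marks the row that survives in each group of duplicates
    --NewProductLineId starts out equal to the old id and is corrected in the next step
    INSERT INTO DuplicateMapping
    SELECT  ProductLineId AS OldProductLineId, ProductName, ... , ResellerId, ProductId,
        ROW_NUMBER() OVER
            (PARTITION BY ProductName,
                ... , ResellerId, ProductId ORDER BY ProductLineId) AS DuplicateHierarchy,
        ProductLineId AS NewProductLineId
    FROM ProductLine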

    --Point every row at the surviving row (DuplicateHierarchy = 1) of its group
    UPDATE Main
    SET NewProductLineId = Dup.OldProductLineId
    FROM DuplicateMapping AS Main
        JOIN DuplicateMapping AS Dup
            ON Main.ProductName = Dup.ProductName
                AND Main.ResellerId = Dup.ResellerId
                AND Main.ProductId = Dup.ProductId
                ...
                --Do NOT include OldProductLineId, NewProductLineId or DuplicateHierarchy
    WHERE Dup.DuplicateHierarchy = 1
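Since the point of the separate table is that you can inspect it as you go, a couple of quick checks at this stage might look like the following. This is just a sketch against the DuplicateMapping table defined above; the column aliases are made up for illustration.

```sql
-- How many old lines fold into each surviving line.
SELECT NewProductLineId, COUNT(*) AS OldLineCount
FROM DuplicateMapping
GROUP BY NewProductLineId
ORDER BY OldLineCount DESC;

-- Every surviving line should map to itself; ideally this returns no rows.
SELECT OldProductLineId, NewProductLineId
FROM DuplicateMapping
WHERE DuplicateHierarchy = 1
    AND NewProductLineId <> OldProductLineId;
```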

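    --Repoint the sale lines first (and any other table referencing ProductLine);
    --deleting first would leave them referencing rows that no longer exist
    UPDATE SaleLine
    SET ProductLineId = DuplicateMapping.NewProductLineId
    FROM SaleLine
        JOIN DuplicateMapping
            ON SaleLine.ProductLineId = DuplicateMapping.OldProductLineId
    --Without this, you would not cause any harm
    --However, why update the same value over itself
    WHERE DuplicateMapping.DuplicateHierarchy > 1

    DELETE ProductLine
    WHERE EXISTS
    (
        SELECT 1
        FROM DuplicateMapping
        WHERE DuplicateMapping.OldProductLineId = ProductLine.ProductLineId
            AND DuplicateMapping.DuplicateHierarchy > 1
    )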
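Since the goal is to keep referential integrity, it may be worth running the whole thing inside a transaction and checking for orphaned sale lines before committing. The sketch below assumes SQL Server (which the UPDATE ... FROM syntax above suggests) and the SaleLine and ProductLine names from the question. If the foreign key is enforced the DELETE would already fail on orphans, so this check mainly matters when the constraint was dropped or disabled for the migration.

```sql
BEGIN TRANSACTION;

-- Run the mapping, UPDATE and DELETE statements from above at this point.

-- Sale lines pointing at a product line that no longer exists mean the
-- remapping missed something, so undo everything; otherwise keep the changes.
IF EXISTS
(
    SELECT 1
    FROM SaleLine AS SL
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM ProductLine AS PL
        WHERE PL.ProductLineId = SL.ProductLineId
    )
)
    ROLLBACK TRANSACTION;
ELSE
    COMMIT TRANSACTION;
```

The same NOT EXISTS check can be repeated for each of the other tables hooked up to the product lines.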
