
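First I'll explain what I need to do, then how I think I can achieve it. My current plan seems very inefficient in theory, so my question is whether there is a better way to accomplish it.

I have 2 tables, let's call them "Products" and "Products_Temp"; both are identical. I need to download large files (XML or XLS) containing product details (stock, pricing, etc.) from suppliers. These are then parsed into the Products_Temp table. Right now I plan to use a CF scheduled task to handle the download and Navicat to do the actual parsing; I'm happy that this is efficient enough.

The next step is where I'm struggling: once a file has been downloaded and parsed, I need to look for any changes in the data. This will be compared against the Products table. If a change is found, the row should be added or updated (if it should be removed, I need to flag it rather than just delete it). Once all the data has been compared, the Products_Temp table should be emptied.

I'm aware of ways to compare tables and synchronise them accordingly; the problem I have is that I'll be dealing with multiple files from different sources. I had considered using just the Products table and appending/updating, but I'm not sure how I would manage the "deleted flag" requirement.

Right now, the only way I know I could make this work is to loop over the Products_Temp table, run various cfquery calls and delete each row once it is done. However, that seems hugely inefficient, and given that we may be dealing with hundreds of thousands of rows, it is unlikely to be workable if we update everything daily.

Any pointers or advice on a better route would be appreciated!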


3 Answers


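To find the changes, I would look at a join on the fields you want to match. This could be slow, depending on the number of fields and whether they are indexed, but I'd still say it is faster than looping. Something like the following: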

SELECT product_id
FROM Products
WHERE product_id NOT IN (
    SELECT T.product_id
    FROM Products_Temp T
    INNER JOIN Products P
    ON (
        P.field1 = T.field1
        AND P.field2 = T.field2
        ...
    )
)

And for missing products, an outer join to find the rows with no match:

SELECT P.product_id
FROM Products P
LEFT OUTER JOIN Products_Temp T
ON (P.field1 = T.field1
    AND P.field2 = T.field2
    ...)
WHERE T.product_id IS NULL

answered 2012-05-22T13:39:38.427

Both answers are viable. Just to expand on your options a little:

Option #1

If MySQL supports some sort of hashing on a per-row basis, you could use a variation of comodoro's suggestion to avoid hard deletes.

Identify Changed

To identify changes, do an inner join on the primary key and check the hash values. If they are different, the product was changed and should be updated:

    UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.ProductName = tmp.ProductName
           , p.Stock = tmp.Stock
           , ...
           , p.DateLastChanged = now()
           , p.IsDiscontinued  = 0
    WHERE  tmp.TheRowHash <> p.TheRowHash
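
The comparison above assumes every row already carries a hash column (TheRowHash in the query). As a rough sketch of how such a value could be produced while loading the temp table, the Python below computes a digest over a row's fields. The row_hash name, the choice of SHA-256, and the encoding scheme are illustrative assumptions rather than part of this answer; MySQL's own hash functions, such as MD5(), could do the same job inside the database.

```python
import hashlib

def row_hash(values):
    """Return a hex digest that changes whenever any field value changes.

    Hypothetical helper: each field is length-prefixed so that
    ("ab", "c") and ("a", "bc") produce different digests, and None
    (SQL NULL) is encoded differently from an empty string.
    """
    h = hashlib.sha256()
    for v in values:
        if v is None:
            h.update(b"N;")                  # marker for NULL
        else:
            data = str(v).encode("utf-8")
            h.update(b"S%d:" % len(data))    # length prefix, then the bytes
            h.update(data)
    return h.hexdigest()

old = row_hash(["Widget", 10, None])
new = row_hash(["Widget", 9, None])
print(old != new)  # the stock value differs, so the digests differ
```

Whatever scheme is used, the same function and field order have to be applied to both tables, otherwise every row will look changed.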

Identify Deleted

Use a simple outer join to identify records that do not exist in the temp table, and flag them as "deleted".

    UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.DateLastChanged = now()
           , p.IsDiscontinued = 1
    WHERE  tmp.ProductID IS NULL

Identify New

Finally, use a similar outer join to insert any "new" products.

    INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, .. )
    SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
    FROM   Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
    WHERE  p.ProductID IS NULL
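
To see the three statements work together, here is a small self-contained sketch that runs the same logic against an in-memory SQLite database from Python. SQLite is only a stand-in so the example runs anywhere; it does not accept MySQL's UPDATE ... JOIN syntax, so the two updates are rewritten with subqueries. The table layout, the sync() helper, the hash strings, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY, ProductName TEXT, Stock INTEGER,
    TheRowHash TEXT, DateLastChanged TEXT, IsDiscontinued INTEGER DEFAULT 0);
CREATE TABLE Products_Temp (
    ProductID INTEGER PRIMARY KEY, ProductName TEXT, Stock INTEGER,
    TheRowHash TEXT);
"""

# Changed: same key in both tables, but the hash differs.
CHANGED = """
UPDATE Products SET
    ProductName = (SELECT t.ProductName FROM Products_Temp t WHERE t.ProductID = Products.ProductID),
    Stock       = (SELECT t.Stock       FROM Products_Temp t WHERE t.ProductID = Products.ProductID),
    TheRowHash  = (SELECT t.TheRowHash  FROM Products_Temp t WHERE t.ProductID = Products.ProductID),
    DateLastChanged = datetime('now'),
    IsDiscontinued  = 0
WHERE EXISTS (SELECT 1 FROM Products_Temp t
              WHERE t.ProductID = Products.ProductID
                AND t.TheRowHash <> Products.TheRowHash)
"""

# Deleted: the key no longer appears in the temp table, so flag it.
DELETED = """
UPDATE Products SET DateLastChanged = datetime('now'), IsDiscontinued = 1
WHERE NOT EXISTS (SELECT 1 FROM Products_Temp t
                  WHERE t.ProductID = Products.ProductID)
"""

# New: the key exists only in the temp table.
NEW = """
INSERT INTO Products (ProductID, ProductName, Stock, TheRowHash, DateLastChanged, IsDiscontinued)
SELECT t.ProductID, t.ProductName, t.Stock, t.TheRowHash, datetime('now'), 0
FROM Products_Temp t LEFT JOIN Products p ON t.ProductID = p.ProductID
WHERE p.ProductID IS NULL
"""

def sync(conn):
    """Apply changed, deleted and new rows, then empty the temp table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(CHANGED)
        conn.execute(DELETED)
        conn.execute(NEW)
        conn.execute("DELETE FROM Products_Temp")

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO Products (ProductID, ProductName, Stock, TheRowHash) VALUES (?, ?, ?, ?)",
    [(1, "Bolt", 5, "h1"), (2, "Nut", 7, "h2"), (3, "Gear", 1, "h3")])
conn.executemany(
    "INSERT INTO Products_Temp VALUES (?, ?, ?, ?)",
    [(1, "Bolt", 5, "h1"), (2, "Nut", 9, "h2b"), (4, "Cog", 3, "h4")])

sync(conn)
rows = conn.execute(
    "SELECT ProductID, Stock, IsDiscontinued FROM Products ORDER BY ProductID").fetchall()
print(rows)  # [(1, 5, 0), (2, 9, 0), (3, 1, 1), (4, 3, 0)]
```

Product 1 is untouched, product 2 picks up the new stock level, product 3 is flagged rather than deleted, and product 4 is inserted. Each step is a single set-based statement, so nothing loops over rows in application code.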

Option #2

If per-row hashing is not feasible, an alternative approach is a variation of Sharondio's suggestion.

Add a "status" column to the temp table and flag all imported records as "new", "changed" or "unchanged" through a series of joins. (The default should be "changed").

Identify UN-Changed

First use an inner join, on all fields, to identify products that have NOT changed. (Note: if your table contains any nullable fields, remember to use something like coalesce. Otherwise, the results may be skewed because null values are not equal to anything.)

    UPDATE  Products_Temp tmp INNER JOIN Products p ON tmp.ProductID = p.ProductID
    SET     tmp.Status = 'Unchanged'
    WHERE   p.ProductName = tmp.ProductName
    AND     p.Stock = tmp.Stock
    ... 
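
The warning about nullable fields is easy to demonstrate. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (an assumption, made so the snippet is self-contained), and the sample row is invented. Two identical rows with a NULL Stock fail to match under a plain `=`, but match once both sides are wrapped in COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products      (ProductID INTEGER, ProductName TEXT, Stock INTEGER);
    CREATE TABLE Products_Temp (ProductID INTEGER, ProductName TEXT, Stock INTEGER);
    INSERT INTO Products      VALUES (1, 'Bolt', NULL);
    INSERT INTO Products_Temp VALUES (1, 'Bolt', NULL);
""")

# Count the rows that the "unchanged" join would match.
JOIN = """
    SELECT COUNT(*)
    FROM Products_Temp tmp INNER JOIN Products p ON tmp.ProductID = p.ProductID
    WHERE p.ProductName = tmp.ProductName AND {stock_test}
"""

# Plain equality: NULL = NULL is not true, so the identical row is missed.
plain = conn.execute(JOIN.format(stock_test="p.Stock = tmp.Stock")).fetchone()[0]

# COALESCE maps NULL to a sentinel first; -1 is assumed never to be a real stock level.
safe = conn.execute(JOIN.format(
    stock_test="COALESCE(p.Stock, -1) = COALESCE(tmp.Stock, -1)")).fetchone()[0]

print(plain, safe)  # 0 1
```

With plain equality the row would be left at the default status and needlessly rewritten on every import. MySQL also has a NULL-safe equality operator, `<=>`, which avoids having to pick a sentinel value.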

Identify New

Like before, use an outer join to identify "new" records.

    UPDATE  Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
    SET     tmp.Status = 'New'
    WHERE   p.ProductID IS NULL

By process of elimination, all other records in the temp table are "changed". Once you have calculated the statuses, you can update the Products table:

    /*  update changed products */
    UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.ProductName = tmp.ProductName
           , p.Stock = tmp.Stock
           , ...
           , p.DateLastChanged = now()
           , p.IsDiscontinued = 0
    WHERE  tmp.Status = 'Changed'

    /*  insert new products */
    INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, .. )
    SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
    FROM   Products_Temp tmp
    WHERE  tmp.Status = 'New'

    /* flag deleted records */
    UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
    SET    p.DateLastChanged = now()
           , p.IsDiscontinued = 1
    WHERE  tmp.ProductID IS NULL

answered 2012-05-23T08:32:46.127

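I once had to solve a similar problem, so maybe the solution applies to your case (I don't know ColdFusion well). Why not, for each source, delete everything in the Products table that belongs to that source and replace it with the contents of Products_Temp from the same source? This assumes you can create a unique field for each source. The SQL would look like this: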

DELETE FROM Products WHERE source_id = x;
INSERT INTO Products (field1, field2, ..., source_id)
  SELECT field1, field2, ..., x FROM Products_Temp;
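
A sketch of this delete-and-reload idea, wrapped in a transaction so that a failure midway does not leave a source half-deleted. It runs against in-memory SQLite from Python purely so the example is self-contained; the replace_source() helper, the column set, and the sample data are assumptions.

```python
import sqlite3

def replace_source(conn, source_id):
    """Swap every Products row of one source for the contents of Products_Temp."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM Products WHERE source_id = ?", (source_id,))
        conn.execute(
            """INSERT INTO Products (name, stock, source_id)
               SELECT name, stock, ? FROM Products_Temp""",
            (source_id,),
        )
        conn.execute("DELETE FROM Products_Temp")  # ready for the next file

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products      (name TEXT, stock INTEGER, source_id INTEGER);
    CREATE TABLE Products_Temp (name TEXT, stock INTEGER);
    INSERT INTO Products VALUES ('Bolt', 5, 1), ('Nut', 7, 1), ('Gear', 2, 2);
    INSERT INTO Products_Temp VALUES ('Bolt', 4), ('Cog', 9);
""")

replace_source(conn, 1)
rows = conn.execute(
    "SELECT name, stock, source_id FROM Products ORDER BY source_id, name").fetchall()
print(rows)  # [('Bolt', 4, 1), ('Cog', 9, 1), ('Gear', 2, 2)]
```

Rows from source 2 are untouched. Note that 'Nut' is physically removed here, so on its own this route does not meet the question's requirement to flag removed products rather than delete them.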

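Also, if the sources do not change very often, you could consider hashing each file after download and skipping the update when nothing has changed, to save some database access.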
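
A minimal sketch of that check in Python, assuming the downloaded file is available as bytes. The needs_import() helper, the supplier key, and the in-memory dictionary of digests are hypothetical; in practice the last digest per source would have to be persisted somewhere, for example in a small table.

```python
import hashlib

def needs_import(source_id, data, seen):
    """Return True if this payload differs from the last one seen for the source.

    `seen` maps source_id -> hex digest of the most recent payload and is
    updated in place when the payload is new.
    """
    digest = hashlib.sha256(data).hexdigest()
    if seen.get(source_id) == digest:
        return False          # byte-for-byte identical file: skip the import
    seen[source_id] = digest
    return True

seen = {}
payload = b"<products>...</products>"
first = needs_import("supplier_a", payload, seen)
second = needs_import("supplier_a", payload, seen)
print(first, second)  # True False
```

This only detects byte-identical files, so a supplier that embeds a timestamp in every export would defeat it. It is also safer to record the digest only after the import has succeeded, which this sketch does not do.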

answered 2012-05-22T08:24:30.327