I have already looked at another question on a similar topic, but it did not solve the problem I am facing.

I have two tables:

users (id, name)

projects (id, user_id, image, inserted)

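Because of a bug in the Flash application, the projects table contains many duplicates (a single project was inserted several times). Consecutive copies are only a few seconds apart (less than 10 seconds), and that gap is the only way to identify duplicates (a user can add an unlimited number of projects, but creating one takes at least a minute).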

How can I select and delete the duplicates (and keep the original)?

Edit:

The solution Robin Castlin posted below is almost there, but this query:

SELECT p2.id
FROM project AS p
INNER JOIN project AS p2
ON p.id != p2.id AND p.user_id = p2.user_id AND 
    ABS(TIME_TO_SEC(TIMEDIFF(p.inserted, p2.inserted))) <= 10
GROUP BY p2.id

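selects all the copies (if a user added a project 5 times, it gives me 5 IDs). So let me turn the question around: how do I select everything from that group except the first/last one? Or only the first/last one?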

2 Answers

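CREATE TEMPORARY TABLE tmp_project (
    p1_id INT,
    p2_id INT
)
-- The aliases must match the column names above; otherwise MySQL
-- treats the selected columns as additional ones.
SELECT p.id AS p1_id, p2.id AS p2_id
FROM project AS p
INNER JOIN project AS p2
ON p.user_id = p2.user_id AND
    ABS(TIME_TO_SEC(TIMEDIFF(p.inserted, p2.inserted))) <= 10;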

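-- MySQL cannot reference a TEMPORARY table twice in one query, so the
-- first id per group goes into its own table (tmp_first is an illustrative name).
CREATE TEMPORARY TABLE tmp_first
SELECT MIN(p2_id) AS first_id
FROM tmp_project
GROUP BY p1_id;

SELECT p2_id
FROM tmp_project
WHERE p2_id NOT IN (SELECT first_id FROM tmp_first)
GROUP BY p2_id;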

I made it a bit more complex now. Since we need to use the same rows twice to filter out the first occurrence, I create a temporary table and work from there. I join all the cases, even a row with itself, and then filter using NOT IN and GROUP BY p1_id.
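
As a rough sketch of the logic described above (not the answerer's code), here is the same pair-then-filter idea in Python. The sample rows, the `duplicate_ids` name and the choice of the lowest id as the "original" of each group are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical sample rows: (id, user_id, inserted)
rows = [
    (1, 7, datetime(2013, 4, 11, 8, 0, 0)),
    (2, 7, datetime(2013, 4, 11, 8, 0, 3)),
    (3, 7, datetime(2013, 4, 11, 8, 0, 6)),
    (4, 7, datetime(2013, 4, 11, 8, 5, 0)),
    (5, 9, datetime(2013, 4, 11, 8, 0, 1)),
]


def duplicate_ids(rows, window=10):
    # Step 1: every (p1_id, p2_id) pair of the same user within `window`
    # seconds, including each row paired with itself (like tmp_project).
    pairs = [
        (a_id, b_id)
        for a_id, a_user, a_time in rows
        for b_id, b_user, b_time in rows
        if a_user == b_user and abs((a_time - b_time).total_seconds()) <= window
    ]
    # Step 2: for each p1_id, treat the smallest p2_id as the original
    # (assumption: lower id means inserted earlier).
    first_for = {}
    for p1, p2 in pairs:
        first_for[p1] = min(first_for.get(p1, p2), p2)
    originals = set(first_for.values())
    # Step 3: any id that is never an original is reported as a duplicate.
    return sorted({p2 for _, p2 in pairs} - originals)


print(duplicate_ids(rows))
```

With these sample rows it reports ids 2 and 3 as duplicates and leaves 1, 4 and 5 alone.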

This solution could also be used if the image data were identical for duplicates:

Shouldn't the image field be identical in these cases?

SELECT id
FROM project
WHERE id NOT IN (   SELECT MIN(id)
                    FROM project
                    GROUP BY image, user_id)

This would get you a list of all the duplicates that aren't the first one in the table.


Then take those IDs and simply

DELETE FROM project WHERE id IN (id1, id2, id3, ...)
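
To illustrate the last two steps together, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in (an assumption; the question is about MySQL), hypothetical sample rows, and takes the lowest id of each (image, user_id) group as the row to keep.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE project (id INTEGER PRIMARY KEY, user_id INTEGER, image TEXT)"
)
conn.executemany(
    "INSERT INTO project (id, user_id, image) VALUES (?, ?, ?)",
    [(1, 7, "a.png"), (2, 7, "a.png"), (3, 7, "a.png"), (4, 7, "b.png"), (5, 9, "a.png")],
)

# Rows that are not the lowest id within their (image, user_id) group.
duplicate_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM project "
        "WHERE id NOT IN (SELECT MIN(id) FROM project GROUP BY image, user_id) "
        "ORDER BY id"
    )
]

if duplicate_ids:
    # One placeholder per id, so the values are bound instead of pasted into the SQL.
    placeholders = ", ".join("?" for _ in duplicate_ids)
    conn.execute(f"DELETE FROM project WHERE id IN ({placeholders})", duplicate_ids)
    conn.commit()

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM project ORDER BY id")]
print(duplicate_ids, remaining_ids)
```

Binding the ids through placeholders avoids building the IN list by string concatenation. The `?` marker is SQLite's; MySQL drivers for Python typically use `%s` instead.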
answered 2013-04-11T08:18:26.380
Take the difference between two consecutive times.

If the difference is 10 seconds or less [per your post], don't add it.
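
A minimal sketch of that consecutive-difference idea, in Python rather than SQL so it can be run directly. The row layout `(id, user_id, inserted)`, the sample data and the `consecutive_duplicates` name are assumptions for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical sample rows: (id, user_id, inserted)
rows = [
    (1, 7, datetime(2013, 4, 11, 8, 0, 0)),
    (2, 7, datetime(2013, 4, 11, 8, 0, 5)),
    (3, 7, datetime(2013, 4, 11, 8, 0, 12)),
    (4, 7, datetime(2013, 4, 11, 8, 3, 0)),
    (5, 9, datetime(2013, 4, 11, 8, 0, 1)),
    (6, 9, datetime(2013, 4, 11, 8, 0, 8)),
]


def consecutive_duplicates(rows, window=timedelta(seconds=10)):
    duplicates = []
    # Sort by user, then time, so each user's rows are adjacent and in order.
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[0]))
    for _, user_rows in groupby(ordered, key=lambda r: r[1]):
        previous_time = None
        for row_id, _, inserted in user_rows:
            # A row is a duplicate if it follows the previous row of the
            # same user by no more than `window`.
            if previous_time is not None and inserted - previous_time <= window:
                duplicates.append(row_id)
            previous_time = inserted
    return duplicates


print(consecutive_duplicates(rows))
```

Because each row is compared only with the one right before it, id 3 is flagged even though it is 12 seconds after id 1: it is 7 seconds after id 2, so the whole burst collapses to its first row.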

Here is a question that can help you calculate the time difference accurately.

How to split time and calculate the time difference in SQL Server 2005?

answered 2013-04-11T08:13:46.730