mongodb - Fastest way to remove duplicate documents in MongoDB

Question

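I have approximately 1.7 million documents in MongoDB (more than 10 million in the future). Some of them represent duplicate entries that I do not want. The structure of a document is something like this: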

{
    _id: 14124412,
    nodes: [
        12345,
        54321
        ],
    name: "Some beauty"
}

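A document is a duplicate if it has at least one node in common with another document that has the same name. What is the fastest way to remove the duplicates?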

score 91 · Accepted Answer

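The dropDups: true option is not available in 3.0.

I have a solution using the aggregation framework that collects the duplicates and then removes them in one go.

It might be somewhat slower than a system-level "index" change, but it is worth considering how you want to remove the duplicate documents.

a. Remove all documents in one go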

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: { 
    _id: { name: "$name"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }},
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to temporary files on disk instead of failing at the memory limit
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )
})

// If you want to Check all "_id" which you are deleting else print statement not needed
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})

b. You can delete the documents one by one.

db.collectionName.aggregate([
  // discard selection criteria, You can remove "$match" section if you want
  { $match: { 
    "source_references.key": { "$ne": '' }  // dotted paths must be quoted
  }},
  { $group: { 
    _id: { key: "$source_references.key" }, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to temporary files on disk instead of failing at the memory limit
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete remaining duplicates
})
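
The pipelines above group on name alone, while the question treats two documents as duplicates only when they share a name and at least one node. Below is a minimal in-memory sketch of that rule in Python. It assumes the documents fit in memory and that the first document seen for a given name/node pair is the one to keep; `find_duplicate_ids` is a hypothetical helper, not a MongoDB API.

```python
def find_duplicate_ids(docs):
    """Return _ids of documents that share a name and at least one node
    with an earlier, kept document."""
    seen = set()          # (name, node) pairs owned by kept documents
    duplicates = []
    for doc in docs:
        pairs = {(doc["name"], node) for node in doc["nodes"]}
        if pairs & seen:
            duplicates.append(doc["_id"])   # overlaps a kept document
        else:
            seen |= pairs                   # keep it and remember its nodes
    return duplicates


docs = [
    {"_id": 1, "name": "Some beauty", "nodes": [12345, 54321]},
    {"_id": 2, "name": "Some beauty", "nodes": [54321, 99999]},
    {"_id": 3, "name": "Other", "nodes": [12345]},
]
print(find_duplicate_ids(docs))  # [2]
```

The resulting list could then be passed to a single remove with `_id: {$in: ...}`, as in the first variant above.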

score 49 · Accepted Answer

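Assuming you want to permanently delete documents that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option: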

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})

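As the docs say, use extreme caution with this, because it deletes data from your database. Back up your database first in case it does not do exactly what you expect.

UPDATE

This solution is only valid in MongoDB 2.x, as the dropDups option is no longer available in 3.0 (docs).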

score 31 · Accepted Answer

1. Create a dump of the collection with mongodump

2. Clear the collection

3. Add a unique index

4. Restore the collection with mongorestore
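
Why this works can be illustrated without a database: a unique index rejects any insert whose key already exists, so only the first document per key survives the restore. The following is a rough simulation in plain Python. It assumes the restore skips rejected documents and carries on rather than aborting, and that the key fields hold scalar values; the `restore_with_unique_index` helper is hypothetical.

```python
def restore_with_unique_index(dumped_docs, key_fields):
    """Simulate re-inserting dumped documents into an empty collection
    that has a unique index on key_fields (scalar values assumed)."""
    index = set()      # stands in for the unique index
    collection = []    # stands in for the restored collection
    rejected = 0
    for doc in dumped_docs:
        key = tuple(doc[field] for field in key_fields)
        if key in index:
            rejected += 1      # a real server would report a duplicate key error
            continue
        index.add(key)
        collection.append(doc)
    return collection, rejected
```

Which copy survives depends purely on the order in which the documents come back from the dump.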

score 14 · Accepted Answer

I found this solution that works with MongoDB 3.4. I'll assume the field with duplicates is called fieldX:

db.collection.aggregate([
{
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
    $replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})

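Being new to MongoDB, I spent a lot of time on other, lengthier solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.

It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups documents by fieldX and keeps only the $first document in each group, using $$ROOT. Finally, it replaces the whole aggregated group with the document found using $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipeline stages, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. ("Couldn't post a link here, my reputation is less than 10 :(")

You can save the results to a new collection by adding an $out stage...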
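
To make the two stages concrete, here is a small plain-Python sketch of what $group with $first on $$ROOT followed by $replaceRoot computes: one whole document per distinct fieldX value. It only mimics the result in memory and `first_per_key` is a hypothetical name. "First" here means first in input order, which corresponds to the pipeline only when the input order is defined, for example by a preceding $sort.

```python
def first_per_key(docs, key):
    """Keep the first document seen for each distinct value of `key`,
    skipping documents where the field is missing or None."""
    kept = {}
    for doc in docs:
        value = doc.get(key)
        if value is None:
            continue                  # mirrors the $match on {$nin: [null]}
        kept.setdefault(value, doc)   # later duplicates are ignored
    return list(kept.values())
```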

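Alternatively, if you are only interested in a few fields, e.g. field1 and field2, rather than the whole document, pick them in the group stage and drop replaceRoot: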

db.collection.aggregate([
{
    // only match documents that have this field
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})

score 4 · Accepted Answer

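My DB had millions of duplicate records. @somnath's answer did not work as is, so I am writing up the solution that worked for me, for people looking to delete millions of duplicate records.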

/** Create a array to store all duplicate records ids*/
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort it in such a way that you want to retain first element*/
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
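
The batching loop at the end is what keeps each delete request to a bounded size. The same slicing is sketched in Python below. The `chunked` helper is hypothetical; with pymongo each batch would presumably be passed to `delete_many` with an `_id: {"$in": batch}` filter.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


batches = list(chunked(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```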

score 2 · Accepted Answer

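Here is a slightly more "manual" way of doing it:

Essentially, first get a list of all the unique keys you are interested in.

Then run a search for each of those keys, and delete the extras if that search returns more than one document.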

  db.collection.distinct("key").forEach((num)=>{
    var i = 0;
    db.collection.find({key: num}).forEach((doc)=>{
      if (i)   db.collection.remove({key: num}, { justOne: true })
      i++
    })
  });

score 2 · Accepted Answer

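Tips to speed things up when only a small portion of your documents are duplicated:

1. You need an index on the field to detect duplicates.
2. $group does not use the index, but it can take advantage of $sort, and $sort uses the index. So you should put a $sort step at the beginning.
3. Do an in-place delete_many() instead of $out to a new collection; this saves a lot of IO time and disk space.

If you use pymongo you can do: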

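import itertools

import pymongo
from pymongo import IndexModel

# `col` is assumed to be an existing pymongo Collection
index_uuid = IndexModel(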
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])
pipeline = [
    {"$sort": {"uuid":1}},
    {
        "$group": {
            "_id": "$uuid",
            "dups": {"$addToSet": "$_id"},
            "count": {"$sum": 1}
        }
    },
    {
        "$match": {"count": {"$gt": 1}}
    },
]
it_cursor = col.aggregate(
    pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id":{"$in": dups}})
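
The line that builds `dups` is dense. An expanded equivalent is sketched below, run against hand-written sample group results rather than a real cursor. The `ids_to_delete` name and the sample data are made up for illustration.

```python
import itertools


def ids_to_delete(group_results):
    """Flatten the `dups` arrays of $group results, skipping the first
    _id of each group so that one document per group survives."""
    return list(
        itertools.chain.from_iterable(group["dups"][1:] for group in group_results)
    )


sample = [
    {"_id": "u1", "dups": [10, 11, 12], "count": 3},
    {"_id": "u2", "dups": [20, 21], "count": 2},
]
print(ids_to_delete(sample))  # [11, 12, 21]
```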

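Performance

I tested it on a database containing 30M documents and 1 TB in size.

1. Without the index/sort, it takes more than an hour just to get the cursor (I did not even have the patience to wait for it).
2. With index/sort but using $out to output to a new collection: this is safer if your filesystem does not support snapshots, but it requires a lot of disk space and takes more than 40 minutes to finish, even though we are using SSDs. It will be much slower on HDD RAID.
3. With index/sort and in-place delete_many, it takes around 5 minutes in total.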

score 1 · Accepted Answer

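The following method merges documents with the same name while keeping only the unique nodes, without duplicating them.

I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator allows the aggregation result to persist [docs]. If you put in the name of the collection itself, it will replace the collection with the new data. If the name does not exist, it will create a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.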

db.collectionName.aggregate([
  {
    $unwind:{path:"$nodes"},
  },
  {
    $group:{
      _id:"$name",
      nodes:{
        $addToSet:"$nodes"
      }
    }
  },
  {
    $project:{
      _id:0,
      name:"$_id",   // _id holds the name itself after the $group above
      nodes:1
    }
  },
  {
    $out:"collectionNameWithoutDuplicates"
  }
])
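
In plain Python, the effect of $unwind followed by $group with $addToSet can be sketched as below. It is only an in-memory illustration, assuming every document has a name and a nodes list. `merge_by_name` is a hypothetical helper, and the nodes are sorted only to make the output deterministic, since MongoDB does not guarantee the order of $addToSet results.

```python
def merge_by_name(docs):
    """Merge documents that share a name into one document whose nodes
    list is the union of all their nodes."""
    merged = {}
    for doc in docs:
        merged.setdefault(doc["name"], set()).update(doc["nodes"])
    return [{"name": name, "nodes": sorted(nodes)} for name, nodes in merged.items()]
```

Note that this merges on name alone, so two documents with the same name but no node in common also end up combined.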

score 1 · Accepted Answer

Using pymongo this should work.

In unique_field, list the fields that must be unique for the collection:

unique_field = {"field1":"$field1","field2":"$field2"}

cursor = DB.COL.aggregate([
    {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    # collect the second uuid of every duplicate group into a single array
    {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}},
], allowDiskUse=True)

Slice the dups array according to the number of duplicates (here I had only one extra duplicate per group):

items = list(cursor)
removeIds = items[0]['dups']
DB.COL.delete_many({"uuid": {"$in": removeIds}})

score 1 · Accepted Answer

First, you can find all the duplicates and then remove them from the DB. Here we use the id column to check for and remove duplicates.

db.collection.aggregate([
    { "$group": { "_id": "$id", "count": { "$sum": 1 } } },
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
    { "$sort": { "count": -1 } },
    { "$project": { "name": "$_id", "_id": 0 } }
]).then(data => {
    var dr = data.map(d => d.name);
    console.log("duplicate Records:: ", dr);
    // Note: this removes every document whose id is duplicated, including the first copy
    db.collection.remove({ id: { $in: dr } }).then(removedD => {
        console.log("Removed duplicate Data:: ", removedD);
    })
})

score 1 · Accepted Answer

The following Mongo aggregation pipeline does the deduplication and writes the result back to the same or a different collection.

collection.aggregate([
  { $group: {
    _id: '$field_to_dedup',
    doc: { $first: '$$ROOT' }
  } },
  { $replaceRoot: {
    newRoot: '$doc'
  } },
  { $out: 'collection' }
], { allowDiskUse: true })

score 0 · Accepted Answer

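The general idea is to use findOne https://docs.mongodb.com/manual/reference/method/db.collection.findOne/ to retrieve one random id from the duplicate records in the collection.
Then delete all the matching records in the collection other than the random id retrieved with findOne.

If you are trying to do this in pymongo, you can do something like this.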

import logging

_logger = logging.getLogger(__name__)


def _run_query(collection):
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)

            # each record looks like {'_id': <one distinct fieldX value>}
            query = {'fieldX': record['_id']}
            try:
                retain = collection.find_one(query, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])

                # delete every other document with the same fieldX value
                collection.delete_many({**query, '_id': {'$ne': retain['_id']}})

            except Exception as ex:
                _logger.error("Error when retaining the record: %s Exception: %s", record, str(ex))

    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])

From the shell:

Replace find_one with findOne.
The same remove command should work.

score 0 · Accepted Answer

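I don't know if this answers the main question, but it will be useful for others.

1. Query the duplicate row with the findOne() method and store it as an object.

const User = db.User.findOne({_id:"duplicateid"});

2. Run the deleteMany() method to remove all rows with the id "duplicateid".

db.User.deleteMany({_id:"duplicateid"});

3. Insert the values stored in the User object.

db.User.insertOne(User);

Easy and fast!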
