mongodb - Find all duplicate documents in a MongoDB collection by a key field

Question

Let's say I have a collection with a set of documents, something like this:

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

I want to find all the duplicated entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears 3 times.

score 151 · Accepted Answer

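The accepted answer is very slow on large collections, and it doesn't return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids: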

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

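In the first stage of the aggregation pipeline, the $group operator groups documents by the name field and collects the _id value of every grouped record in uniqueIds. The $sum operator adds up the values passed to it, in this case the constant 1, and so counts the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to keep only groups with a count of at least 2, i.e. duplicates.

Then we sort so the most frequent duplicates come first, and limit the results to the top 10.

This query will output up to $limit records with duplicate names, along with their _ids. For example: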

{
  "_id" : {
    "name" : "Toothpick"
  },
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv"
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}
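To make the four stages concrete, here is a rough in-memory sketch in plain JavaScript that mimics what the pipeline does to the sample documents from the question. It does not talk to MongoDB at all; the helper name `findDuplicates` and the short string `_id` values are made up for illustration, and the real `$addToSet` makes no promise about the order of the collected ids.

```javascript
// In-memory illustration only; `findDuplicates` is a hypothetical helper,
// not part of any MongoDB driver or shell.
function findDuplicates(docs, field, limit) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!groups.has(key)) {
      groups.set(key, { _id: { [field]: key }, uniqueIds: [], count: 0 });
    }
    const group = groups.get(key);
    // Mimics $addToSet: store each _id at most once.
    if (!group.uniqueIds.includes(doc._id)) group.uniqueIds.push(doc._id);
    // Mimics $sum: 1, adding the constant 1 per grouped document.
    group.count += 1;
  }
  return [...groups.values()]
    .filter((g) => g.count >= 2)       // mimics $match: { count: { $gte: 2 } }
    .sort((a, b) => b.count - a.count) // mimics $sort: { count: -1 }
    .slice(0, limit);                  // mimics $limit
}

// Sample data modelled on the question; the _id strings are placeholders.
const docs = [
  { _id: "a1", id: 1, name: "foo" },
  { _id: "a2", id: 2, name: "bar" },
  { _id: "a3", id: 3, name: "baz" },
  { _id: "a4", id: 4, name: "foo" },
  { _id: "a5", id: 5, name: "bar" },
  { _id: "a6", id: 6, name: "bar" },
];

const duplicates = findDuplicates(docs, "name", 10);
console.log(JSON.stringify(duplicates, null, 2));
```

With this data, "bar" comes out first with three ids and "foo" second with two, while "baz" is dropped by the filter step.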

score 17 · Accepted Answer

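Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to find out how many times each value of a given field occurs across the documents: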

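// Emit a 1 for every document that has a (truthy) name field.
var map = function(){
   if(this.name) {
        emit(this.name, 1);
   }
}

// Sum the emitted 1s per name to get the number of occurrences.
var reduce = function(key, values){
    return Array.sum(values);
}

// With { inline : 1 } the output comes back in res.results instead of being
// written to a collection, so filter and sort that array directly.
var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
res.results
    .filter(function(r){ return r.value > 1; })
    .sort(function(a, b){ return b.value - a.value; });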
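If the map and reduce phases feel abstract, the following plain JavaScript sketch imitates them in memory. It is only an approximation under stated assumptions: `runMapReduce` is a made-up helper, `emit` is passed in as an argument here (in the mongo shell it is provided for you), and `Array.sum` is replaced by a standard `reduce` call because it is a shell-only helper. As in MongoDB, the reduce function is skipped for keys that received a single value.

```javascript
// Hypothetical in-memory stand-in for mapReduce; not a MongoDB API.
function runMapReduce(docs, map, reduce) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Map phase: call map with each document bound to `this`.
  for (const doc of docs) map.call(doc, emit);
  // Reduce phase: combine the values emitted for each key.
  return [...emitted].map(([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

const mapByName = function (emit) {
  if (this.name) emit(this.name, 1);
};
const sumValues = (key, values) => values.reduce((sum, v) => sum + v, 0);

// Placeholder documents; the last one has no name and emits nothing.
const sample = [
  { name: "foo" }, { name: "bar" }, { name: "baz" },
  { name: "foo" }, { name: "bar" }, { name: "bar" }, { id: 7 },
];

const nameCounts = runMapReduce(sample, mapByName, sumValues);
const repeated = nameCounts
  .filter((r) => r.value > 1)
  .sort((a, b) => b.value - a.value);
console.log(repeated);
```

The filter and sort at the end play the same role as the `{value: {$gt: 1}}` query and the descending sort on `value`.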

score 5 · Accepted Answer

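For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful, since it can return the _ids of the duplicate records.

For PHP, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method: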

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // table

$keys       = array("name" => 1); // select the name field and group by it

// set initial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

The output will look like this:

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

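The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that we still need to filter out the elements of the array whose count is 1, since those are not duplicates (the count starts at 0 and is incremented once per document, so it is never 0 for a returned group). Again, refer to the MongoDB cookbook recipe for finding duplicates using group for the canonical solution.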
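The remaining filtering step can be sketched like this. The example is in JavaScript rather than PHP and works on a hand-written object shaped like the printed result above; the variable names are placeholders, and the empty name from the output is represented as `null` here as an assumption.

```javascript
// Hand-written stand-in for the structure returned by group(); not live data.
const groupResult = {
  retval: [
    { name: null, count: 1 },
    { name: "MongoDB", count: 2 },
  ],
  count: 3,
  keys: 2,
  ok: 1,
};

// Keep only the names that occur more than once.
const duplicateNames = groupResult.retval.filter((entry) => entry.count > 1);
console.log(duplicateNames);
```

In SQL terms this corresponds to adding HAVING COUNT(name) > 1 to the GROUP BY query.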

score 3 · Accepted Answer

The aggregation pipeline framework can be used to easily identify documents with duplicate key values:

// Desired unique index: 
// db.collection.createIndex({ firstField: 1, secondField: 1 }, { unique: true })

db.collection.aggregate([
  { $group: { 
    _id: { firstField: "$firstField", secondField: "$secondField" }, 
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  }}, 
  { $match: { 
    count: { $gt: 1 } 
  }}
])
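For a compound key, two documents only count as duplicates when every key field matches. A small in-memory sketch of that idea, assuming every document has both fields; `findCompoundDuplicates` and the sample values are invented for illustration and no MongoDB call is involved:

```javascript
// Hypothetical helper that groups on several fields at once, in memory.
function findCompoundDuplicates(docs, fields) {
  const groups = new Map();
  for (const doc of docs) {
    const id = {};
    for (const field of fields) id[field] = doc[field];
    // Serialize the compound key so it can be used as a Map key.
    const key = JSON.stringify(id);
    if (!groups.has(key)) groups.set(key, { _id: id, uniqueIds: [], count: 0 });
    const group = groups.get(key);
    group.uniqueIds.push(doc._id); // _id values are assumed unique already
    group.count += 1;
  }
  // Same idea as $match: { count: { $gt: 1 } }
  return [...groups.values()].filter((g) => g.count > 1);
}

const records = [
  { _id: 1, firstField: "a", secondField: 1 },
  { _id: 2, firstField: "a", secondField: 2 },
  { _id: 3, firstField: "a", secondField: 1 },
  { _id: 4, firstField: "b", secondField: 1 },
];

const compoundDuplicates = findCompoundDuplicates(records, ["firstField", "secondField"]);
console.log(JSON.stringify(compoundDuplicates));
```

Documents 1 and 2 share firstField but differ in secondField, so only documents 1 and 3 are reported; those are the ones that would block the unique index.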

~ Ref.: useful information on the official mLab blog:

https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework

score 1 · Accepted Answer

The top-voted answer here has this:

uniqueIds: { $addToSet: "$_id" },

That will also return a new field called uniqueIds containing a list of ids. But what if you just want the field and its count? Then it would look like this:

db.collection.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

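To explain this: if you come from SQL databases such as MySQL and PostgreSQL, you are used to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) that work together with the GROUP BY statement, for example to find the total number of times a column value appears in a table.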

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

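As you can see, the output shows how many times each my_type value appears. To find duplicates in MongoDB, we tackle the problem in a similar way. MongoDB has aggregation operations, which group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. It is a concept similar to aggregate functions in SQL.

Assuming a collection called contacts, the initial setup looks like this: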

db.contacts.aggregate([ ... ]);

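The aggregate function takes an array of aggregation operators. In our case we need the $group operator, since our goal is to group the documents by a field's value and count how many times each value occurs.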

db.contacts.aggregate([  
    {$group: { 
        _id: {name: "$name"} 
        } 
    }
]);

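There is a small quirk to this approach. The _id field is required by the $group operator. In this case, we are grouping on the $name field. The key name inside _id can be anything, but we use name since it is intuitive here.

By running the aggregation with only the $group operator, we get a list of all the name values (regardless of whether they appear once or more than once in the collection):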

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"} 
    } 
  }
]);

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

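Notice how the aggregation above works. It reads the name field of each document and returns a new set of results with one document per distinct name.

But what we want to know is how many times each field value occurs. The $group operator accepts a count field that uses the $sum operator to add the expression 1 to a running total for every document in the group. Together, $group and $sum: 1 therefore return the number of documents that share a given field value (e.g. name).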

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"},
    count: {$sum: 1}
    } 
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }
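To see why summing the constant 1 gives a count, here is a simplified in-memory imitation of the $sum accumulator for a single group. `sumAccumulator` is a made-up helper and the `age` field is a hypothetical addition that is not in the contacts data above; the sketch only handles a constant or a top-level "$field" reference.

```javascript
// Simplified, hypothetical stand-in for $sum inside a $group stage.
function sumAccumulator(docsInGroup, expr) {
  let total = 0;
  for (const doc of docsInGroup) {
    // "$field" reads that field from the document; anything else is a constant.
    const value =
      typeof expr === "string" && expr.startsWith("$") ? doc[expr.slice(1)] : expr;
    // Like $sum, skip values that are not numbers.
    if (typeof value === "number") total += value;
  }
  return total;
}

// One group: all documents whose name is "Joan" (ages are invented).
const joans = [
  { name: "Joan", age: 30 },
  { name: "Joan", age: 41 },
  { name: "Joan" },
];

const joanCount = sumAccumulator(joans, 1);       // adds 1 per document
const joanAges = sumAccumulator(joans, "$age");   // adds the age values instead
console.log(joanCount, joanAges);
```

Passing 1 adds one per document, which is exactly the document count for the group, while passing a field reference would add up that field's values instead.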

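Since the goal is to eliminate duplicates, one extra step is needed. To get only the groups with a count greater than 1, we can use the $match operator to filter the results. Inside $match, we tell it to look at the count field and, using the $gt ("greater than") operator with the number 1, keep only counts greater than 1.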

db.contacts.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }

As a side note, if you are using MongoDB through an ORM like Mongoid for Ruby, you might get this error:

The 'cursor' option is required, except for aggregate with the explain argument

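This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. So either update your ORM or find a fix. For Mongoid, this was the fix for me: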

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end
