group-by - 在 MongoDB GROUP BY 中进行 HAVING 的正确方法是什么？

Question

对于 SQL 中的这个查询（查找重复项）是什么：

SELECT userId, name FROM col GROUP BY userId, name HAVING COUNT(*)>1

我在 MongoDB 中执行了这个简单的查询：

res = db.col.group({key:{userId:true,name:true}, 
                     reduce: function(obj,prev) {prev.count++;}, 
                     initial: {count:0}})

我添加了一个简单的 Javascript 循环来遍历结果集，并执行了一个过滤器来查找所有计数 > 1 的字段，如下所示：

for (i in res) {if (res[i].count>1) printjson(res[i])};

除了在客户端使用 javascript 代码之外，还有更好的方法吗？如果这是最好/最简单的方法，请说它是，这个问题会对某人有所帮助:)

score 64 · Accepted Answer

使用 Mongo 聚合框架的新答案

问完这个问题，10gen 发布了 2.2 版本的 Mongodb 聚合框架。执行此查询的最佳新方法是：

db.col.aggregate( [
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

10gen 有一个方便的SQL 到 Mongo 聚合转换图表，值得收藏。

score 0 · Accepted Answer

已经给出的答案很容易说实话，并且由于隐式优化在幕后工作，使用投影使它变得更好。我做了一个小的改变，我正在解释它背后的积极因素。

原始命令

db.getCollection('so').explain(1).aggregate( [
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

解释计划中的部分

{
    "stages" : [ 
        {
            "$cursor" : {
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "5fa42c8b8778717d277f67c4_test.so",
                    "indexFilterSet" : false,
                    "parsedQuery" : {},
                    "queryHash" : "F301762B",
                    "planCacheKey" : "F301762B",
                    "winningPlan" : {
                        "stage" : "PROJECTION_SIMPLE",
                        "transformBy" : {
                            "name" : 1,
                            "userId" : 1,
                            "_id" : 0
                        },
                        "inputStage" : {
                            "stage" : "COLLSCAN",
                            "direction" : "forward"
                        }
                    },
                    "rejectedPlans" : []
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 6000,
                    "executionTimeMillis" : 8,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 6000,

样本集非常小，只有 6000 个文档
此查询将处理 WiredTiger 内部缓存中的数据，因此如果集合的大小很大，那么所有内容都将保存在内部缓存中以确保执行发生。WT Cache 非常重要，如果此命令在缓存中占用如此大的空间，那么缓存大小将必须更大以容纳其他操作

现在是一个小的、hack 和添加的索引。

 db.getCollection('so').createIndex({userId : 1, name : 1})

新命令

db.getCollection('so').explain(1).aggregate( [
    {$match : {name :{ "$ne" : null }, userId : { "$ne" : null } }},
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

解释计划

{
  "stages": [{
        "$cursor": {
          "queryPlanner": {
            "plannerVersion": 1,
            "namespace": "5fa42c8b8778717d277f67c4_test.so",
            "indexFilterSet": false,
            "parsedQuery": {
              "$and": [{
                  "name": {
                    "$not": {
                      "$eq": null
                    }
                  }
                },
                {
                  "userId": {
                    "$not": {
                      "$eq": null
                    }
                  }
                }
              ]
            },
            "queryHash": "4EF9C4D5",
            "planCacheKey": "3898FC0A",
            "winningPlan": {
              "stage": "PROJECTION_COVERED",
              "transformBy": {
                "name": 1,
                "userId": 1,
                "_id": 0
              },
              "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                  "userId": 1.0,
                  "name": 1.0
                },
                "indexName": "userId_1_name_1",
                "isMultiKey": false,
                "multiKeyPaths": {
                  "userId": [],
                  "name": []
                },
                "isUnique": false,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": 2,
                "direction": "forward",
                "indexBounds": {
                  "userId": [
                    "[MinKey, undefined)",
                    "(null, MaxKey]"
                  ],
                  "name": [
                    "[MinKey, undefined)",
                    "(null, MaxKey]"
                  ]
                }
              }
            },
            "rejectedPlans": [{
              "stage": "PROJECTION_SIMPLE",
              "transformBy": {
                "name": 1,
                "userId": 1,
                "_id": 0
              },
              "inputStage": {
                "stage": "FETCH",
                "filter": {
                  "userId": {
                    "$not": {
                      "$eq": null
                    }
                  }
                },
                "inputStage": {
                  "stage": "IXSCAN",
                  "keyPattern": {
                    "name": 1.0
                  },
                  "indexName": "name_1",
                  "isMultiKey": false,
                  "multiKeyPaths": {
                    "name": []
                  },
                  "isUnique": false,
                  "isSparse": false,
                  "isPartial": false,
                  "indexVersion": 2,
                  "direction": "forward",
                  "indexBounds": {
                    "name": [
                      "[MinKey, undefined)",
                      "(null, MaxKey]"
                    ]
                  }
                }
              }
            }]
          },
          "executionStats": {
            "executionSuccess": true,
            "nReturned": 6000,
            "executionTimeMillis": 9,
            "totalKeysExamined": 6000,
            "totalDocsExamined": 0,
            "executionStages": {
              "stage": "PROJECTION_COVERED",
              "nReturned": 6000,

检查 Projection_Covered 部分，这个命令是一个覆盖查询，基本上只是依赖于索引中的数据
此命令不需要将数据保留在 WT 内部缓存中，因为它根本不会去那里，检查检查的文档，它是 0，因为数据在索引中，它正在使用它来执行，这是一个很大的对于 WT Cache 已经受到其他操作压力的系统来说是积极的
如果有任何机会需要搜索特定名称而不是整个集合，那么这将变得有用：D
这里的缺点是添加了索引，如果此索引也用于其他操作，那么老实说没有缺点，但如果这是一个额外的添加，那么缓存中的索引将占用更多空间 + 写入会受到添加的影响一个索引

*在 6000 条记录的性能方面，显示的时间多 1 毫秒，但对于更大的数据集，这可能会有所不同。需要注意的是，我插入的示例文档只有 3 个字段，除了这里使用的两个，默认的 _id，如果这个集合有更大的文档大小，那么原始命令的执行会增加，它将在缓存也会增加。

group-by - 在 MongoDB GROUP BY 中进行 HAVING 的正确方法是什么？

2 回答 2

使用 Mongo 聚合框架的新答案

Related

Reference