mongodb - MongoDB 2.2 Aggregation Framework: group by field name

Question

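Is it possible to group by field name? Or do I need a different structure so that I can group by value?

I know we can group by values and that we can unwind arrays, but is it possible to get the total number of apples, pears and oranges that John owns across the three houses here without explicitly naming "apples", "pears" and "oranges" in the query (so not like this):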

// total all the fruit John has at each house
db.houses.aggregate([
    {
        $group: {
            _id: null,

            "apples":  { $sum: "$people.John.items.apples" },
            "pears":   { $sum: "$people.John.items.pears" }, 
            "oranges": { $sum: "$people.John.items.oranges" }, 
        }
    },
])

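In other words, can I group by the first field name under "items" and get totals of apples: 104, pears: 202 and oranges: 306, plus bananas, melons and anything else that might be there? Or do I need to restructure the data into an array of key/value pairs, like categories?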

db.createCollection("houses");
db.houses.remove();
db.houses.insert(
[
    {
        House: "birmingham",
        categories : [
            {
                k : "location",
                v : { d : "central" }
            }
        ],
        people: {
            John: {
                items: {
                    apples: 2,
                    pears: 1,
                    oranges: 3,
                }
            },
            Dave: {
                items: {
                    apples: 30,
                    pears: 20,
                    oranges: 10,
                },
            },
        },
    },
    {
        House: "London", categories: [{ k: "location", v: { d: "central" } }, { k: "type", v: { d: "rented" } }],
        people: {
            John: { items: { apples: 2, pears: 1, oranges: 3, } },
            Dave: { items: { apples: 30, pears: 20, oranges: 10, }, },
        },
    },
    {
        House: "Cambridge", categories: [{ k: "type", v: { d: "rented" } }],
        people: {
            John: { items: { apples: 100, pears: 200, oranges: 300, } },
            Dave: { items: { apples: 0.3, pears: 0.2, oranges: 0.1, }, },
        },
    },
]
);

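Secondly, and more importantly, could I then also group by "house.categories.k"? In other words, is it possible to find out how many "apples" "John" has in "rented" vs "owned" or "friends" houses (so grouping by "categories.k.type")?

Finally, if this is possible, is it sensible? At first I thought building dictionaries of nested objects keyed by the objects' actual field names was very useful, since it seemed a logical use of a document database and it seemed to make MR queries easier to write than arrays did. Now I'm starting to wonder whether this was all a bad idea, and whether variable field names make aggregation queries very tricky or inefficient to write.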

score 3 · Accepted Answer

OK, so I think I have this partially solved, at least for the shape of data in the initial question.

// How many of each type of fruit does John have at each location
db.houses.aggregate([
    {
        $unwind: "$categories"
    },
    {
        $match: { "categories.k": "location" }
    },
    {
        $group: {
            _id: "$categories.v.d",
            "numberOf": { $sum: 1 },
            "Total Apples": { $sum: "$people.John.items.apples" },
            "Total  Pears": { $sum: "$people.John.items.pears" },
        }
    },
])

This produces:

{
        "result" : [
                {
                        "_id" : "central",
                        "numberOf" : 2,
                        "Total Apples" : 4,
                        "Total  Pears" : 2
                }
        ],
        "ok" : 1
}

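Notice that there is only "central", but if I had other "locations" in my database I'd get a range of totals, one per location. I wouldn't need the $unwind step if I used named properties instead of an array for "categories", but this is where I find the structure at odds with itself. There are likely to be several keywords under "categories". The sample data shows "type" and "location", but there could be around 10 of these categorisations, all with different values. So if I used named fields: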

"categories": {
  location: "london",
  type: "owned",
}

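...then the problem I have is indexing. I can't afford to simply index "location", since these are user-defined categories, and if 10,000 users choose 10,000 different ways of categorising their houses I'd need 10,000 indexes, one per field. By making it an array I only need one index, on the array field itself. The downside is the $unwind step. I ran into this before with MapReduce. The last thing you want to do, if you can help it, is a ForEach loop in JavaScript to cycle through an array. What you really want is to filter out the fields by name, because it's much quicker.

Now this is all well and good where I already know which fruit I'm looking for, but if I don't, it's much harder. I can't (as far as I can see) $unwind or otherwise ForEach "people.John.items" here. If I could, I'd be overjoyed. So since the names of the fruit are again user-defined, it looks like I need to convert them to an array as well, like this: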

{
    "people" : {
        "John" : {
            "items" : [
                { k:"apples", v:100 },
                { k:"pears", v:200 },
                { k:"oranges", v:300 },
            ]
        },
    }
}
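
As a rough sketch of that restructuring (an assumption on my part, not something taken from the original schema), a small helper can turn a named-field "items" object into the k/v array shape. The name itemsToPairs and the commented-out migration loop are hypothetical; the helper itself is plain JavaScript and does not touch the database.

```javascript
// Convert a named-field object such as { apples: 100, pears: 200 }
// into the key/value array shape [{ k: "apples", v: 100 }, ...].
function itemsToPairs(items) {
    var pairs = [];
    for (var name in items) {
        // Skip anything inherited from the prototype chain.
        if (Object.prototype.hasOwnProperty.call(items, name)) {
            pairs.push({ k: name, v: items[name] });
        }
    }
    return pairs;
}

// Hypothetical one-off migration in the mongo shell (assumed, not run here):
// db.houses.find().forEach(function (house) {
//     for (var person in house.people) {
//         house.people[person].items = itemsToPairs(house.people[person].items);
//     }
//     db.houses.save(house);
// });
```

The helper only reshapes one "items" object; anything that reads the old named-field shape would need updating at the same time.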

So that now allows me to get the fruit totalled, again by location, even where I don't know which fruit to look for:

db.houses.aggregate([
    {
        $unwind: "$categories"
    },
    {
        $match: { "categories.k": "location" }
    },
    {
        $unwind: "$people.John.items" 
    },
    {
        $group: { // compound key - thanks to Jenna
            _id: { fruit:"$people.John.items.k", location:"$categories.v.d" },
            "numberOf": { $sum: 1 },
            "Total Fruit": { $sum: "$people.John.items.v" },
        }
    },
])

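So now I'm doing two $unwinds. If you're thinking that looks grossly inefficient, you'd be right. If I have just 10,000 house records, each with 10 categories and 10 types of fruit, this query takes half a minute to run. OK, so I can see that moving the $match before the $unwind improves things significantly, but then it's the wrong output. I don't want an entry for every category; I just want to filter down to the "location" category.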
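
One way to filter early and still get the right output is to use $match twice. This is only a sketch, and it assumes the category value lives at "categories.v.d" as in the sample data. The first $match runs on whole documents before any $unwind, so it can use an index on "categories.k" if one exists, and it drops houses that have no "location" category at all. The second $match runs after the $unwind and discards the array elements for the other categories. The variable name pipeline is only for illustration.

```javascript
var pipeline = [
    // 1. Document-level filter: keep only houses with a "location" category.
    { $match: { "categories.k": "location" } },
    // 2. One intermediate document per category entry.
    { $unwind: "$categories" },
    // 3. Element-level filter: keep only the "location" entry of each house.
    { $match: { "categories.k": "location" } },
    // 4. One intermediate document per fruit entry.
    { $unwind: "$people.John.items" },
    // 5. Total each fruit per location.
    {
        $group: {
            _id: { fruit: "$people.John.items.k", location: "$categories.v.d" },
            "numberOf": { $sum: 1 },
            "Total Fruit": { $sum: "$people.John.items.v" }
        }
    }
];

// In the mongo shell (not run here): db.houses.aggregate(pipeline);
```

If every house has a "location" category, the first stage filters nothing and the saving disappears, so this is worth measuring rather than assuming.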

score 2 · Accepted Answer

I would have posted this as a comment, but it's easier to format in the answer text box.

{ _id: 1,
  house: "New York",
  people: {
      John: {
          items: {apples: 1, oranges:2}
      },
      Dave: {
          items: {apples: 2, oranges: 1}
      }
  }
}

{ _id: 2,
      house: "London",
      people: {
          John: {
              items: {apples: 3, oranges:2}
          },
          Dave: {
              items: {apples: 1, oranges:3}
          }
      }
}

Just to make sure I understand your question, is this what you're trying to accomplish?

{location: "New York", johnFruit:3}
{location: "London", johnFruit: 5}
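
To pin down that target, here is a minimal client-side sketch in plain JavaScript (not an aggregation pipeline) that produces this shape from documents like the two above without naming any fruit. The function name johnFruitByHouse is hypothetical, and it assumes every document has "people.John.items" as a named-field object.

```javascript
// For each house, add up every value under people.John.items,
// whatever the field names happen to be.
function johnFruitByHouse(docs) {
    return docs.map(function (doc) {
        var items = doc.people.John.items;
        var total = 0;
        for (var name in items) {
            if (Object.prototype.hasOwnProperty.call(items, name)) {
                total += items[name];
            }
        }
        return { location: doc.house, johnFruit: total };
    });
}
```

This does the work in the client, so it only illustrates the intended result; it says nothing about how to get the same thing out of the server efficiently.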

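Since categories isn't nested under house, you can't group by "house.categories.k", but you can use a compound key for the _id of $group to get this result:

{ $group: { _id: { house: "$House", category: "$categories.k" } } }

However, "k" doesn't contain the information you're presumably trying to group by. As for "categories.k.type", type is the value of k, so you can't use this syntax. You would have to group by "categories.v.d".

It may be possible with your current schema to accomplish this aggregation using $unwind, $project, possibly $match, and finally $group, but the command won't be pretty. If possible, I would highly recommend restructuring your data to make this aggregation much simpler. If you'd like assistance with the schema, please let us know.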

score 0 · Accepted Answer

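I'm not sure if this is a possible solution, but what if you start the aggregation process by determining the distinct locations using distinct(), and then run a separate aggregation command for each location? distinct() may not be efficient, but each subsequent aggregation can use $match, and can therefore use the index on categories. You could use the same logic to count fruit for "categories.type".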

{
    "_id" : 1,
    "house" : "New York",
    "people" : {
        "John" : [{"k" : "apples","v" : 1},{"k" : "oranges","v" : 2}],
        "Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 1}]
    },
    "categories" : [{"location" : "central"},{"type" : "rented"}]
}
{
    "_id" : 2,
    "house" : "London",
    "people" : {
        "John" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 2}],
        "Dave" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 1}]
    },
    "categories" : [{"location" : "suburb"},{"type" : "rented"}]
}
{
    "_id" : 3,
    "house" : "London",
    "people" : {
        "John" : [{"k" : "apples","v" : 0},{"k" : "oranges","v" : 1}],
        "Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 4}]
    },
    "categories" : [{"location" : "central"},{"type" : "rented"}]
}

Run distinct(), then iterate through the results by running the aggregate() command once for each unique value of "categories.location":

db.agg.distinct("categories.location")
[ "central", "suburb" ]

db.agg.aggregate(
    {$match: {categories: {location:"central"}}}, //the index entry is on the entire 
    {$unwind: "$people.John"},                    //document {location:"central"}, so 
    {$group:{                                     //use this syntax to use the index
         _id:"$people.John.k", 
         "numberOf": { $sum: 1 },
         "Total Fruit": { $sum: "$people.John.v"}
        }
     }
 )


{
    "result" : [
        {
            "_id" : "oranges",
            "numberOf" : 2,
            "Total Fruit" : 3
        },
        {
            "_id" : "apples",
            "numberOf" : 2,
            "Total Fruit" : 1
        }
    ],
    "ok" : 1
}
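
The iteration itself could look something like the sketch below. The helper names pipelineForLocation and totalsPerLocation are hypothetical, and db is assumed to be the shell's database object with the same "agg" collection as above. The return value of aggregate() is stored as-is, since its shape differs between server versions.

```javascript
// Build the per-location pipeline shown above for one location value.
function pipelineForLocation(location) {
    return [
        // Whole-subdocument match, so the index on "categories" can be used.
        { $match: { categories: { location: location } } },
        { $unwind: "$people.John" },
        {
            $group: {
                _id: "$people.John.k",
                "numberOf": { $sum: 1 },
                "Total Fruit": { $sum: "$people.John.v" }
            }
        }
    ];
}

// Run one aggregation per distinct location and collect the raw results,
// keyed by location.
function totalsPerLocation(db) {
    var totals = {};
    db.agg.distinct("categories.location").forEach(function (location) {
        totals[location] = db.agg.aggregate(pipelineForLocation(location));
    });
    return totals;
}

// In the mongo shell (not run here): totalsPerLocation(db);
```

This issues one aggregate() call per location, so it trades a single large pipeline for several smaller indexed ones.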
