mongodb - Mongo Map-Reduce - Top venues for users within a radius

Question

I'm having trouble with my MapReduce function. The goal is to get a list of the top venues near a given lat/lng, grouped by vid and ordered by the number of distinct user_id values.

Here is a sample of the data set:

  { "_id" : ObjectId("51f9234feb97ff0700000046"), "checkin_id" : 39286249, "created_at" : ISODate("2013-07-31T14:47:11Z"), "loc" : { "lat" : 42.3672, "lon" : -86.2681 }, "icv" : 1, "ipv" : 1, "vid" : 348442, "user_id" : 151556, "bid" : 9346, "pid" : 549 }
  { "_id" : ObjectId("51f9234b488fff0700000006"), "checkin_id" : 39286247, "created_at" : ISODate("2013-07-31T14:47:07Z"), "loc" : { "lat" : 55.6721, "lon" : 12.5576 }, "icv" : 1, "ipv" : 1, "vid" : 3124, "user_id" : 472486, "bid" : 7983, "pid" : 2813 }
  ...

Here is my map function:

map1 = function() {
  var tempDoc = {};
  tempDoc[this.user_id] = 1;

  emit(this.vid, {
     users: tempDoc,
     count: 1
  });
}

And the reduce function:

reduce1 = function(key, values) {

    var summary = {
     users: {},
     total: 0
    };

    values.forEach(function (doc) {

       // increment total for every value
       summary.total += doc.count;

       // Object.extend() will only add keys from the right object that do not exist on the left object
      Object.extend(summary.users, doc.users);

    });


   return summary;
};

My geo query:

var d = Date("2013-07-31T14:47:11Z");
var geo_query = {loc: {$near: [40.758318,-73.952985], $maxDistance: 25}, "icv":1, "created_at": {$gte: d}};

And finally the mapReduce call:

var res = db.myColelction.mapReduce(map1, reduce1,  { out : { inline : 1 }, query : geo_query });

The results returned match the reduce function, but the finalize1 function is never hit:

...
{
    "_id" : 609096,
    "value" : {
        "users" : {
            "487586" : 1
        },
        "count" : 1
    }
},
{
    "_id" : 622448,
    "value" : {
        "users" : {
            "313755" : 1,
            "443180" : 1
        },
        "total" : 4
    }
},
...

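At this point I think I have a good result set, but the $near operator only scans the nearest 100 venues. I want to scan all of them: take every document that falls within this radius (25m), look at all the venues, group them, and count the unique users in that time period. I have searched around and read the docs, but I'm not sure of the solution. Any takers?

For me, the final step would be to sort and limit the results by the "total" property. Ideally I'd like to sort by total descending and limit to 15.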

score 4 · Accepted Answer

I would do the following. First, you have the coordinates the wrong way around. MongoDB wants longitude, latitude, preferably in GeoJSON format:

loc: { type: 'Point', coordinates: [-73.952985, 40.758318] },

MongoDB does not care about the lat and lon field names and will ignore them.
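If the existing documents store the location as { lat, lon }, they would have to be rewritten into that shape first. A minimal sketch of the conversion in plain JavaScript; the helper name toGeoJSONPoint is hypothetical and not part of MongoDB:

```javascript
// Hypothetical helper: convert a { lat, lon } sub-document into a GeoJSON Point.
// GeoJSON puts longitude first, then latitude.
function toGeoJSONPoint(loc) {
  return { type: 'Point', coordinates: [loc.lon, loc.lat] };
}

// Location taken from the first sample document in the question.
var point = toGeoJSONPoint({ lat: 42.3672, lon: -86.2681 });
console.log(JSON.stringify(point));
```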

But you should also avoid Map/Reduce, as it is slow and complex. Instead, we can do something similar with the aggregation framework:

db.so.aggregate( [
    // search for all the (well, million) venues within 250 km
    { $geoNear: {
        near: { type: 'Point', coordinates: [-73.952985, 40.758318] },
        spherical: true,
        distanceField: 'd',
        maxDistance: 250 * 1000,
        limit: 1000000
    } },
    // find only the items where icv=1
    { $match: { icv: 1 } },
    // group by venue and user
    { $group: { 
        _id: { vid: '$vid', user_id: '$user_id' }, 
        count: { $sum: 1 } } 
    },
    // then regroup by just venue:
    { $group: { 
        _id: '$_id.vid', 
        users: { $addToSet: { user_id: '$_id.user_id', count: '$count' } }, 
        total: { $sum: '$count' } 
    } },
    // now we sort by "total", desc:
    { $sort: { 'total': -1 } },
    // and limit by 15:
    { $limit: 15 }
] );

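I have used $geoNear as the first stage and the match on icv as the second stage, because the index used by $geoNear is likely to be much more selective than one on icv (which, I guess, only ever has the values 0 or 1 anyway).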

Note that for this example I have used 250 km (250 * 1000 meters) instead of 25 km.

With the following input:

db.so.insert( { "_id" : ObjectId("51f9234feb97ff0700000046"), "loc" : { type: 'Point', coordinates: [ -73.2681, 40.3672 ] }, "vid" : 348442, "user_id" : 151556 } );
db.so.insert( { "_id" : ObjectId("51f9234b488fff0700000006"), "loc" : { type: 'Point', coordinates: [ -73.5576, 40.6721 ] }, "vid" : 3124, "user_id" : 472486 } );
db.so.insert( { "_id" : ObjectId("51f92345488fff0700000006"), "loc" : { type: 'Point', coordinates: [ -73.5576, 40.6721 ] }, "vid" : 3124, "user_id" : 47286 } );
db.so.insert( { "_id" : ObjectId("52f92345488fff0700000006"), "loc" : { type: 'Point', coordinates: [ -73.5576, 40.6721 ] }, "vid" : 3124, "user_id" : 47286 } );

you get this result:

{
    "result" : [
        {
            "_id" : 3124,
            "users" : [
                { "user_id" : 472486, "count" : 1 },
                { "user_id" : 47286, "count" : 2 }
            ],
            "total" : 3
        },
        {
            "_id" : 348442,
            "users" : [
                { "user_id" : 151556, "count" : 1 }
            ],
            "total" : 1
        }
    ],
    "ok" : 1
}

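There is only one difference from your desired output: user_id is not the key for the count, but an extra field in a subdocument. In general, you cannot use the aggregation framework to turn values into keys or keys into values.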
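If the keyed form is really needed, it can be rebuilt on the client after the query returns. A minimal sketch in plain JavaScript, assuming result documents shaped like the output above; the function name usersByKey and the uniqueUsers field are made up for illustration:

```javascript
// Hypothetical helper: turn the `users` array produced by the pipeline
// into an object keyed by user_id, as in the question's map-reduce output.
function usersByKey(venue) {
  var users = {};
  venue.users.forEach(function (u) {
    users[u.user_id] = u.count;
  });
  return {
    _id: venue._id,
    users: users,
    uniqueUsers: venue.users.length, // one array entry per distinct user
    total: venue.total
  };
}

// First result document from the output above.
var reshaped = usersByKey({
  _id: 3124,
  users: [
    { user_id: 472486, count: 1 },
    { user_id: 47286, count: 2 }
  ],
  total: 3
});
console.log(JSON.stringify(reshaped));
```

Here uniqueUsers is the number of distinct users at the venue, which is what the question wanted to order by, while total counts every check-in.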

score 0 · Accepted Answer

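You say the function only scans 100 venues. My understanding of $near is that it scans the whole collection and only returns the nearest 100.

Copied from the $near documentation:

Note: You can further limit the number of results using cursor.limit(). Specifying a batch size (i.e. batchSize()) in conjunction with queries that use $near is not defined. See SERVER-5236 for more information.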
