mongodb - Record count drops after mongo mapreduce

Question

Here is my mapreduce code:

DBCollection mongoCollection = MongoDAO.getCollection();
String map = "function() {"
        + "for (index in this.positions.positionList) {"
        + "emit(this._id+'|'+this.headline+'|'+"
        + "this.location.name+'|'+this.location.country.code+'|'+this.publicProfileUrl+'|'+"
        + "this.positions.positionList[index].title+'|'+"
        + "this.positions.positionList[index].company.name+'|'+this.positions.positionList[index].company.industry+'|'+"
        + "this.positions.positionList[index].company.type+'|'+this.positions.positionList[index].company.size+'|'+"
        + "this.lastName+'|'+this.firstName+'|'+this.industry+'|'+this.updatedDate+'|' , {count: 1});"
        + "}}";
String reduce = "";
MapReduceCommand mapReduceCommand = new MapReduceCommand(
        mongoCollection, map, reduce.toString(), "final_result",
        MapReduceCommand.OutputType.REPLACE, null);

MapReduceOutput out = mongoCollection.mapReduce(mapReduceCommand);

I am currently processing 140,000 records, but after running the mapreduce the record count drops to 90,000. There are no duplicate records in the dataset.

score 1 · Accepted Answer

Change your emit so that it emits _id as the key and the pipe-delimited string as the value. For example:

emit(this._id, [this._id, this.a, this.b,...].join('|'))
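On the Java side, the same idea could be sketched roughly as below. This is only an illustration: MapFunctionSketch and buildMap are hypothetical names, not part of the MongoDB Java driver, and the field paths are taken from the question. The helper only builds the JavaScript map string, so the short _id stays in the key and everything else moves into the value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the MongoDB Java driver.
class MapFunctionSketch {

    // Builds a JavaScript map function that emits the document _id as the key
    // and a pipe-delimited string of the requested fields as the value.
    // arrayPath is the embedded array to loop over, docFields are paths on the
    // document itself, itemFields are paths on each array element.
    static String buildMap(String arrayPath, List<String> docFields, List<String> itemFields) {
        StringBuilder js = new StringBuilder();
        js.append("function() {");
        js.append("var list = this.").append(arrayPath).append(";");
        js.append("for (var index in list) {");
        js.append("var item = list[index];");
        js.append("emit(this._id, [this._id");
        for (String field : docFields) {
            js.append(", this.").append(field);
        }
        for (String field : itemFields) {
            js.append(", item.").append(field);
        }
        js.append("].join('|'));");
        js.append("}}");
        return js.toString();
    }

    public static void main(String[] args) {
        // Only a few of the question's fields are listed here to keep the sketch short.
        String map = buildMap("positions.positionList",
                Arrays.asList("headline", "location.name", "publicProfileUrl"),
                Arrays.asList("title", "company.name", "company.industry"));
        System.out.println(map);
    }
}
```

The resulting string can be passed as the map argument of MapReduceCommand exactly as in the question. One thing to keep in mind: emits that share the same _id are grouped under one key, so a document with several positions will probably need a reduce function that combines its values.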

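I think what is happening is that the string you build for the key is too long. _id values are limited to 1KB (as of 2.0, up from the previous 800B), and the emitted key becomes the _id of the output document, so that limit applies to your key.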
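To get a rough feel for whether a composite key could hit that limit, one could measure its UTF-8 size before emitting it. The sketch below is an approximation under stated assumptions: KeySizeSketch and its methods are hypothetical names, the 1024-byte figure is simply the 1KB quoted above for 2.0, and BSON overhead is ignored.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the MongoDB Java driver.
class KeySizeSketch {

    // Assumption: the 1KB limit quoted in the answer for MongoDB 2.0.
    // Other versions may differ, and BSON overhead is not counted here.
    static final int ASSUMED_KEY_LIMIT_BYTES = 1024;

    // Number of bytes the key occupies when encoded as UTF-8.
    static int utf8Length(String key) {
        return key.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the encoded key is larger than the assumed limit.
    static boolean exceedsLimit(String key) {
        return utf8Length(key) > ASSUMED_KEY_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        // A composite key shaped like the one in the question, with made-up values.
        String key = "someId|Some headline|Some city|us|http://example.com/profile|";
        System.out.println(utf8Length(key) + " bytes, exceeds limit: " + exceedsLimit(key));
    }
}
```

A key made of a dozen free-text fields, such as headline and company name, can plausibly grow past this size for some documents, which would fit the symptom of only part of the records appearing in the output.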

Also, you may want to look at the prepackaged mongodb-hadoop connector rather than rolling your own: https://github.com/mongodb/mongo-hadoop
