mongodb - Mongo: count word occurrences in a set of documents

Question

I have a set of documents in Mongo. Say:

[
    { summary:"This is good" },
    { summary:"This is bad" },
    { summary:"Something that is neither good nor bad" }
]

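I want to count the occurrences of each word (case-insensitively) and then sort the counts in descending order. The result should look something like this: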

[
    "is": 3,
    "bad": 2,
    "good": 2,
    "this": 2,
    "neither": 1,
    "nor": 1,
    "something": 1,
    "that": 1
]

Any idea how to do this? The aggregation framework would be preferred, since I already understand it to some degree :)

score 25 · Accepted Answer

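MapReduce might be a good fit here: it can process the documents on the server without any manipulation on the client (at the time of writing there is no feature for splitting a string on the DB server; this is an open issue).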

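Start with the map function. In the example below (which probably needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and, if it exists, lowercases it, splits it on spaces, and then emits a 1 for each word found.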

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};

Then, the reduce function adds up all of the values emitted by the map function and returns a single total for each word that was passed to emit above.

var reduce = function( key, values ) {    
    var count = 0;    
    values.forEach(function(v) {            
        count +=v;    
    });
    return count;
}
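
Summing the values (rather than counting them) matters because MongoDB may invoke reduce more than once for the same key, passing the output of an earlier call back in as one of the values. The plain JavaScript sketch below runs outside the shell and only imitates that behavior; sumReduce, countReduce and reduceInTwoPasses are hypothetical names used for illustration.

```javascript
// Hypothetical stand-ins for two reduce strategies; not part of MongoDB.
function sumReduce(key, values) {
    var count = 0;
    values.forEach(function (v) { count += v; });
    return count;
}

function countReduce(key, values) {
    return values.length; // only correct if reduce runs exactly once per key
}

// Imitate a re-reduce: three 1s are reduced first, then that partial
// result is reduced again together with two more 1s (5 emits in total).
function reduceInTwoPasses(reduceFn) {
    var partial = reduceFn("is", [1, 1, 1]);
    return reduceFn("is", [partial, 1, 1]);
}

var summed = reduceInTwoPasses(sumReduce);    // 5
var counted = reduceInTwoPasses(countReduce); // 3, the partial total is lost
```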

Finally, execute the mapReduce:

> db.so.mapReduce(map, reduce, {out: "word_count"})

Results for the sample data:

> db.word_count.find().sort({value:-1})
{ "_id" : "is", "value" : 3 }
{ "_id" : "bad", "value" : 2 }
{ "_id" : "good", "value" : 2 }
{ "_id" : "this", "value" : 2 }
{ "_id" : "neither", "value" : 1 }
{ "_id" : "nor", "value" : 1 }
{ "_id" : "something", "value" : 1 }
{ "_id" : "that", "value" : 1 }
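
To see how the two functions fit together without a server, the flow can be imitated in plain JavaScript. This is only a sketch: runMapReduce and the emit callback it supplies are hypothetical stand-ins for what MongoDB does internally, and the final sort (count descending, ties alphabetical) is an assumption made to get a stable order.

```javascript
var docs = [
    { summary: "This is good" },
    { summary: "This is bad" },
    { summary: "Something that is neither good nor bad" }
];

// Same logic as the map function above; emit is passed in by the harness.
function mapFn(emit) {
    var summary = this.summary;
    if (summary) {
        summary.toLowerCase().split(" ").forEach(function (word) {
            if (word) { emit(word, 1); }
        });
    }
}

function reduceFn(key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
}

// Hypothetical harness: group emitted values by key, then reduce each group.
function runMapReduce(documents, map, reduce) {
    var groups = Object.create(null); // no inherited keys such as "constructor"
    documents.forEach(function (doc) {
        map.call(doc, function (key, value) {
            (groups[key] = groups[key] || []).push(value);
        });
    });
    return Object.keys(groups).map(function (key) {
        return { _id: key, value: reduce(key, groups[key]) };
    }).sort(function (a, b) {
        // count descending, then word ascending (keys are unique)
        return b.value - a.value || (a._id < b._id ? -1 : 1);
    });
}

var wordCounts = runMapReduce(docs, mapFn, reduceFn);
```

Here wordCounts starts with { _id: "is", value: 3 } and holds eight entries, one per distinct word.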

score 7 · Accepted Answer

A basic MapReduce example

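var m = function() {
    // skip documents without a string summary
    if (typeof this.summary !== "string") {
        return;
    }
    var words = this.summary.split(" ");
    for (var i = 0; i < words.length; i++) {
        if (words[i]) {  // ignore empty strings left by repeated spaces
            emit(words[i].toLowerCase(), 1);
        }
    }
};

var r = function(k, v) {
    // sum the values; reduce may be re-invoked with partial results
    return Array.sum(v);
};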

db.collection.mapReduce(
    m, r, { out: { merge: "words_count" } }
)

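This inserts the word counts into a collection named words_count, which you can then sort (and index).

Note that it does not do any stemming, strip punctuation, handle stop words, and so on.

Also note that you can optimize the map function by accumulating repeated words within a document and emitting their count rather than just 1.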
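
As a rough sketch of that optimization, the map function can tally the words of one document first and emit each distinct word once with its count. This assumes the reduce function sums its values (returning the number of values would give wrong totals here); mapWithCounts and the stubbed emit are illustrative names, not part of MongoDB.

```javascript
// Map variant that counts repeats within one document before emitting.
// In MongoDB, emit is provided by mapReduce; it is stubbed further down
// only so that this sketch can run on its own.
var mapWithCounts = function () {
    if (typeof this.summary !== "string") { return; }
    var counts = Object.create(null); // plain dictionary without inherited keys
    this.summary.toLowerCase().split(" ").forEach(function (word) {
        if (word) { counts[word] = (counts[word] || 0) + 1; }
    });
    Object.keys(counts).forEach(function (word) {
        emit(word, counts[word]);
    });
};

// Stand-alone check with a stubbed emit (an assumption for local testing).
var emitted = [];
var emit = function (key, value) { emitted.push([key, value]); };
mapWithCounts.call({ summary: "bad is bad" });
// emitted is now [["bad", 2], ["is", 1]]
```

When passing mapWithCounts to mapReduce, leave the stub out; the server supplies its own emit.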

score 3 · Accepted Answer

You can use $split (available since MongoDB 3.4). Try the query below:

db.summary.aggregate([
{ $project : { summary : { $split: [{ $toLower: "$summary" }, " "] } } },
{ $unwind : "$summary" },
{ $group : { _id:  "$summary" , total : { "$sum" : 1 } } },
{ $sort : { total : -1 } }
]);
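
Two things to watch for with this approach: splitting on a single space yields empty strings when a summary contains consecutive spaces, and case has to be normalized for a case-insensitive count. One possible variant is sketched below as a plain function that only builds the pipeline array. buildWordCountPipeline and its field parameter are hypothetical names; the stages use the standard $toLower, $split, $unwind, $match, $group and $sort operators.

```javascript
// Builds an aggregation pipeline; pass the result to db.<collection>.aggregate().
function buildWordCountPipeline(field) {
    return [
        // lowercase the text, then split it into an array of words
        { $project: { word: { $split: [{ $toLower: "$" + field }, " "] } } },
        { $unwind: "$word" },
        // skip empty strings left by repeated spaces
        { $match: { word: { $ne: "" } } },
        { $group: { _id: "$word", total: { $sum: 1 } } },
        // highest count first, ties in alphabetical order
        { $sort: { total: -1, _id: 1 } }
    ];
}

var pipeline = buildWordCountPipeline("summary");
```

In the shell this would be used as db.summary.aggregate(buildWordCountPipeline("summary")), assuming the collection is named summary as in the query above.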

score 0 · Accepted Answer

An old question, but since MongoDB 4.2 this can be done with $regexFindAll.

db.summaries.aggregate([
  {$project: {
    occurences: {
      $regexFindAll: {
        input: '$summary',
        regex: /\b\w+\b/, // match words
      }
    }
  }},
  {$unwind: '$occurences'},
  {$group: {
    _id: { $toLower: '$occurences.match' }, // group by each word, ignoring case
    totalOccurences: {
      $sum: 1 // add up total occurences
    }
  }},
  {$sort: {
    totalOccurences: -1
  }}
]);

This outputs documents in the following format:

{
  _id: "matchedwordstring",
  totalOccurences: number
}
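
What the \b\w+\b pattern picks out can be previewed locally with JavaScript's own regex engine. This is an approximation, since MongoDB evaluates the pattern with PCRE rather than the JavaScript engine, and extractWords is a hypothetical helper. The input is lowercased to meet the case-insensitive requirement from the question.

```javascript
// Roughly the tokens the pattern extracts from one summary, lowercased.
function extractWords(text) {
    var matches = String(text).toLowerCase().match(/\b\w+\b/g);
    return matches || []; // match() returns null when nothing matches
}

var words = extractWords("This is good, isn't it?");
// words: ["this", "is", "good", "isn", "t", "it"]
```

Punctuation is dropped, but \w also stops at apostrophes, so "isn't" is counted as the two tokens "isn" and "t".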
