java - Querying in a MongoDB Map-Reduce function

Question

I have streamed about 250,000 tweets and saved them into MongoDB. Here I am retrieving them, as you can see, based on a word or keyword that appears in the tweet.

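// Legacy MongoDB Java driver classes from com.mongodb (Mongo, DB, DBCollection, BasicDBObject, DBCursor)
Mongo mongo = new Mongo("localhost", 27017);
DB db = mongo.getDB("TwitterData");
DBCollection collection = db.getCollection("publicTweets");
// Return only the tweet text and leave out _id
BasicDBObject fields = new BasicDBObject().append("tweet", 1).append("_id", 0);
// Match tweets whose text contains "autobiography" (the regex is case-sensitive)
BasicDBObject query = new BasicDBObject("tweet", new BasicDBObject("$regex", "autobiography"));
DBCursor cur = collection.find(query, fields);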

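What I want to do is use Map-Reduce to classify the tweets by keyword and pass them to the reduce function to count the number of tweets under each category, somewhat like what you see here. In that example he is counting pages, since that is a plain number. I want to do something like: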

"if (this.tweet.contains("kword1")) "+
"category = 'kword1 tweets'; " + 
"else if (this.tweet.contains("kword2")) " + 
"category = 'kword2 tweets';

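and then use the reduce function to get the counts, just as in the example program.

I know the syntax isn't correct, but that is pretty much what I want to do. Is there any way to achieve it? Thanks!

PS: Oh, and I am coding in Java, so Java syntax would be highly appreciated. Thanks!

The output of the code posted above looks like this: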

{ "tweet" : "An autobiography is a book that reveals nothing bad about its writer except his memory."}
{ "tweet" : "I refuse to read anything that's not real the only thing I've read since biff books is Jordan's autobiography #lol"}
{ "tweet" : "well we've had the 2012 publication of Ashley's Good Books, I predict 2013 will be seeing an autobiography ;)"}

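That is, of course, for all tweets containing the word "autobiography". What I want is to use this in the map function, categorize each such tweet as "autobiography tweets" (and likewise for other keywords), then send it to the reduce function to count everything and return the number of tweets that contain each word.

Something like: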

{"_id" : "Autobiography Tweets" , "value" : { "publicTweets" : 3.0}}
{"_id" : "Biography Tweets" , "value" : { "publicTweets" : 15.0}}

score 7 · Accepted Answer

You might want to try the following:

    String map = "function() { " +
                 "    var regex1 = new RegExp('autobiography', 'i'); " +
                 "    var regex2 = new RegExp('book', 'i'); " +
                 "    if (regex1.test(this.tweet) ) " +
                 "         emit('Autobiography Tweet', 1); " +
                 "    else if (regex2.test(this.tweet) ) " +
                 "         emit('Book Tweet', 1); " +
                 "    else " +
                 "       emit('Uncategorized Tweet', 1); " +
                 "}";

    String reduce = "function(key, values) { " +
                    "    return Array.sum(values); " +
                    "}";

    MapReduceCommand cmd = new MapReduceCommand(collection, map, reduce,
             null, MapReduceCommand.OutputType.INLINE, null);
    MapReduceOutput out = collection.mapReduce(cmd);

    try {
        for (DBObject o : out.results()) {
            System.out.println(o.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
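
The map and reduce strings above run as JavaScript on the server, so the categorization logic is hard to check from Java. As a rough illustration only, here is a minimal plain-Java sketch of the same first-match-wins counting. It does not talk to MongoDB; the class and method names are hypothetical, and the second and third sample tweets are made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the MongoDB driver.
public class TweetCategorizer {
    // Same case-insensitive patterns as the JavaScript map function.
    private static final Pattern AUTOBIOGRAPHY =
            Pattern.compile("autobiography", Pattern.CASE_INSENSITIVE);
    private static final Pattern BOOK =
            Pattern.compile("book", Pattern.CASE_INSENSITIVE);

    // Mirrors the map step: each tweet yields exactly one category key.
    public static String categorize(String tweet) {
        if (AUTOBIOGRAPHY.matcher(tweet).find()) {
            return "Autobiography Tweet";
        } else if (BOOK.matcher(tweet).find()) {
            return "Book Tweet";
        }
        return "Uncategorized Tweet";
    }

    // Mirrors emit(key, 1) followed by summing the values per key in reduce.
    public static Map<String, Integer> countByCategory(List<String> tweets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tweet : tweets) {
            counts.merge(categorize(tweet), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "An autobiography is a book that reveals nothing bad about its writer except his memory.",
                "Just bought a new book",   // made-up sample
                "Good morning");            // made-up sample
        // Prints {Autobiography Tweet=1, Book Tweet=1, Uncategorized Tweet=1}
        System.out.println(countByCategory(tweets));
    }
}
```

Note that the first sample mentions both "autobiography" and "book" but is counted only once, under "Autobiography Tweet", because the `else if` chain stops at the first match, just as in the map function.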

score 5 · Answer

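Although you have already accepted Kay's answer and this one will probably be overlooked, I would like to propose an alternative solution.

The MongoDB documentation has an article on how to do full-text search in Mongo. To allow a text-based field to be searched quickly for individual words, it suggests preparing the documents by splitting the text field into an array of individual words, storing that array in the document alongside the full text, and creating an index on the array.

After that, you can find all documents containing a specific word very quickly, because your search query can (1) use an index and (2) avoid regular expressions, which can be very expensive.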
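
The preparation step described above could be sketched in Java roughly as follows. This is only a sketch under assumptions: the class name `TweetTokenizer`, the method `toWords`, and the choice to lowercase and deduplicate the words are mine, not something the referenced article prescribes. The resulting list would be stored next to the full text in a field of your choosing (for example a hypothetical `words` field) and that field would then be indexed.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the MongoDB driver.
public class TweetTokenizer {
    // Splits a tweet into distinct lowercase words, keeping first-seen order.
    public static List<String> toWords(String tweet) {
        Set<String> words = new LinkedHashSet<>();
        // Treat any run of characters that are not letters or digits as a separator.
        for (String token : tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {   // a leading separator produces an empty first token
                words.add(token);
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        // Prints [an, autobiography, is, a, book]
        System.out.println(toWords("An autobiography is a book."));
    }
}
```

With the legacy driver used in the question, the list could be appended to each document before saving, and a word lookup would then be an exact match such as `new BasicDBObject("words", "autobiography")` instead of a `$regex` query. One trade-off to keep in mind: exact word matching is stricter than the substring regex, so a tweet containing only "books" would no longer match a search for "book".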
