I have streamed and saved roughly 250,000 tweets into MongoDB. Here I'm retrieving them, as you can see, based on a word or keyword that appears in the tweet.
import com.mongodb.*;

Mongo mongo = new Mongo("localhost", 27017);
DB db = mongo.getDB("TwitterData");
DBCollection collection = db.getCollection("publicTweets");

// Return only the tweet text (no _id), and match tweets containing the keyword
BasicDBObject fields = new BasicDBObject().append("tweet", 1).append("_id", 0);
BasicDBObject query = new BasicDBObject("tweet", new BasicDBObject("$regex", "autobiography"));
DBCursor cur = collection.find(query, fields);

while (cur.hasNext()) {
    System.out.println(cur.next());
}
What I'd like to do is use Map-Reduce to classify the tweets by keyword and pass them to the reduce function to count the number of tweets under each category, somewhat like what you see here. In that example he is counting pages, since it's a simple number. I want to do something like:
"if (this.tweet.contains('kword1')) " +
"  category = 'kword1 tweets'; " +
"else if (this.tweet.contains('kword2')) " +
"  category = 'kword2 tweets';"
And then use the reduce function to get the counts, just as in the example program.
I know the syntax isn't right, but that's what I want to do. Is there a way to achieve it? Thanks!
PS: Oh, and I'm coding in Java, so Java syntax would be highly appreciated. Thanks!
The output of the posted code looks like this:
{ "tweet" : "An autobiography is a book that reveals nothing bad about its writer except his memory."}
{ "tweet" : "I refuse to read anything that's not real the only thing I've read since biff books is Jordan's autobiography #lol"}
{ "tweet" : "well we've had the 2012 publication of Ashley's Good Books, I predict 2013 will be seeing an autobiography ;)"}
That works, of course, for all tweets containing the word "autobiography". What I'd like is to use that inside the map function to classify them as "Autobiography Tweets" (and likewise for other keywords), then send them to the reduce function to count everything and return the number of tweets containing each word.
Something like:
{"_id" : "Autobiography Tweets" , "value" : { "publicTweets" : 3.0}}
{"_id" : "Biography Tweets" , "value" : { "publicTweets" : 15.0}}
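One possible sketch of this, using the same legacy Java driver API as the posted code: the map function (a JavaScript string) picks a category from the tweet text and emits a count of 1, and the reduce function sums those counts per category. The class name `TweetCategorizer`, the keyword list, and the category names are illustrative choices here, not something confirmed by the question.

```java
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

public class TweetCategorizer {

    // Map: choose a category from the tweet text, then emit a count of 1.
    // Note: check 'autobiography' before 'biography', since the former
    // contains the latter as a substring.
    static final String MAP =
        "function() { " +
        "  var category = 'Other Tweets'; " +
        "  if (this.tweet.indexOf('autobiography') != -1) " +
        "    category = 'Autobiography Tweets'; " +
        "  else if (this.tweet.indexOf('biography') != -1) " +
        "    category = 'Biography Tweets'; " +
        "  emit(category, { publicTweets: 1 }); " +
        "}";

    // Reduce: sum the per-tweet counts emitted for each category.
    static final String REDUCE =
        "function(key, values) { " +
        "  var total = 0; " +
        "  values.forEach(function(v) { total += v.publicTweets; }); " +
        "  return { publicTweets: total }; " +
        "}";

    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("TwitterData");
        DBCollection collection = db.getCollection("publicTweets");

        // INLINE output returns the results directly to the client
        // instead of writing them to an output collection.
        MapReduceCommand cmd = new MapReduceCommand(
            collection, MAP, REDUCE, null,
            MapReduceCommand.OutputType.INLINE, null);
        MapReduceOutput out = collection.mapReduce(cmd);

        for (DBObject result : out.results()) {
            System.out.println(result);
        }
    }
}
```

If you'd rather store the results, pass a collection name instead of `null` and use `MapReduceCommand.OutputType.REPLACE`; the documents in that collection should then look like the `{"_id" : ..., "value" : { "publicTweets" : ... }}` samples above.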