
I read that the Hadoop framework supports joins like Reduce-Side, Replicated and Composite joins. Is there any form of support for these kinds of joins in MongoDB?

My use case is that each user has a set of events and their occurrences, one record per event. A sample is:

{_id: uniqueEventId, event: login, userId: abc}

There is another collection with details about the users, and the user attributes are not fixed. A sample document is:

{_id: abc, city: "SF", state: CA, customfield1: value1...}

The result I need is an aggregation on (userId, event), populated with the user details as well. A sample:

{userId: abc, event: login, count:23, city: SF, state: CA}

so that I can query which state or city has the most login events, and run similar kinds of queries.

I considered embedding the user document inside each event document, but if a user attribute changes, I would need to update literally all of the events collection, which would be huge.

I looked at the approach for merging two collections from this link, but it is not quite what I need, because the key I need to run the reduce function on is a compound key (userId + event).


1 Answer


I would like to note up front that this JOIN cannot be used in real time by your app, and that by doing it you are breaking MongoDB's intended usage; however, yes, there is a way to map-reduce a JOIN.

In your first MR, the one that processes documents like:

{_id: abc, city: "SF", state: CA, customfield1: value1...}

you just emit this row and write it to a new collection; a minimal sketch of this first pass is below.
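This sketch assumes the user details live in a collection named users and that the first MR writes to a working collection named mr_join (both names are placeholders, not from the original post):

var userMap = function(){
    // Key on the user's _id so it matches the userId key of the second MR
    emit(this._id, {city: this.city, state: this.state});
};

var userReduce = function(key, values){
    // Each user _id appears only once, so this is effectively an identity reduce
    return values[0];
};

db.users.mapReduce(userMap, userReduce, { out: "mr_join" });

Then, in your second MR, where you get: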

{userId: abc, event: login, count:23, city: SF, state: CA}

you make userId the _id:

var map = function(){
    emit(this.userId, {event: this.event, count: 1 /* etc. */});
};

Or a compound key:

var map = function(){
    emit({o: this.userId, e: this.event}, {event: this.event, count: 1 /* etc. */});
};
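Either way you need a matching reduce. Here is a sketch for the compound-key case, where every value under a given key shares the same event; note that MongoDB may invoke reduce repeatedly on partial results, so the function must be re-reduce safe, which summing a count field is:

var reduce = function(key, values){
    var result = {event: values[0].event, count: 0};
    values.forEach(function(v){
        // v.count is 1 for freshly mapped values, or a partial sum on re-reduce
        result.count += v.count;
    });
    return result;
};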

Then you reduce as normal, but you change the call to the server so that the out option of the MR points at the result of your first MR, adding a reduce or merge action to the out option so that the two collections join on duplicate _ids:

db.col.mapReduce(map, reduce, { out: { merge: "collection_from_first_mr" } })
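For reference, here are both out actions the call above alludes to, reusing the placeholder names from the earlier sketch (events as the source collection, mr_join as the first MR's output):

// merge: a result from this MR overwrites any mr_join document with the same _id
db.events.mapReduce(map, reduce, { out: { merge: "mr_join" } });

// reduce: on an _id collision, MongoDB runs your reduce function over the existing
// value and the new one, folding the two together instead of overwriting
db.events.mapReduce(map, reduce, { out: { reduce: "mr_join" } });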

That is basically how it works.

Going back to my note at the start of this answer: these are not SQL JOINs and they should not be treated as such. The JS engine is:

  • Slow
  • Single threaded
  • Not really MongoDB or server-side code; it is a JS engine built into MongoDB

If the collection gets out of control, or this command is run in real time by your app, you could easily see performance problems: any other JavaScript that needs to run on your server (remember, the engine is single-threaded) will be held up from doing productive work.

Edit

so that I can query which state or city has the most login events, and run similar kinds of queries.

Wouldn't the login occur in that city, though? So maybe the login document should contain city and state fields. These would not need updating, and it would be odd if they did, since that login happened there and nowhere else. So:

I would need to update literally all of the events collection, which would be huge.

Becomes obsolete, since the login event will not need updating: it happened in the state/city it was recorded in, which is correct.

So I would actually go for a schema of:

{_id: uniqueEventId, event: login, userId: abc, state: '', city: ''}

And aggregate on that.
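A sketch of that aggregation using the aggregation framework, assuming the events live in a collection named events (a placeholder):

db.events.aggregate([
    // keep only login events
    { $match: { event: "login" } },
    // count logins per state
    { $group: { _id: "$state", count: { $sum: 1 } } },
    // state with the most logins first
    { $sort: { count: -1 } }
]);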

Answered 2012-12-31T09:14:52.767