mongodb - Database and search experts: weighted faceted search on multi-valued fields, with a twist

Question

This question pretty much boils down to "faceted search over multiple multi-valued fields, sorted by weight instead of count".

Database

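I have about 10 million events. Each event has several editions, and each edition is described by tags. There are 5 tag types (place, speaker, participant, topic, industry).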

{
    title: "CES",
    editions: [
        {
            date: "2013-02-01",
            tags: [ {label: "Eric Schmidt", type: "speaker", "popularity": 50}, {label: "Paris", type: "place", "popularity": 30} ]
        },
        {
            date: "2012-01-23",
            tags: [ ... ] 
        }
    ]
}

Data logic

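Tags are hierarchical: for example, "Eric Schmidt" is filed under Google, which is in turn filed under Tech Companies. So whenever Eric attends an event, all three tags are associated with that event.
Different tags can have different popularity: "Eric Schmidt" might have a popularity of 100, while "Eileen Naughton" has a popularity of 10.
Popularity does not propagate through the hierarchy. If "Eric Schmidt" leaves Google and joins Foursquare, his popularity is still 100 and Foursquare's is still 50.
If at some point we find out that, say, another "participant" attended an edition, we need to be able to add that person as a tag.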
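
The hierarchy rule above can be sketched in plain JavaScript. This is only an illustration, assuming the comma-delimited `path` convention used in the stored documents of this question (for example ",Tech Companies,Google,"); the helper name `expandTag` is hypothetical.

```javascript
// Hypothetical helper: list every label an event must carry for one tag,
// given a comma-delimited ancestor path such as ",Tech Companies,Google,".
function expandTag(tag) {
  // Splitting on "," leaves empty strings at both ends; drop them.
  const ancestors = (tag.path || "").split(",").filter((s) => s.length > 0);
  return [...ancestors, tag.label];
}

const labels = expandTag({
  label: "Eric Schmidt",
  type: "speaker",
  path: ",Tech Companies,Google,",
});
console.log(labels); // [ 'Tech Companies', 'Google', 'Eric Schmidt' ]
```

Because popularity is stored per tag and not inherited, moving a speaker to another company would only change `path`, not `popularity`.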

Search requirements

Now, imagine a left-hand menu with 4 sections:

Places
------------
Paris
London
New York
[more]

Speakers
----------
Google
Facebook
Mark Zuckerberg
[more]

And so on.

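Whenever the user clicks on a tag, I want the menu to reflect the results and allow further drill-down (faceted search). The twist is that when deciding which of "Google", "Eric Schmidt" and "Foursquare" to show among the top three tags of a section, I want the most popular tags to show up higher, ranked by [number of matching events] * [tag popularity]. So if "Foursquare" has 3 matching events and "Eric Schmidt" has only one, Foursquare should be shown first with a score of 3 * 50 = 150, against Schmidt's score of 1 * 100 = 100.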
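
The scoring rule can be written out as a small worked example. This is a sketch in plain JavaScript, independent of MongoDB; the function name `rankTags` and the field name `matchingEvents` are assumptions made for illustration.

```javascript
// score = (number of distinct matching events) * (tag popularity)
function rankTags(tags) {
  return tags
    .map((t) => ({ label: t.label, score: t.matchingEvents * t.popularity }))
    .sort((a, b) => b.score - a.score); // highest score first
}

const ranked = rankTags([
  { label: "Eric Schmidt", matchingEvents: 1, popularity: 100 },
  { label: "Foursquare", matchingEvents: 3, popularity: 50 },
]);
console.log(ranked.map((t) => t.label + ": " + t.score).join(", "));
// Foursquare: 150, Eric Schmidt: 100
```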

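Also, ideally, if I select "Google", the "Speakers" section should not return speakers outside Google, even if the matching events also list "Zuckerberg" with a huge popularity of 200. So the returned tags should sit "below" the current selection in each section, and their order should follow the scoring logic above.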
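
The "below the current selection" rule might look like this in plain JavaScript. It is a sketch that assumes each tag carries a comma-delimited `path` of its ancestors; the names `tagsBelow` and `escapeRegExp` are hypothetical, and the Facebook path in the sample data is invented for the example.

```javascript
// Escape regex metacharacters so a label is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Keep only tags of the section's type that sit below the selected tag.
// A tag is "below" the selection when the selection appears in its path.
function tagsBelow(tags, type, selectedLabel) {
  const inSubtree = new RegExp("," + escapeRegExp(selectedLabel) + ",");
  return tags.filter((t) => t.type === type && inSubtree.test(t.path || ""));
}

const sample = [
  { label: "Eric Schmidt", type: "speaker", popularity: 100, path: ",Tech Companies,Google," },
  { label: "Zuckerberg", type: "speaker", popularity: 200, path: ",Tech Companies,Facebook," },
  { label: "Paris", type: "place", popularity: 30, path: ",Europe,France," },
];
const belowGoogle = tagsBelow(sample, "speaker", "Google").map((t) => t.label);
console.log(belowGoogle); // [ 'Eric Schmidt' ]
```

Zuckerberg is dropped despite the higher popularity, because "Google" does not appear in his path.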

Current MongoDB solution

Store a separate document for each edition:

{
    event: "CES",
    date: "2013-02-01",
    tags: [ {label: "Eric Schmidt", type: "speaker", "popularity": 50, path: ",Tech Companies,Google,"}, {label: "Paris", type: "place", "popularity": 30, path: ",Europe,France,"} ]
},
{
    event: "CES",
    date: "2012-01-23",
    tags: [ ... ] 
}
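
For illustration, going from the nested event shape shown earlier to one document per edition could be sketched like this in plain JavaScript. The function name `toEditionDocs` is hypothetical, and the second edition's tag list is left empty here only because it is elided in the question.

```javascript
// Turn one nested event document into one flat document per edition.
function toEditionDocs(event) {
  return event.editions.map((edition) => ({
    event: event.title,
    date: edition.date,
    tags: edition.tags,
  }));
}

const editionDocs = toEditionDocs({
  title: "CES",
  editions: [
    {
      date: "2013-02-01",
      tags: [
        { label: "Eric Schmidt", type: "speaker", popularity: 50, path: ",Tech Companies,Google," },
        { label: "Paris", type: "place", popularity: 30, path: ",Europe,France," },
      ],
    },
    { date: "2012-01-23", tags: [] }, // placeholder: tags elided in the question
  ],
});
console.log(editionDocs.map((d) => d.event + " " + d.date).join("; "));
// CES 2013-02-01; CES 2012-01-23
```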

Using the aggregation framework

*One query per tag type (5 queries per request)*

db.events.aggregate([
// keep only the editions that carry every selected tag
{
    '$match': {'tags.label': {'$all': ["selected tag 1", "selected tag2", ...]}}
},
{
    '$unwind': '$tags'
},
// group by events, so we can later sum each tag's popularity only once per event, not per event edition
{
    '$group': {
        '_id': '$event',
        'taglist': {
            '$addToSet': {
                'label': '$tags.label',
                'type': '$tags.type',
                'popularity': '$tags.popularity',
                // path must be carried along for the subtree match below
                'path': '$tags.path'
            }
        }
    }
},
{
    '$unwind': '$taglist'
},
{
    '$match': {
        'taglist.type': "speaker",
        /* haven't tested this path-matching, but it should work
        to only get the tags that are in the bottom tree
        of the current selected speaker tag;
        not anchored with ^, since the selected tag can sit at any depth of the path */
        'taglist.path': /,selected speaker tag,/
    }
},
// one row per tag: popularity summed once per matching event
{
    '$group': {
        '_id': '$taglist.label',
        'score': {
            '$sum': '$taglist.popularity'
        }
    }
}
]);

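OK, this should work algorithmically, but performance-wise it will definitely not hold up for 50M event editions, each with thousands of possible tags.

Can anyone think of another approach? Can this one be optimized in any way other than with Map/Reduce, which as I understand it is too slow to run on the fly for each user?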

score 0 · Accepted Answer

Depending on how "real-time" your search needs to be, have you considered using incremental map/reduce?

http://docs.mongodb.org/manual/tutorial/perform-incremental-map-reduce/
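
As a rough sketch of what that could look like: the two functions below are written in the style expected by `db.collection.mapReduce()`, and the linked tutorial runs such a job with a `query` limited to new documents and `out: { reduce: "<collection>" }` so that fresh results are merged into earlier ones. Everything else here is an assumption for illustration: the value shape, the in-memory stand-ins `emit` and `runIncremental` (which only mimic that merge so the example runs without a server), and the sample data. It also precomputes scores only for the case where no tag is selected yet.

```javascript
// Stand-in for MongoDB's emit(); collects pairs so the sketch runs locally.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map: called with `this` bound to one edition document.
function mapFn() {
  for (var i = 0; i < this.tags.length; i++) {
    var tag = this.tags[i];
    emit(tag.label, { popularity: tag.popularity, events: [this.event] });
  }
}

// Reduce: union the event names so each event counts once per tag.
// The result has the same shape as the inputs, so it can be re-reduced.
function reduceFn(key, values) {
  var events = [];
  values.forEach(function (v) {
    v.events.forEach(function (e) {
      if (events.indexOf(e) === -1) events.push(e);
    });
  });
  return { popularity: values[0].popularity, events: events };
}

// Hypothetical in-memory stand-in for a run with out: { reduce: ... }:
// new emits are reduced together with the previously stored value.
function runIncremental(docs, previous) {
  emitted.length = 0;
  docs.forEach(function (doc) { mapFn.call(doc); });
  var grouped = {};
  emitted.forEach(function (pair) {
    (grouped[pair.key] = grouped[pair.key] || []).push(pair.value);
  });
  var result = Object.assign({}, previous);
  Object.keys(grouped).forEach(function (key) {
    var values = result[key] ? [result[key]].concat(grouped[key]) : grouped[key];
    result[key] = reduceFn(key, values);
  });
  return result;
}

var speaker = { label: "Eric Schmidt", type: "speaker", popularity: 100 };
var firstRun = runIncremental(
  [{ event: "CES", date: "2012-01-23", tags: [speaker] }], {});
var secondRun = runIncremental([
  { event: "CES", date: "2013-02-01", tags: [speaker] },
  { event: "Other Event", date: "2013-03-01", tags: [speaker] }, // invented sample
], firstRun);

var schmidt = secondRun["Eric Schmidt"];
console.log(schmidt.events.length * schmidt.popularity); // 200
```

The second CES edition does not raise the score, since CES was already counted in the first run. Keeping a list of event names per tag is the simplest way to count each event once, but at the data sizes in the question that list could grow large, so this is a starting point to measure and not a tuned design.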
