We are building a simplified version of a search engine on top of MongoDB.
Sample dataset
{ "_id" : 1, "dept" : "tech", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 2, "dept" : "tech", "updDate": ISODate("2014-07-27T09:45:35Z"), "description" : "wireless red mouse" }
{ "_id" : 3, "dept" : "kitchen", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat" }
{ "_id" : 4, "dept" : "kitchen", "updDate": ISODate("2014-05-27T09:45:35Z"), "description" : "red peeler" }
{ "_id" : 5, "dept" : "food", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green apple" }
{ "_id" : 6, "dept" : "food", "updDate": ISODate("2014-01-27T09:45:35Z"), "description" : "red potato" }
{ "_id" : 7, "dept" : "food", "updDate": ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 8, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
{ "_id" : 9, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
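We want to avoid paginating results with "offset-limit". To do that, we use the "seek method": we modify the "where/match" clause of the query so that it can use an index instead of walking the collection to reach the desired results. For more information about the seek method, I strongly recommend reading http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way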
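To make the difference between the two approaches concrete, here is a minimal in-memory sketch in plain JavaScript. The helper names `offsetPage`, `comesAfter` and `seekPage` are made up for illustration, the rows are a subset of the sample documents reduced to two sort keys, and dates are kept as ISO strings so the snippet runs outside the mongo shell. In a database the seek predicate could be answered from an index; here it is only a filter over an array.

```javascript
// Hypothetical in-memory illustration; none of these helpers come from MongoDB.
// Rows are already sorted by updDate descending, then _id descending.
const rows = [
  { _id: 7, updDate: "2014-08-28T09:45:35Z" },
  { _id: 9, updDate: "2014-08-27T09:45:35Z" },
  { _id: 8, updDate: "2014-08-27T09:45:35Z" },
  { _id: 5, updDate: "2014-04-27T09:45:35Z" },
  { _id: 3, updDate: "2014-04-27T09:45:35Z" },
];

// Offset-limit: the first `offset` rows are still visited and thrown away.
function offsetPage(sorted, offset, limit) {
  return sorted.slice(offset, offset + limit);
}

// Seek: a row qualifies only if it sorts strictly after the last row shown.
// ISO-8601 UTC strings in the same format compare correctly as plain strings.
function comesAfter(row, last) {
  if (row.updDate !== last.updDate) return row.updDate < last.updDate;
  return row._id < last._id; // tie-breaker on the unique key
}

function seekPage(sorted, last, limit) {
  const remaining = last ? sorted.filter((row) => comesAfter(row, last)) : sorted;
  return remaining.slice(0, limit);
}

const firstPage = seekPage(rows, null, 2);
const secondPage = seekPage(rows, firstPage[firstPage.length - 1], 2);
console.log(secondPage.map((row) => row._id)); // [ 8, 5 ]
```

Both approaches return the same second page here; the point of the seek variant is that its condition depends only on the last row seen, not on how many rows precede it.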
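Search engines usually sort results by score and then by update date in descending order. To achieve this, we use the text search feature inside an aggregation pipeline, as shown below.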
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, _id: -1})
First page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, {$limit: 2 }] )
{ "_id" : 5, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green apple", "score" : 0.75 }
{ "_id" : 3, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat", "score" : 0.75 }
Second page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.75}} , { "$and" : [ { "score" : { "$eq" : 0.75}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-04-27T09:45:35Z")}},{ "$and" : [ { "updDate": { "$eq" : ISODate("2014-04-27T09:45:35Z")}} , { "_id" : { "$lt" : 3}}]}]}]}]}},{$limit: 2 }] )
{ "_id" : 7, "updDate" : ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
{ "_id" : 9, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
Last page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]} , "$text" : { "$language" : "en", "$search" : "green"} } }, { $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.6666666666666666}} , { "$and" : [ { "score" : { "$eq" : 0.6666666666666666}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-08-27T09:45:35Z")}} , { "$and" : [ { "updDate" : { "$eq" : ISODate("2014-08-27T09:45:35Z")}} , { "_id" : { "$lt" : 9}}]}]}]}]}}, {$limit: 2 }] )
{ "_id" : 8, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
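Note how we sort the results by score, updDate and _id, and how in the second match stage we try to paginate using the score, the update date and finally the _id of the last document of the previous page.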
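The second match stage always has the same shape, so it can be generated from the last document of the previous page. The following is only a sketch: `buildSeekMatch` is a hypothetical helper name, and a plain `Date` stands in for `ISODate` so the snippet also runs outside the mongo shell.

```javascript
// Hypothetical helper: builds the pagination $match stage from the last
// document of the previous page (needs its score, updDate and _id).
function buildSeekMatch(last) {
  return {
    $match: {
      $or: [
        // Strictly lower score always comes later in the descending sort.
        { score: { $lt: last.score } },
        {
          $and: [
            { score: { $eq: last.score } },
            {
              $or: [
                // Same score: an older update date comes later.
                { updDate: { $lt: last.updDate } },
                // Same score and date: fall back to the unique _id.
                { $and: [{ updDate: { $eq: last.updDate } }, { _id: { $lt: last._id } }] },
              ],
            },
          ],
        },
      ],
    },
  };
}

// Last document of the first page shown above.
const lastOfFirstPage = { _id: 3, updDate: new Date("2014-04-27T09:45:35Z"), score: 0.75 };
const seekStage = buildSeekMatch(lastOfFirstPage);
console.log(JSON.stringify(seekStage, null, 2));
```

The resulting stage would be placed after the $sort stage and before $limit, exactly where the hand-written condition sits in the second-page query.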
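The index was created taking into account that text index prefix fields cannot cover text queries, see issue https://jira.mongodb.org/browse/SERVER-13018, although I'm not sure whether this applies to our case.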
Since the "executionStats" and "allPlansExecution" explain modes do not work in the aggregation framework, see https://jira.mongodb.org/browse/SERVER-19758, I don't know how MongoDB tries to resolve the query.
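One thing that may still help is the plain `explain` option of `aggregate`, which to my knowledge has been available since 2.6 and reports the query plan without execution statistics. The snippet below is a sketch under that assumption; the `db` guard only exists so it can be loaded outside the mongo shell.

```javascript
// The four stages of the first-page pipeline from the question.
const pipeline = [
  { $match: { dept: { $in: ["food", "kitchen"] }, $text: { $language: "en", $search: "green" } } },
  { $project: { score: { $meta: "textScore" }, description: 1, updDate: 1, _id: 1 } },
  { $sort: { score: -1, updDate: -1, _id: -1 } },
  { $limit: 2 },
];

// `db` exists only inside the mongo shell; the guard lets the snippet load elsewhere.
if (typeof db !== "undefined") {
  // Assumption: the server accepts the `explain` option on aggregate.
  // This shows the planner's view of the pipeline, not execution statistics.
  printjson(db.inventory.aggregate(pipeline, { explain: true }));
}
```

This does not replace "executionStats", but it should at least show which index the initial $match stage is planned to use.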
Index intersection also does not work with text search, see https://jira.mongodb.org/browse/SERVER-3071 (resolved in 2.5.5) and http://blog.mongodb.org/post/87790974798/efficient-indexing-in-mongodb-26, where the author says:
As of version 2.6.0, you cannot intersect with geo or text indices and you can intersect at most 2 separate indices with each other. These limitations are likely to change in a future release.
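After reading sections 3.4 (Text Search Tutorials) and 3.5 (Indexing Strategies) of https://docs.mongodb.org/manual/MongoDB-indexes-guide-master.pdf, I have not reached any clear conclusion.
So, from a text search perspective, what is the best indexing strategy for this collection?
One index for the first match stage and another for the second (pagination) match stage?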
db.inventory.createIndex({description:"text", dept: -1})
db.inventory.createIndex({updDate: -1, _id: -1})
A compound index covering the fields of both match stages?
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, _id: -1})
None of the above?
Thanks