
I've been analyzing the best ways to improve the performance of our SOLR index, and will likely shard the current index so that searches become distributed.

However, given that our index is over 400GB and contains roughly 700 million documents, reindexing the data looks daunting. I've been toying with the idea of copying the index and then deleting documents as a more efficient way to create a sharded environment.

Unfortunately, the modulus operator doesn't seem usable in a query against a document's internal numeric ID. What other partitioning strategies could I use that work via delete-by-query rather than a full reindex?
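Since the internal Lucene doc ID isn't queryable, one common workaround is to derive a shard number by hashing a stored unique-key field instead. Below is a minimal sketch of that idea; the field name `shard_key`, the shard count, and the use of MD5 are all assumptions for illustration, not part of any Solr API.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(doc_id: str) -> int:
    """Deterministically map a document's unique key to a shard number.

    The hash is computed from the stored unique-key field (assumed to be
    a string), since the internal Lucene doc ID cannot be queried.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def purge_query(my_shard: int) -> str:
    """Delete-by-query each shard copy would run to drop documents that
    do not belong to it (the `shard_key` field name is hypothetical and
    would have to be indexed ahead of time)."""
    return f"-shard_key:{my_shard}"
```

The catch is that this only helps if the hash value (or the shard number) was already indexed as a field; otherwise there is nothing for the delete-by-query to match, which is why index-splitting tools that operate on the Lucene segments directly are often the more practical route.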


3 Answers


There is a Lucene tool that does this job, IndexSplitter; see the article linked here (it's in Japanese, run it through Google Translate...).

Answered 2012-06-29T20:09:31.763

If you can find a logical key to partition the data, it will help in more than one way. For example, could you split these documents across shards in some chronological order?

We have a similar situation: an index of 250M docs split across shards based on their created date. A major use case involves searching across these shards over a range of created dates, so a search is submitted only to the shards that contain docs within the given date range. Logically partitioned data can have other benefits too, e.g. separate capacity planning, or applying different qualities of service to different search terms.
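The routing idea above can be sketched as a small lookup: given a created-date range, return only the shards whose time window overlaps it, and submit the query just to those (e.g. via Solr's `shards` request parameter). The one-shard-per-year layout and shard names below are hypothetical.

```python
from datetime import date

# Hypothetical layout: one shard per calendar year of created_date.
SHARDS = {year: f"shard-{year}" for year in range(2008, 2013)}


def shards_for_range(start: date, end: date) -> list[str]:
    """Return only the shards whose year overlaps [start, end]."""
    return [SHARDS[y] for y in range(start.year, end.year + 1) if y in SHARDS]


# A query for docs created 2010-03-01 .. 2011-06-30 would then be sent
# only to shard-2010 and shard-2011 instead of fanning out to all shards.
```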

Answered 2012-06-29T20:43:51.473

I answered this in another StackOverflow question. I wrote a command-line utility (Hash-Based Index Splitter) to split a Lucene index based on a hash of each document's ID.

Answered 2012-10-12T10:54:32.607