
We've got 50,000,000 (and growing) documents which we want to be able to search.

Each "document" is in reality a page of a larger document, but the granularity required is at the page level.

Each document therefore has a few bits of metadata (e.g., which larger document it belongs to).

We originally built this using Sphinx, which has served us quite well but is getting slow, despite having quite generous hardware thrown at it (via Amazon AWS).

There are new requirements coming through that mean we have to be able to pre-filter the database before searching, i.e. to search only a subset of the 50M documents based on some aspect of the metadata (e.g., "search only documents added in the last 6 months", or "search only documents belonging to this arbitrary list of parent documents").

One significant requirement is that we group search results by parent document, e.g. to return only the first match in a parent document in order to show the user a wider range of parent documents that match in the first page of results, rather than loads of matches in the first parent document followed by loads of matches in the second, etc. We would then give the user the option to search pages within only one specific parent document.

The solution doesn't have to be "free" and there is a bit of budget to spend.

The content is sensitive and needs to be protected so we can't simply let Google index it for us, at least not in any way that would allow the general public to come across it.

I've looked at using Sphinx with even more resources (putting an index of 50M documents into memory is sadly not an option within our budget) and I've looked at Amazon CloudSearch but it seems that we'd have to spend >$4k per month and that's beyond the budget.

Any suggestions? Something deployable within AWS is a bonus. I'm aware that we may be asking for the unobtainable but if you think that's the case, please say so (and give reasons!)


1 Answer


50M documents sounds like quite a feasible task for Sphinx.

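"We originally built this using Sphinx which has served quite well, but is getting slow, despite having quite generous hardware thrown at it (via Amazon AWS)."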

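I second the comment above that suggests sharding. Sphinx lets you split a large index into multiple shards, each served by its own agent. You can run the agents on the same server or distribute them across multiple AWS instances.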
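As a rough illustration of the idea (not Sphinx's own code), the sketch below splits documents into shards by id modulo the shard count and merges per-shard hits into one ranked list, which is conceptually what happens when a query is fanned out to several agents. The shard count, the `shard_for` and `merge_top_k` helpers, and the (score, doc_id) hit format are all assumptions made for illustration.

```python
import heapq

NUM_SHARDS = 4  # assumed shard count; choose it from the cores/instances you have


def shard_for(doc_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Assign a document to a shard by id modulo (a simple, even split)."""
    return doc_id % num_shards


def merge_top_k(per_shard_hits, k):
    """Merge per-shard (score, doc_id) hit lists into one global top-k list."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # Highest score first; ties broken by lower doc_id for determinism.
    return heapq.nsmallest(k, all_hits, key=lambda h: (-h[0], h[1]))
```

Each shard only has to search its own slice of the 50M pages, so the shards can work in parallel and the final merge is cheap.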

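"There are new requirements coming through that mean we have to be able to pre-filter the database before searching, i.e. to search only a subset of the 50M documents based on some aspect of the metadata"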

Assuming these metadata fields are indexed as attributes, you can add SQL-like filters to each search query (e.g. doc_id IN (1,2,3,4) AND date_created > '2014-01-01').
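A minimal sketch of building such a filtered query in SphinxQL follows. It assumes an index named `pages` (hypothetical), that `doc_id` is an integer attribute, and that `date_created` is stored as a Unix timestamp attribute, so the date is compared as an integer. The `build_filtered_query` helper is made up for this example; the returned SQL and parameters would be handed to whatever MySQL-protocol client you use to talk to searchd.

```python
from datetime import datetime, timezone


def build_filtered_query(terms, doc_ids=None, created_after=None, index="pages"):
    """Return (sql, params) for a full-text query with attribute filters."""
    sql = f"SELECT id, doc_id FROM {index} WHERE MATCH(%s)"
    params = [terms]  # the search terms are left for the client library to escape
    if doc_ids:
        # Integer attributes are inlined after an int() cast, so no quoting is needed.
        id_list = ", ".join(str(int(d)) for d in doc_ids)
        sql += f" AND doc_id IN ({id_list})"
    if created_after is not None:
        # Assumes date_created holds Unix time, so compare against an integer.
        sql += f" AND date_created > {int(created_after.timestamp())}"
    return sql, params
```

For example, `build_filtered_query("annual report", [1, 2, 3, 4], datetime(2014, 1, 1, tzinfo=timezone.utc))` restricts the search to four parent documents and to pages created since the start of 2014.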

"One significant requirement is that we group search results by parent document"

You can group by any attribute.
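A sketch of what a grouped query might look like in SphinxQL, again assuming a hypothetical `pages` index with the parent document stored in a `doc_id` attribute. As I understand it, a grouped result set contains one best match per group, which is the "first match per parent document" behaviour asked for. The `build_grouped_query` helper and the `pages_matched` alias are invented for this example.

```python
def build_grouped_query(index="pages", group_attr="doc_id", limit=20):
    """Return SphinxQL that yields one row per parent document."""
    return (
        f"SELECT id, {group_attr}, COUNT(*) AS pages_matched "
        f"FROM {index} WHERE MATCH(%s) "
        f"GROUP BY {group_attr} LIMIT {int(limit)}"
    )
```

`pages_matched` tells you how many pages matched within each parent, which could be shown next to a "search within this document" link; that drill-down is then an ordinary filtered query on a single `doc_id`.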

Answered 2014-05-09T10:42:53.307