6

I'm working on a project where I will have a LOT of data. It will be searchable through several forms that are very efficiently expressed as SQL queries, but it also needs to be searchable via natural language processing.

My plan is to build an index using Lucene for this form of search.

My question is this: when I perform a search, Lucene returns the IDs of the matching documents in the index, and I then have to look up those entities in the relational database.

This could be done in two ways (that I can think of so far):

  • N separate queries, one per ID (horrible)
  • Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). This has the downside of being limited by the maximum parameter size, plus the slow performance of a UDF that splits the string into a temporary table.
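As a rough illustration of the "all IDs at once" option without a string-splitting UDF, the lookup can be issued as a parameterized `IN (...)` query in chunks. This is only a sketch under assumptions: `sqlite3` stands in for the real database, and the `documents` table, the `fetch_by_ids` helper, and the chunk size are hypothetical names chosen for the example.

```python
import sqlite3

def fetch_by_ids(conn, ids, chunk_size=500):
    """Fetch rows for the given ids in a few batched queries.

    Results come back in the same order as `ids` (the search ranking);
    ids that are missing from the table are skipped.
    """
    rows = {}
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One "?" placeholder per id keeps the query parameterized.
        placeholders = ",".join("?" * len(chunk))
        sql = f"SELECT id, title FROM documents WHERE id IN ({placeholders})"
        for row_id, title in conn.execute(sql, chunk):
            rows[row_id] = title
    return [(i, rows[i]) for i in ids if i in rows]

# Hypothetical schema and data, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(i, f"doc {i}") for i in range(1, 11)],
)

# Pretend these ids came back from the search index, in ranked order.
results = fetch_by_ids(conn, [7, 2, 9], chunk_size=2)
```

Chunking keeps each statement under whatever parameter limit the database imposes, at the cost of a handful of round trips instead of one.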

I'm almost tempted to mirror everything into Lucene's index, so that I can periodically regenerate the index from the backing store but only need to access the index for the frontend.

Advice?

4

4 Answers

4

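I would store the "frontend" data inside the index itself, avoiding any database interaction. The database would be queried only when you need more information about a specific record.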
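To make the idea concrete, here is a toy sketch (not Lucene, and not its API) of an index that keeps display fields next to the searchable terms, so a result list can be rendered without touching the database. `TinyIndex` and its field names are hypothetical.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index that stores display fields alongside postings."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.stored = {}                  # document id -> display fields

    def add(self, doc_id, text, **stored_fields):
        # Index the searchable text term by term.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # Keep only what the result list needs to show.
        self.stored[doc_id] = stored_fields

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        # A document matches only if it contains every query term.
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [dict(id=i, **self.stored[i]) for i in sorted(ids)]

index = TinyIndex()
index.add(1, "lucene index tuning", title="Tuning Lucene")
index.add(2, "sql query tuning", title="Tuning SQL")

# Each hit already carries its title, so no database lookup is needed.
hits = index.search("tuning lucene")
```

Only a click through to the full record would then require a database query by `id`.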

Answered 2010-11-12T08:16:30.337
2

When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built-in full-text support, including stemming and thesaurus support). This way the database can be queried with both SQL and full-text commands. The downside is that you need a DB with full-text search capabilities, and those capabilities might be inferior to what Lucene can do.

Answered 2009-06-13T14:51:54.060
1

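I guess the answer depends on what you are going to do with the results. If you are going to display the results in a grid and let the user choose the exact document to open, then you may want to add enough text to the index to help the user identify the document, such as a 200-character blurb. Once the user selects a document, you can hit the database to retrieve the whole thing.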

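This will definitely affect the size of the index, so that is another consideration to keep in mind. I would also put a cache between the database and the front end, so that the most frequently used items do not incur the full cost of a database access every time.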
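A minimal sketch of such a cache, assuming a single process and using Python's `functools.lru_cache` in front of the database lookup. The `documents` table and the `load_document` function are hypothetical, and `sqlite3` stands in for the real database.

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema and data, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'full text of document 1')")

@lru_cache(maxsize=1024)
def load_document(doc_id):
    """Return the full body for doc_id, hitting the database only on a cache miss."""
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None

first = load_document(1)   # cache miss: goes to the database
second = load_document(1)  # cache hit: served from memory
stats = load_document.cache_info()
```

An LRU policy evicts the least recently used entries first, so popular documents tend to stay in memory. Entries are never refreshed here, so a real setup would also need some form of invalidation.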

Answered 2011-04-12T22:53:26.370
0

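This might not be an option depending on how much is in your database, but what I did was store the db id in the search index along with the properties I wanted indexed. Then, in my service class, I cache all the data needed to display search results for all the objects (e.g. name, db id, image URLs, description blurb, social media info). The service class returns a Dictionary that looks objects up by db id, and I use the ids returned by Lucene.NET to pull the data from the in-memory cache.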

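You could also forgo the in-memory cache and store all the properties needed to display a search result in the search index itself. I didn't do that because the in-memory cache is also used in scenarios other than search.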

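The in-memory cache is always fresh to within a few hours, and the only time I need to hit the database is when I need more detailed data for a single object (for example, when a user clicks the link to go to that object's page).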
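The approach above can be sketched roughly as follows. This is an illustration in Python rather than the original Lucene.NET/C# code; `TimedCache`, `load_all`, and the three-hour refresh interval are assumptions made for the example.

```python
import time

class TimedCache:
    """Dictionary of display data keyed by db id, reloaded in bulk when it gets too old."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # callable returning {db_id: display_data}
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the refresh can be tested
        self._data = {}
        self._loaded_at = None

    def get(self, db_id):
        now = self._clock()
        # Reload everything on first use or once the data is older than max_age.
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._data = self._loader()
            self._loaded_at = now
        return self._data.get(db_id)

loads = []

def load_all():
    # Stand-in for one bulk database query that fetches display data for all objects.
    loads.append(1)
    return {42: {"name": "Widget", "image_url": "/img/42.png"}}

now = [0.0]  # fake clock value, in seconds
cache = TimedCache(load_all, max_age_seconds=3 * 3600, clock=lambda: now[0])

# Pretend these ids came back from the search index.
search_hit_ids = [42, 99]
display = [cache.get(i) for i in search_hit_ids]
```

Ids that the index returns but the cache does not know about come back as `None`, so the caller can decide whether to skip them or fall back to the database.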

Answered 2013-01-25T17:02:05.057