search - Strategies for searching across disparate data sources

Question

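I am building a tool that searches for people based on a number of attributes. The values for these attributes are scattered across several systems.

For example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales territory assignment is stored in some horrible legacy database. Other attributes are stored in a system that is only accessible over an XML web service.

To make matters worse, the legacy database and the web service can be really slow.

What strategies and tips should I consider for implementing a search across all these systems?

Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.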

score 4 · Accepted Answer

You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.

Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.

Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.

score 1 · Accepted Answer

While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and another between the application and the channel. The integration piece does the work of making the actual query and formatting the results. We had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming their results into a normalized bit of XML, and returning the results to the application.

This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source notify the application when, for example, its data was updated.
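To make the channel idea a bit more concrete, here is a minimal sketch in plain Python. It is not Mule's API and not what we actually ran; `Channel`, `sql_adapter`, `web_service_adapter`, and the sample data are all hypothetical names made up for illustration. The point is only that each adapter hides how its source is queried and returns records in one normalized shape, so the application talks to the channel alone.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

Record = Dict[str, str]
Adapter = Callable[[str], List[Record]]


class Channel:
    """Hypothetical stand-in for an ESB channel: fans a query out to adapters."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def connect(self, name: str, adapter: Adapter) -> None:
        # Adding a data source is just registering one more adapter.
        self._adapters[name] = adapter

    def query(self, person_id: str) -> List[Record]:
        results: List[Record] = []
        for name, adapter in self._adapters.items():
            for record in adapter(person_id):
                # Tag every normalized record with the source it came from.
                results.append({"source": name, **record})
        return results


def sql_adapter(person_id: str) -> List[Record]:
    # Assumption: pretend these rows came back from a SQL query.
    rows = [("7", "1980-01-01")]
    return [{"person_id": pid, "dateOfBirth": dob}
            for pid, dob in rows if pid == person_id]


def web_service_adapter(person_id: str) -> List[Record]:
    # Assumption: pretend this XML is the reply from a remote web service.
    xml_reply = f"<person id='{person_id}'><territory>EMEA</territory></person>"
    root = ET.fromstring(xml_reply)
    return [{"person_id": root.get("id"),
             "salesTerritory": root.findtext("territory")}]
```

The application would call `channel.connect("sql", sql_adapter)` once per source and then only ever use `channel.query(...)`, which is what let us add sources without touching the application.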

It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.

score 1 · Accepted Answer

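If you can get away with a restrictive search, start by returning a list based on the search criteria that correspond to the fastest data source. Then join those records up with the other systems and remove the records that don't match the search criteria.

If you have to implement OR logic, this approach is not going to work.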
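A rough sketch of that filter-first idea, assuming AND logic and assuming the slow system can be asked about one person at a time. The in-memory dictionaries and the function names (`search_fast`, `lookup_territory`, `search`) are hypothetical placeholders for the real systems.

```python
from typing import List

# Hypothetical stand-ins for the real systems.
FAST_SOURCE = {
    1: {"dateOfBirth": "1980-01-01"},
    2: {"dateOfBirth": "1980-01-01"},
    3: {"dateOfBirth": "1975-06-30"},
}
SLOW_SOURCE = {1: "EMEA", 2: "APAC", 3: "EMEA"}


def search_fast(date_of_birth: str) -> List[int]:
    """Query the fastest source first to get a small candidate list."""
    return [pid for pid, row in FAST_SOURCE.items()
            if row["dateOfBirth"] == date_of_birth]


def lookup_territory(person_id: int) -> str:
    """Simulates a per-record lookup in the slow legacy system."""
    return SLOW_SOURCE[person_id]


def search(date_of_birth: str, territory: str) -> List[int]:
    candidates = search_fast(date_of_birth)
    # Only the candidates ever touch the slow system; the rest are never fetched.
    return [pid for pid in candidates if lookup_territory(pid) == territory]
```

The slow system is hit once per candidate rather than once per person, which only pays off when the fast criteria narrow the list substantially.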

score 0 · Accepted Answer

Have you looked at YQL? It might not be the perfect solution, but it might give you a starting point to work from.

score 0 · Accepted Answer

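Have you thought of moving the data into a separate structure?

For example, Lucene stores the data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it into a Lucene index. Your searches could run against this index, and the search results could contain a unique identifier and the system each record came from.

http://lucene.apache.org/java/docs/ (there are implementations in other languages as well)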
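To show what "inverted index plus source identifier" means, here is a toy in-memory version in Python. It is not Lucene and does not use Lucene's API; `TinyInvertedIndex` and the sample field names are invented for illustration, and a real index would add analysis, scoring, and persistence.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A hit is (source system, record id), so each result says where it came from.
DocKey = Tuple[str, str]


class TinyInvertedIndex:
    def __init__(self) -> None:
        # Maps (field, token) to the set of records containing that token.
        self._postings: Dict[Tuple[str, str], Set[DocKey]] = defaultdict(set)

    def add(self, source: str, record_id: str, fields: Dict[str, str]) -> None:
        """Index one record pulled from one of the source systems."""
        for field, value in fields.items():
            for token in value.lower().split():
                self._postings[(field, token)].add((source, record_id))

    def search(self, field: str, term: str) -> List[DocKey]:
        """Return the sorted (source, record id) pairs matching one term."""
        return sorted(self._postings.get((field, term.lower()), set()))
```

A separate loader program would call `add` for every record it pulls from each system, and the search tool would only ever call `search`, then use the returned identifiers to fetch full details if needed.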

score 0 · Accepted Answer

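Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/

Create a batch script that runs nightly and updates your local copy. Maybe even every hour. Then write your queries against the local MySQL database and display the results.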
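Kettle does the copy step without hand-written code, so the following is only an illustration of what the refresh job amounts to. It uses Python's built-in `sqlite3` as a stand-in for the local MySQL database, and the table name `person_search`, its columns, and both function names are assumptions made up for this sketch.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[int, str, str]  # (person_id, date_of_birth, territory)


def refresh_local_copy(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Upsert rows pulled from the source systems into the local search table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person_search ("
        "person_id INTEGER PRIMARY KEY, date_of_birth TEXT, territory TEXT)"
    )
    # INSERT OR REPLACE is SQLite syntax; it overwrites a row with the same key.
    conn.executemany(
        "INSERT OR REPLACE INTO person_search VALUES (?, ?, ?)", rows
    )
    conn.commit()


def find_by_territory(conn: sqlite3.Connection, territory: str) -> List[int]:
    """Search runs only against the local copy, never the slow systems."""
    cur = conn.execute(
        "SELECT person_id FROM person_search WHERE territory = ? "
        "ORDER BY person_id",
        (territory,),
    )
    return [row[0] for row in cur.fetchall()]
```

A scheduler such as cron would run the refresh nightly or hourly; search results are then only as fresh as the last run.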

score 0 · Accepted Answer

Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
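As a minimal sketch of the parallel part, using Python's standard `concurrent.futures`: the three query functions are hypothetical fakes that sleep to mimic slow systems, and combining the results by intersection assumes AND semantics across sources.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Set


def query_sql_server(criteria: dict) -> Set[int]:
    time.sleep(0.05)  # fake latency for the fast source
    return {1, 2, 3}


def query_legacy_db(criteria: dict) -> Set[int]:
    time.sleep(0.2)  # fake latency for the slow legacy database
    return {2, 3, 4}


def query_web_service(criteria: dict) -> Set[int]:
    time.sleep(0.2)  # fake latency for the slow web service
    return {3, 4, 5}


SOURCES: Dict[str, Callable[[dict], Set[int]]] = {
    "sql_server": query_sql_server,
    "legacy_db": query_legacy_db,
    "web_service": query_web_service,
}


def parallel_search(criteria: dict) -> Set[int]:
    # One worker thread per source, so the slow calls overlap in time.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(fn, criteria) for fn in SOURCES.values()]
        results = [future.result() for future in futures]
    # Assumption: a person must match in every source (AND logic).
    return set.intersection(*results)
```

With this shape the total wait is roughly that of the slowest source rather than the sum of all of them. Threads are enough here because the work is I/O-bound.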

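You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.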

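You have the option of creating an aggregation service or middleware that aggregates all the different systems, so that you can provide a single interface for querying. If you do that, this is where the caching and parallelization optimizations I mentioned earlier would live.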

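However, with all of that, you will need to weigh the development time, deployment time, and long-term benefit of this effort against migrating the old legacy database to a faster, more modern one. You haven't said how tightly those databases are tied into other systems, so migration may not be a very viable option in the short term.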

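EDIT: in response to data going out of date. You can consider caching if your data doesn't need to match the databases in real time at all times. Also, if some of the data doesn't change very often (dates of birth, for example), then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include in or exclude from the cache, and you could give each table/column its own cache timeout, with an overall default.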
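A small sketch of what such a configurable cache could look like in Python. `AttributeCache`, its parameters, and the column names are all hypothetical; the injectable `clock` is there only to make the expiry behaviour easy to test.

```python
import time
from typing import Any, Callable, Dict, Optional, Set, Tuple


class AttributeCache:
    """Per-column TTL cache with an overall default and an exclusion list."""

    def __init__(
        self,
        default_ttl: float,
        ttl_overrides: Optional[Dict[str, float]] = None,
        excluded: Optional[Set[str]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._default_ttl = default_ttl
        self._ttl_overrides = ttl_overrides or {}
        self._excluded = excluded or set()
        self._clock = clock
        # Maps (column, person_id) to (expiry time, cached value).
        self._entries: Dict[Tuple[str, int], Tuple[float, Any]] = {}

    def get(self, column: str, person_id: int, loader: Callable[[], Any]) -> Any:
        if column in self._excluded:
            return loader()  # excluded columns always go to the source
        key = (column, person_id)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the slow lookup
        value = loader()
        ttl = self._ttl_overrides.get(column, self._default_ttl)
        self._entries[key] = (now + ttl, value)
        return value
```

For example, `AttributeCache(default_ttl=60, ttl_overrides={"dateOfBirth": 86400}, excluded={"salesTerritory"})` would keep dates of birth for a day, other columns for a minute, and never cache the sales territory.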
