I'm building a large Lucene index, and every document I insert needs a bit of "composing" before it goes in. I'm reading all of the documents from a database and inserting them into the index. Lucene lets you build several separate indexes and merge them together later, so I came up with this:
// we'll use a producer/consumer pattern for the job
var documents = new BlockingCollection<Document>();
// we'll have a pool of index writers (each will create its own index)
var indexWriters = new ConcurrentBag<IndexWriter>();
// start filling the collection with documents
Task writerTask = new Task(() => {
    foreach(var document in database)
        documents.Add(document);
    // let the consumers know no more documents are coming
    documents.CompleteAdding();
}, TaskCreationOptions.LongRunning);
writerTask.Start();
// iterate through the collection, obtaining index writers from the pool and
// creating them when necessary.
Parallel.ForEach(documents.GetConsumingEnumerable(token.Token), document =>
{
    IndexWriter writer;
    if(!indexWriters.TryTake(out writer))
    {
        // nothing available in the pool: create a new index in its own directory
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        var dir = FSDirectory.Open(dirInfo);
        writer = new IndexWriter(dir, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }
    // prepare and insert the document into the current index
    WriteDocument(writer, document);
    indexWriters.Add(writer); // put the writer back in the pool
});
// now get all of the writers and merge the indexes together...
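For completeness, the merge step at the end would be roughly this. It's only a sketch: I'm assuming Lucene.NET's AddIndexesNoOptimize here, and the writer's Directory member is GetDirectory() in older Lucene.NET versions:

// drain the pool, closing each partial index and remembering its directory
var partialDirs = new List<Lucene.Net.Store.Directory>();
IndexWriter partialWriter;
while (indexWriters.TryTake(out partialWriter))
{
    partialDirs.Add(partialWriter.Directory); // GetDirectory() in older versions
    partialWriter.Close();
}
// merge every partial index into the final index (single-threaded)
var finalWriter = new IndexWriter(FSDirectory.Open(new DirectoryInfo(_indexPath)),
    getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
finalWriter.AddIndexesNoOptimize(partialDirs.ToArray());
finalWriter.Optimize(); // optional: merge down to a single segment
finalWriter.Close();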
The only thing that gives me pause is that pulling an IndexWriter out of the pool on every iteration (and putting it back at the end) may be less efficient than just spinning up the optimal number of threads to begin with, though I also know that ConcurrentBag is very efficient and its handling overhead is minimal.
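For what it's worth, the alternative I'm weighing it against would look roughly like this: the Parallel.ForEach overload with localInit/localFinally gives each worker its own IndexWriter, so no pool is needed. A rough sketch only; note that localInit can run more than once per thread if the loop spins up extra tasks, so you may still end up with more than one index per thread:

// each worker gets its own writer via localInit; no pool contention
var finishedWriters = new ConcurrentBag<IndexWriter>();
Parallel.ForEach(
    documents.GetConsumingEnumerable(token.Token),
    // localInit: runs once per worker, creating a dedicated index
    () => {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        return new IndexWriter(FSDirectory.Open(dirInfo), getAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    },
    // body: index the document with this worker's writer
    (document, loopState, writer) => {
        WriteDocument(writer, document);
        return writer;
    },
    // localFinally: runs once per worker when it finishes
    writer => finishedWriters.Add(writer));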
Is my solution sound, or does it call for a better approach?
Update:
After some testing, it turns out that loading from the database is actually somewhat slower than the indexing itself. Also, the final index merge is slow, because I can only use one thread and I'm merging 16 indexes with roughly 1.7 million documents. Still, I'm open to thoughts on the original question.
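On the merge point: everything currently funnels into a single AddIndexesNoOptimize call. One idea I'm toying with (an untested sketch, with a hypothetical MergePair helper) is a pairwise reduction that merges two indexes at a time on separate threads until one index remains:

// pairwise merge: repeatedly merge index pairs in parallel until one remains
List<DirectoryInfo> MergeAll(List<DirectoryInfo> indexes)
{
    while (indexes.Count > 1)
    {
        var merged = new ConcurrentBag<DirectoryInfo>();
        // pair the indexes off: (0,1), (2,3), ... and merge each pair in parallel
        Parallel.For(0, indexes.Count / 2, i =>
        {
            merged.Add(MergePair(indexes[2 * i], indexes[2 * i + 1]));
        });
        // an odd index out is carried to the next round unmerged
        if (indexes.Count % 2 == 1)
            merged.Add(indexes[indexes.Count - 1]);
        indexes = merged.ToList();
    }
    return indexes;
}

// hypothetical helper: merge two partial indexes into a fresh directory
DirectoryInfo MergePair(DirectoryInfo a, DirectoryInfo b)
{
    var target = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
    target.Create();
    var writer = new IndexWriter(FSDirectory.Open(target), getAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.AddIndexesNoOptimize(FSDirectory.Open(a), FSDirectory.Open(b));
    writer.Close();
    return target;
}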