.net - How efficient is this parallelized code? Is there a better approach?

Question

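I'm building a large Lucene index, and each document I insert needs some "assembling" before it goes in. I'm reading all of the documents from a database and inserting them into the index. Lucene lets you build several separate indexes and merge them together later, so I came up with this: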

// we'll use a producer/consumer pattern for the job
var documents = new BlockingCollection<Document>();

// we'll have a pool of index writers (each will create its own index)
var indexWriters = new ConcurrentBag<IndexWriter>();

// start filling the collection with documents
Task writerTask = new Task(() => {
    foreach (var document in database)
        documents.Add(document);
    // signal the consumers that no more documents are coming
    documents.CompleteAdding();
}, TaskCreationOptions.LongRunning);
writerTask.Start();

// iterate through the collection, obtaining index writers from the pool and
// creating them when necessary.
Parallel.ForEach(documents.GetConsumingEnumerable(token.Token), document =>
{
    IndexWriter writer;
    if(!indexWriters.TryTake(out writer))
    {
        var dirInfo = new DirectoryInfo(string.Concat(_indexPath, "\\~", Guid.NewGuid().ToString("N")));
        dirInfo.Create();
        var dir = FSDirectory.Open(dirInfo);
        // assign to the outer variable so the new writer is the one used and pooled below
        writer = new IndexWriter(dir, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }
    // prepare and insert the document into the current index
    WriteDocument(writer, document);
    indexWriters.Add(writer); // put the writer back in the pool
});

// now get all of the writers and merge the indexes together...

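The only thing that gives me pause is that pulling an IndexWriter from the pool on every iteration (and putting it back at the end) may be less efficient than simply creating the optimal number of threads to begin with. On the other hand, I know that ConcurrentBag is very efficient and adds very little overhead.

Is my solution OK? Or does it cry out for a better one?

Update:

After some testing, loading from the database turns out to be a bit slower than the actual indexing. The final index merge is also slow, because I can only use one thread for it and I'm merging 16 indexes holding about 1.7 million documents. Still, I'm open to thoughts on the original question.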

score 1 · Accepted Answer

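One problem I see with Parallel.ForEach is that when CPU utilization is low, it can decide to add threads beyond the usual one per core. That makes sense for tasks that are waiting on a response from a remote server, but for a slow, disk-intensive process it can sometimes hurt performance, because the disk is now thrashing.

If your processing is disk-bound rather than CPU-bound, you might want to try passing a ParallelOptions with MaxDegreeOfParallelism set to your Parallel.ForEach, to make sure it doesn't thrash the disk unnecessarily.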
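
A minimal sketch of that suggestion, under some assumptions: the class `BoundedConsumer`, the method `Run`, and the parameter `maxWorkers` are hypothetical names, and the `work` delegate stands in for the question's `WriteDocument` call so the example runs without Lucene. It only illustrates how the cap is passed in; the right value for a given disk is not something this sketch can tell you and would need to be measured.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedConsumer
{
    // Drains the collection with at most maxWorkers concurrent calls to work.
    // Returns the highest number of simultaneous calls that was observed.
    public static int Run<T>(BlockingCollection<T> items, int maxWorkers, Action<T> work)
    {
        // Cap how many iterations Parallel.ForEach may run at the same time.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };

        int active = 0; // calls currently in flight
        int peak = 0;   // highest value of active seen so far

        Parallel.ForEach(items.GetConsumingEnumerable(), options, item =>
        {
            int now = Interlocked.Increment(ref active);

            // Raise peak to now if now is larger; retry if another thread raced us.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }

            try
            {
                work(item); // hypothetical stand-in for WriteDocument(writer, document)
            }
            finally
            {
                Interlocked.Decrement(ref active);
            }
        });

        return peak;
    }
}
```

Calling `BoundedConsumer.Run(documents, 4, doc => ...)` on a collection whose producer eventually calls `CompleteAdding()` should return a value no greater than 4, which gives a quick way to check that the cap is being respected before trying different values against the real index.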
