multithreading - Massive multithreading operations

Question

Edited with new code below

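I'm new to multithreading, but to reach my goal quickly and learn something new along the way, I decided to do it with a multithreaded app.

Goal: parse a huge number of strings from a file and save each word into a SQLite database using Core Data. Huge, because the word count is around 300,000...

So here is my approach.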

Step 1. Parse all the words in the file and put them into one huge NSArray. (Done quickly)

Step 2. Create an NSOperationQueue and add one NSBlockOperation per word to it.

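The main problem is that the process starts very fast but soon slows down. I'm using an NSOperationQueue with the maximum concurrent operation count set to 100. I have a Core 2 Duo processor (dual core without HT).

I noticed that creating NSOperations through NSOperationQueue has a lot of overhead (with the queue suspended, it takes about 3 minutes just to create the 300k NSOperations). When I start the queue, CPU usage goes up to 170%.

I also tried removing the NSOperationQueue and using GCD (the 300k loop then finishes instantly, see the commented lines), but CPU usage is only 95% and the problem is the same as with NSOperations: very soon the process slows down.

Any tips for doing this well?

Here is some code (original question code):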

- (void)inserdWords:(NSArray *)words insideDictionary:(Dictionary *)dictionary {
    NSDate *creationDate = [NSDate date];

    __block NSUInteger counter = 0;

    NSArray *dictionaryWords = [dictionary.words allObjects];
    NSMutableSet *coreDataWords = [NSMutableSet setWithCapacity:words.count];

    NSLog(@"Begin Adding Operations");

    for (NSString *aWord in words) {

        void(^wordParsingBlock)(void) = ^(void) {
            @synchronized(dictionary) {
                NSManagedObjectContext *context = [(PRDGAppDelegate*)[[NSApplication sharedApplication] delegate] managedObjectContext];                

                [context lock];

                Word *toSaveWord = [NSEntityDescription insertNewObjectForEntityForName:@"Word" inManagedObjectContext:context];
                [toSaveWord setCreated:creationDate];
                [toSaveWord setText:aWord];
                [toSaveWord addDictionariesObject:dictionary];

                [coreDataWords addObject:toSaveWord];
                [dictionary addWordsObject:toSaveWord];

                [context unlock];

                counter++;
                [self.countLabel performSelectorOnMainThread:@selector(setStringValue:) withObject:[NSString stringWithFormat:@"%lu/%lu", counter, words.count] waitUntilDone:NO];

            }
        };

        [_operationsQueue addOperationWithBlock:wordParsingBlock];
//        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
//        dispatch_async(queue, wordParsingBlock);
    }
    NSLog(@"Operations Added");
}

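Thanks in advance.

Edit...

Thanks to Stephen Darlington I rewrote my code and found where the problem was. The most important thing: don't share Core Data objects between threads... which means don't mix Core Data objects retrieved from different contexts.

That was what made me use @synchronized(dictionary), which caused the slow execution! Then I removed the massive NSOperation creation by using just MAXTHREADS instances (2 or 4 instead of 300k... a huge difference).

Now I can parse 300k+ strings in just 30 to 40 seconds. Impressive!! I still have some issues (it seems to parse more words than it should with just 1 thread, and it doesn't parse all the words when there is more than 1 thread... I need to figure that out), but the code is now really efficient. Maybe the next step could be using OpenCL and pushing the work to the GPU :)

Here is the new code: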

- (void)insertWords:(NSArray *)words forLanguage:(NSString *)language {
    NSDate *creationDate = [NSDate date];
    NSPersistentStoreCoordinator *coordinator = [(PRDGAppDelegate*)[[NSApplication sharedApplication] delegate] persistentStoreCoordinator];

    // The number of words to be parsed by the single thread.
    NSUInteger wordsPerThread = (NSUInteger)ceil((double)words.count / (double)MAXTHREADS);

    NSLog(@"Start Adding Operations");
    // Here I minimized the number of threads. Every thread will parse and convert a finite number of words instead of 1 word per thread.
    for (NSUInteger threadIdx = 0; threadIdx < MAXTHREADS; threadIdx++) {

        // The NSBlockOperation.
        void(^threadBlock)(void) = ^(void) {
            // A new Context for the current thread.
            NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
            [context setPersistentStoreCoordinator:coordinator];

            // Dictionary now is in accordance with the thread context.
            Dictionary *dictionary = [PRDGMainController dictionaryForLanguage:language usingContext:context];

            // Stats variables, needed to update the UI.
            NSTimeInterval beginInterval = [[NSDate date] timeIntervalSince1970];
            NSUInteger operationPerInterval = 0;

            // The core of the NSOperation. It creates a Core Data Word for each string.
            for (NSUInteger wordIdx = 0; wordIdx < wordsPerThread && wordsPerThread * threadIdx + wordIdx < words.count; wordIdx++) {
                // The String to convert
                NSString *aWord = [words objectAtIndex:wordsPerThread * threadIdx + wordIdx];

                // Some Exceptions to skip certain words.
                if (...) {
                    continue;
                }

                // CoreData Conversion.
                Word *toSaveWord = [NSEntityDescription insertNewObjectForEntityForName:@"Word" inManagedObjectContext:context];
                [toSaveWord setCreated:creationDate];
                [toSaveWord setText:aWord];
                [toSaveWord addDictionariesObject:dictionary];

                operationPerInterval++;

                NSTimeInterval endInterval = [[NSDate date] timeIntervalSince1970];

                // Update case.
                if (endInterval - beginInterval > UPDATE_INTERVAL) {

                    NSLog(@"Thread %lu Processed %lu words", threadIdx, wordIdx);

                    // UI Update. It will be updated only by the first queue.
                    if (threadIdx == 0) {

                        // UI Update code.
                    }
                    beginInterval = endInterval;
                    operationPerInterval = 0;
                }
            }

            // When the NSOperation goes to finish the CoreData thread context is saved.
            [context save:nil];
            NSLog(@"Operation %lu finished", threadIdx);
        };

        // Add the NSBlockOperation to queue.
        [_operationsQueue addOperationWithBlock:threadBlock];
    }
    NSLog(@"Operations Added");
}
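
The index arithmetic in the loop above can be checked in isolation. The following is a minimal sketch in plain C (which also compiles as Objective-C), assuming the same ceiling-division scheme as `wordsPerThread`; the names `WordRange`, `words_per_worker`, `range_for_worker` and `total_covered` are hypothetical and do not come from the original code. It computes the half-open index range each worker would handle and sums the range sizes, so it is easy to confirm that every word index is covered exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [start, end) of the words handled by one worker. */
typedef struct {
    size_t start;
    size_t end;
} WordRange;

/* Integer ceiling division, equivalent to ceil((double)count / workers). */
size_t words_per_worker(size_t word_count, size_t worker_count) {
    assert(worker_count > 0);
    return (word_count + worker_count - 1) / worker_count;
}

/* Range of word indexes for worker `worker_idx`, clamped to the array size. */
WordRange range_for_worker(size_t word_count, size_t worker_count,
                           size_t worker_idx) {
    size_t per_worker = words_per_worker(word_count, worker_count);
    WordRange r;
    r.start = per_worker * worker_idx;
    if (r.start > word_count) {
        r.start = word_count;   /* trailing workers may get an empty range */
    }
    r.end = r.start + per_worker;
    if (r.end > word_count) {
        r.end = word_count;     /* the last non-empty range may be shorter */
    }
    return r;
}

/* Sum of all range sizes; equals word_count when nothing is skipped. */
size_t total_covered(size_t word_count, size_t worker_count) {
    size_t total = 0;
    for (size_t i = 0; i < worker_count; i++) {
        WordRange r = range_for_worker(word_count, worker_count, i);
        total += r.end - r.start;
    }
    return total;
}
```

For 10 words and 4 workers this yields the ranges [0,3), [3,6), [6,9) and [9,10). If the partitioning covers every index, as it does in this sketch, a mismatch in the number of saved words would more likely come from somewhere else, for example the skip conditions or the per-context saves, though that is only a guess without profiling the real application.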

score 2 · Accepted Answer

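Some thoughts:

Setting the maximum concurrent operation count that high is not going to have much effect. With two cores, it's unlikely that more than two will actually run at once
It looks as though you're using the same NSManagedObjectContext for all your operations. That is not good
Let's assume your maximum concurrent operation count really was 100. The bottleneck would then be the main thread, where you try to update the label for every operation. Try updating the main thread every n records instead of every record
You shouldn't need to lock the context if you're using Core Data correctly... which means using a different context for each thread
You never seem to save the context?
Batching up operations is a good way to improve performance... but see the previous point
As you suggest, there is overhead in creating a GCD operation. Creating a new one for every word is probably not optimal. You need to balance the overhead of creating new operations against the benefits of parallelising

In short, threading is hard, even when you use something like GCD.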
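
The "every n records" suggestion above can be sketched as follows. This is a minimal illustration in plain C (also valid Objective-C), not code from the question; `should_report_progress` and `count_progress_updates` are hypothetical names, and the real hop to the main thread is only indicated by a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero when a progress update should be sent to the UI:
   once every `every_n` processed records, plus once at the very end. */
int should_report_progress(size_t processed, size_t total, size_t every_n) {
    assert(every_n > 0);
    return processed == total || processed % every_n == 0;
}

/* Counts how many UI updates a run over `total` records would trigger. */
size_t count_progress_updates(size_t total, size_t every_n) {
    size_t updates = 0;
    for (size_t processed = 1; processed <= total; processed++) {
        if (should_report_progress(processed, total, every_n)) {
            /* In the real app this is where the label update would be
               dispatched to the main thread. */
            updates++;
        }
    }
    return updates;
}
```

With 300,000 records and an interval of 1,000, the main thread receives 300 updates instead of 300,000. The interval is a tuning knob: a larger value means less main-thread traffic but a coarser progress display.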

score 0 · Answer

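It's hard to say without measuring and profiling, but what looks suspicious to me is that you save the full dictionary of words saved so far along with every word you save. So the amount of data per save keeps growing.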

// the dictionary at this point contains all words saved so far
// which each contains a full dictionary
[toSaveWord addDictionariesObject:dictionary];

// add each time so it gets bigger each time
[dictionary addWordsObject:toSaveWord];

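So each save writes more and more data. Why save a dictionary of all the words with every word?

Some other thoughts:

Why build up coreDataWords, which you never use?
I wonder whether you're getting the concurrency you expect, since you synchronize the whole block of work.

Things to try:

Comment out the dictionary on toSaveWord, along with the dictionary you're building up, and try again, to see whether the problem is your data/data structures or the DB/Core Data.
Do the first, but also create a serial version of it to see whether you're actually getting any concurrency benefit.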
