iphone - Regex pattern and/or NSRegularExpression a bit too slow searching over a very large file; can it be optimized?

Question

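In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search runs over the contents of the large file as an NSString. I need to match any word that appears bracketed by a newline and a tab character, and then grab the whole line. For example, if I have the word "monday" in my NSArray, I want to match this line in the dictionary file: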

monday  M AH N D IY

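This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its eventual output. I also need to find alternate pronunciations of the words, which are listed as follows: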

monday(2)   M AH N D EY

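The alternate pronunciations always begin with (2) and can go as high as (5). So I also search for occurrences of the word followed by parentheses containing a single digit, bracketed by a newline and a tab character.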
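
To make the matching rule concrete, here is a minimal sketch of the same check for a single line, written without a regex. The helper name `LineIsEntryForWord` is hypothetical and not part of the code in this question; it only assumes the line format described above (word, optional "(2)" to "(5)", tab, pronunciation).

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when `line` is a dictionary entry for `word`,
// i.e. the word followed by a tab, or by "(2)".."(5)" and then a tab.
static BOOL LineIsEntryForWord(NSString *line, NSString *word) {
    if (![line hasPrefix:word]) return NO;
    NSUInteger i = [word length];
    NSUInteger n = [line length];
    if (i >= n) return NO;              // nothing after the word
    unichar c = [line characterAtIndex:i];
    if (c == '\t') return YES;          // first pronunciation
    // Alternate pronunciation: "(", one digit from 2 to 5, ")", then a tab.
    if (c != '(' || i + 3 >= n) return NO;
    unichar digit = [line characterAtIndex:i + 1];
    return digit >= '2' && digit <= '5'
        && [line characterAtIndex:i + 2] == ')'
        && [line characterAtIndex:i + 3] == '\t';
}
```

A line such as "mondays" followed by a tab is rejected for the word "monday", because the character after the word is neither a tab nor an opening parenthesis.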

I have a 100% working NSRegularExpression method as follows:

NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// This results in this regex pattern:

// ^((friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

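My problem is that, given the large text file, it isn't really fast enough on the iPhone. Eight words take 1.3 seconds on an iPhone 4, which is too long for the app. Given the following known factors:

• The 3.2 MB text file lists the words to match in alphabetical order

• The array of arbitrary words to look up is always in alphabetical order by the time it gets to this method

• Alternate pronunciations start at (2) in parentheses after the word, not (1)

• If there is no (2), there won't be a (3), (4) or more

• The presence of one alternate pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are rarer still.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, I'd be very grateful for any suggestions on improving the performance. I am hoping to generalize this to any pronunciation file, so I'm trying to stay away from solutions such as calculating the alphabetical ranges ahead of time to do more constrained searches.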

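****EDIT****

Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:

dasblinkenlight's create-an-NSDictionary approach, at https://stackoverflow.com/a/11958852/119717: 5.259676 seconds

Ωmega's fastest regex, at https://stackoverflow.com/a/11957535/119717: 0.609593 seconds

dasblinkenlight's multiple-NSRegularExpression approach, at https://stackoverflow.com/a/11969602/119717: 1.255130 seconds

My first hybrid approach, at https://stackoverflow.com/a/11970549/119717: 0.372215 seconds

My second hybrid approach, at https://stackoverflow.com/a/11970549/119717: 0.337549 seconds

The best time so far is the second version of my answer. I can't mark any of the answers as best, since all of the search-related answers informed the approach I took in my version; they were all very helpful, and mine just builds on the others. I learned a lot, and my method ended up taking about a quarter of the original time, so this was enormously helpful. Thank you, dasblinkenlight and Ωmega, for talking it through with me.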

score 4 · Accepted Answer

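Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

• Create a mutable NSDictionary words, with NSString keys and NSMutableArray values
• Read the file into memory
• Go through the string representing the file line by line
• For each line, separate out the word part by searching for a '(' or a '\t' character
• Get a substring for the word (from zero to the index of the '(' or '\t' minus one); this is your key.
• Check whether words contains your key; if it doesn't, add a new NSMutableArray
• Add line to the NSMutableArray that you found/created for that key
• Once you are finished, throw away the original string representing the file.

With this structure in hand, you should be able to do your searches in a time that no regex engine can match, because you have replaced a full-text scan, which is linear, with a hash lookup, which is constant-time.

**EDIT:** I checked the relative speed of this solution against the regex, and it is about 60 times faster on the simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.

Reading the file: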

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos < data.length) { // '<' rather than '!=': pos ends up at length+1 when the last line has no trailing newline
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining;
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
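    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    if (keyRange.location == NSNotFound) {
        continue; // Skip blank or malformed lines that contain no tab or '('.
    }
    keyRange.length = keyRange.location;
    keyRange.location = 0;
    NSString *key = [line substringWithRange:keyRange];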
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %lu", (unsigned long)all.count);

score 4 · Accepted Answer

Try this:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

And also this one (using a positive lookahead with a list of the possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

And finally, a version with some further optimization:

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
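
Since the word list is arbitrary, a pattern like the first one above has to be generated at run time. The following is a minimal sketch of that, not the code used for the timings in the question: the helper names `PronunciationPatternForWords` and `MatchingLines` are hypothetical, and the escaping step is an addition that only matters for words containing regex metacharacters.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: builds ^(?:word1|word2|...)(?:\([2-5]\))?\t.*$
static NSString *PronunciationPatternForWords(NSArray *words) {
    NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[words count]];
    for (NSString *word in words) {
        // Escape metacharacters so that a word containing "." matches literally.
        [escaped addObject:[NSRegularExpression escapedPatternForString:word]];
    }
    return [NSString stringWithFormat:@"^(?:%@)(?:\\([2-5]\\))?\\t.*$",
            [escaped componentsJoinedByString:@"|"]];
}

// Hypothetical helper: returns every matching line of `text`, in file order.
static NSArray *MatchingLines(NSString *text, NSArray *words) {
    NSMutableArray *lines = [NSMutableArray array];
    if ([words count] == 0) return lines; // nothing to look for
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:PronunciationPatternForWords(words)
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) return lines; // invalid pattern; details are in `error`
    NSArray *results = [regex matchesInString:text
                                      options:0
                                        range:NSMakeRange(0, [text length])];
    for (NSTextCheckingResult *result in results) {
        [lines addObject:[text substringWithRange:[result range]]];
    }
    return lines;
}
```

The lookahead and shared-prefix variants would need extra logic to derive the first letters and common prefixes from the word list, which this sketch leaves out.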

score 2 · Accepted Answer

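Here is my hybrid approach combining dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer as well at this point. It uses dasblinkenlight's method of doing a forward search through the string and then performs the full regex on a small range in the event of a hit, so it exploits the fact that the dictionary and the words to look up are both in alphabetical order, and it benefits from the optimized regex. Wish I had two best-answer checks to give out! This gives the correct results and takes about half the time of the pure regex approach on the Simulator (I have to test on the device later to see what the time comparison is on the iPhone 4, which is the reference device):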

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^(?:"];
int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // this is all later rounds
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
    }
    [mutablePatternString appendString:[NSString stringWithFormat:@"%@",word]];
}

[mutablePatternString appendString:@")(?:\\([2-5]\\))?\t.*$"];

// This creates a string that reads "^(?:change|friday|model|monday|quidnunc|saturday|sunday|thursday|tuesday|wednesday)(?:\([2-5]\))?\t.*$"

// We don't want to instantiate the NSRegularExpression in the loop so let's use a pattern that matches everything we're interested in.

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                    options:NSRegularExpressionAnchorsMatchLines
                                                                                      error:nil];
NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {

        // If we find the first pronunciation, run the whole regex on a range of {position, 500} only.

        int rangeLocation = next.location;
        int searchPadding = 500;
        int rangeLength = searchPadding;

        if(data.length - next.location < searchPadding) { // Only use 500 if there is 500 more length in the data.
            rangeLength = data.length - next.location;
        } 

        [regularExpression enumerateMatchesInString:data 
                                            options:0
                                              range:NSMakeRange(rangeLocation, rangeLength)
                                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                             [matches addObject:[data substringWithRange:result.range]];
                                         }]; // Grab all the hits at once.

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutablePatternString release];
[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

// return matches to caller

EDIT: here is another version that uses no regex and shaves a little more time off the method:

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {
        NSRange lineRange = [data lineRangeForRange:NSMakeRange(next.location+1, next.length)];
        [matches addObject:[data substringWithRange:NSMakeRange(lineRange.location, lineRange.length-1)]]; // Grab the whole line of the hit.
        int rangeLocation = next.location;
        int rangeLength = 750;

        if(data.length - next.location < rangeLength) { // Only use the searchPadding if there is that much room left in the string.
            rangeLength = data.length - next.location;
        } 
        rangeLength = rangeLength/5;
        int newlocation = rangeLocation;

        for(int i = 2;i < 6; i++) { // We really only need to do this from 2-5.
            NSRange morematches = [data
                            rangeOfString:[NSString stringWithFormat:@"\n%@(%d",[mutableArrayOfWordsToMatch objectAtIndex:0],i]
                            options:NSLiteralSearch
                            range:NSMakeRange(newlocation, rangeLength)
                            ];
            if(morematches.location != NSNotFound) {
                NSRange moreMatchesLineRange = [data lineRangeForRange:NSMakeRange(morematches.location+1, morematches.length)]; // Plus one because I don't actually want the line break at the beginning.
                 [matches addObject:[data substringWithRange:NSMakeRange(moreMatchesLineRange.location, moreMatchesLineRange.length-1)]]; // Minus one because I don't actually want the line break at the end.
                newlocation = morematches.location;

            } else {
                break;   
            }
        }

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

score 1 · Accepted Answer

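Looking at the dictionary file you provided, I'd say that a reasonable strategy could be to read in the data and put it into any sort of persistent data store.

Read through the file and create an object for each unique word, with n pronunciation strings (where n is the number of unique pronunciations). The dictionary is already in alphabetical order, so if you parse it in the order that you read it, you'll end up with an alphabetical list.

Then you can do a binary search on the data; even with a HUGE number of objects, a binary search will find what you're looking for very quickly (assuming alphabetical order).

You could probably even keep the whole thing in memory if you need lightning-fast performance.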
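
As a rough illustration of the binary search step, here is a minimal sketch over two parallel arrays. The function names and the parallel-array layout are assumptions made for this example, not something this answer prescribes, and the word array is assumed to be sorted consistently with -compare:, which may differ from the file's own ordering.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the index of `word` in `sortedWords`, or
// NSNotFound. Assumes the array is sorted consistently with -compare:.
static NSUInteger IndexOfWord(NSArray *sortedWords, NSString *word) {
    NSUInteger low = 0;
    NSUInteger high = [sortedWords count]; // search the half-open range [low, high)
    while (low < high) {
        NSUInteger mid = low + (high - low) / 2;
        NSComparisonResult order = [[sortedWords objectAtIndex:mid] compare:word];
        if (order == NSOrderedSame) return mid;
        if (order == NSOrderedAscending) {
            low = mid + 1;  // the entry at mid sorts before the word
        } else {
            high = mid;     // the entry at mid sorts after the word
        }
    }
    return NSNotFound;
}

// Hypothetical lookup: sortedWords[i] is a word, and pronunciations[i] is the
// NSArray of pronunciation strings for that word.
static NSArray *PronunciationsForWord(NSArray *sortedWords,
                                      NSArray *pronunciations,
                                      NSString *word) {
    NSUInteger index = IndexOfWord(sortedWords, word);
    if (index == NSNotFound) return [NSArray array];
    return [pronunciations objectAtIndex:index];
}
```

Each lookup takes a number of comparisons logarithmic in the number of unique words, so the cost per word stays small even for a dictionary of this size.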
