26

I'm trying to compare names without any punctuation, spaces, accents, etc. At the moment I'm doing the following:

-(NSString*) prepareString:(NSString*)a {
    //remove any accents and punctuation;
    a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];

    a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
    a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
    a=[a lowercaseString];
    return a;
}

However, I need to do this for hundreds of strings and I need to make it more efficient. Any ideas?

4

13 Answers

81
NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
answered 2009-08-05T08:54:40.117
39

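Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This turns, for example, é (U+00E9) into e followed by a combining acute accent (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters remain.

The reason this matters is that you probably don't want, say, “dän” and “dün”* to be treated as the same. If you strip out all accented letters, as some of these solutions may do, you end up with “dn”, so those two strings compare as equal.

So, you should decompose them first, so that you can strip the accents and leave the letters.

*Example from German. Thanks to Joris Weimar for providing it.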
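For illustration, here is a minimal Swift sketch of the decompose-then-strip idea. The helper names `isLetter` and `lettersOnly` are made up for this example. One assumption worth flagging: as far as I know, `letterCharacterSet` covers Unicode categories L* and M*, so filtering with it would keep the combining accents; the sketch therefore checks each scalar's general category directly instead.

```swift
import Foundation

// Hypothetical helper: true only for Unicode "Letter" categories (L*),
// so combining marks such as U+0301 are rejected.
func isLetter(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
         .modifierLetter, .otherLetter:
        return true
    default:
        return false
    }
}

// Hypothetical helper: decompose first, then keep only the letters.
func lettersOnly(_ name: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in name.decomposedStringWithCanonicalMapping.unicodeScalars
    where isLetter(scalar) {
        result.append(scalar)
    }
    return String(result)
}

print(lettersOnly("dän"))  // dan
print(lettersOnly("dün"))  // dun
```

Because the base letter survives, “dän” and “dün” stay distinct after stripping.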

answered 2009-08-05T15:46:14.710
15

On a similar question, Ole Begemann suggests using stringByFoldingWithOptions:locale:, and I believe this is the best solution here:

NSString *accentedString = @"ÁlgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];

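Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.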
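A rough Swift equivalent with a fixed locale might look like the sketch below. The function name `foldedKey` and the choice of "en_US" are assumptions for the example, and folding case at the same time is my addition (the snippet above only folds diacritics).

```swift
import Foundation

// Fixed locale so the result does not depend on the user's settings.
// "en_US" is an arbitrary choice for this sketch.
let fixedLocale = Locale(identifier: "en_US")

// Hypothetical helper: fold away diacritics and case to build a comparison key.
func foldedKey(_ text: String) -> String {
    return text.folding(options: [.diacriticInsensitive, .caseInsensitive],
                        locale: fixedLocale)
}

print(foldedKey("ÁlgeBra"))  // algebra
```

Computing the key once per name and comparing keys avoids re-folding the same string on every comparison.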

answered 2013-12-04T03:37:21.970
7

If you are trying to compare strings, use one of these methods. Don't try to change the data.

- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale

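You NEED to consider the user's locale to do things right with strings, particularly things like names. In most languages, characters like ä and å are not the same, other than that they look similar. They are inherently distinct characters with a meaning distinct from other characters, but the actual rules and semantics differ from locale to locale.

The correct way to compare and sort strings is to take the user's locale into account. Anything else is naive, wrong, and very 1990s. Stop doing it.

If you are trying to pass data to a system that cannot support non-ASCII, then this is the wrong thing to do. Pass it as a data blob.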
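As a sketch of comparing without modifying the data, the Swift counterpart of compare:options:range:locale: can be wrapped like this. The wrapper name `sameName` is hypothetical, and which letters a given locale treats as equal is left entirely to Foundation.

```swift
import Foundation

// Hypothetical wrapper: compares two names in place, ignoring case and
// diacritics, without producing modified copies of either string.
func sameName(_ a: String, _ b: String, locale: Locale = .current) -> Bool {
    let result = a.compare(b,
                           options: [.caseInsensitive, .diacriticInsensitive],
                           range: nil,
                           locale: locale)
    return result == .orderedSame
}

let english = Locale(identifier: "en_US")
print(sameName("Renée", "RENEE", locale: english))  // true
print(sameName("Renée", "Rene", locale: english))   // false
```

Note that this only handles case and accents; punctuation and spaces would still need separate treatment.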

https://developer.apple.com/library/ios/documentation/cocoa/Conceptual/Strings/Articles/SearchingStrings.html

Plus, normalize your strings first (see Peter Hosey's post), either precomposed or decomposed. Basically, pick one normalized form.

- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping

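No, it's not nearly as simple and easy as we tend to think. Yes, it requires informed and careful decision-making. (And a bit of experience with languages other than English helps.)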

answered 2013-12-10T06:34:14.943
7

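One important clarification on BillyTheKid18756's answer (Luiz corrected this, but it is not obvious from the explanation of the code):

DO NOT use stringWithCString as the second step for removing accents. It can add unwanted characters at the end of your string, because the NSData is not NULL-terminated (as stringWithCString expects it to be). Or use it and add an extra NULL byte to your NSData, as Luiz did in his code.

I think a simpler answer is to replace: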

NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];

with:

NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

Taking BillyTheKid18756's code again, here is the complete corrected code:

// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";

// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];

// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];

// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
answered 2012-07-26T10:29:16.330
4

Consider using the RegexKit framework. You could do something like:

NSString *searchString      = @"This is neat.";
NSString *regexString       = @"[\\W]"; // the backslash must be escaped inside an Objective-C string literal
NSString *replaceWithString = @"";
NSString *replacedString    = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];

NSLog (@"%@", replacedString);
//... Thisisneat
answered 2009-08-05T08:12:39.390
4

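To give a complete example, combining the answers from Luiz and Peter and adding a few lines, you get the code below.

The code does the following:

  1. Creates a set of accepted characters
  2. Turns accented letters into normal letters
  3. Removes characters not in the set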

Objective-C

// The input text
NSString *text = @"BûvérÈ!@$&%^&(*^(_()-*/48";

// Create set of accepted characters
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];

// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
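// Caution: [sanitizedData bytes] is not NUL-terminated, so stringWithCString: may read past the end of the buffer
NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];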

// Remove characters not in the set
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];

Swift (2.2) example

let text = "BûvérÈ!@$&%^&(*^(_()-*/48"

// Create set of accepted characters
let acceptedCharacters = NSMutableCharacterSet()
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.letterCharacterSet())
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.decimalDigitCharacterSet())
acceptedCharacters.addCharactersInString(" _-.!")

// Turn accented letters into normal letters (optional)
let sanitizedData = text.dataUsingEncoding(NSASCIIStringEncoding, allowLossyConversion: true)
let sanitizedText = String(data: sanitizedData!, encoding: NSASCIIStringEncoding)

// Remove characters not in the set
let components = sanitizedText!.componentsSeparatedByCharactersInSet(acceptedCharacters.invertedSet)
let output = components.joinWithSeparator("")

Output

The output for both examples is: BuverE!_-48

answered 2012-02-15T11:48:38.777
4

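Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).

You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could simplify having to remove or replace accents, so you can focus on removing punctuation, whitespace, etc.

If you must use an approach like the one in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range:. That will be much more efficient than creating lots of nearly identical autoreleased strings. It could be that just reducing the number of allocations will boost performance "enough" for the time being.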
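To make the last suggestion concrete, here is a hedged Swift sketch that edits a single NSMutableString in place rather than allocating a new string per replacement. The function name `prepare` and the list of unwanted characters are taken from the question for illustration only; accent removal is left out.

```swift
import Foundation

// Hypothetical helper mirroring the question's prepareString:, minus accent removal.
func prepare(_ name: String) -> String {
    // One mutable buffer, modified in place.
    let buffer = NSMutableString(string: name.lowercased())
    for unwanted in [" ", "'", "`", "-", "_"] {
        // The range is recomputed each pass because the buffer shrinks.
        buffer.replaceOccurrences(of: unwanted,
                                  with: "",
                                  options: .literal,
                                  range: NSRange(location: 0, length: buffer.length))
    }
    return buffer as String
}

print(prepare("O'Brien-Smith _Jr`"))  // obriensmithjr
```

Whether this is fast enough for a few hundred strings is something to measure rather than assume.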

answered 2009-08-05T13:51:31.470
3

Just bumped into this. Maybe it's too late, but here is what worked for me:

// text is the input string, and this just removes accents from the letters

// lossy encoding turns accented letters into normal letters
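// dataUsingEncoding: returns an immutable NSData, so copy it into a mutable buffer first
NSMutableData *sanitizedData = [NSMutableData dataWithData:
    [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES]];

// increasing the length by 1 adds a 0 byte (increaseLengthBy:
// guarantees to fill the new space with 0s), effectively turning
// sanitizedData into a C string
[sanitizedData increaseLengthBy:1];

// now we just create a string from the C string in sanitizedData
// (stringWithCString: without an encoding is deprecated, so pass the encoding explicitly)
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];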
answered 2011-06-28T22:47:41.843
1
@interface NSString (Filtering)
    - (NSString*)stringByFilteringCharacters:(NSCharacterSet*)charSet;
@end

@implementation NSString (Filtering)
    - (NSString*)stringByFilteringCharacters:(NSCharacterSet*)charSet {
      NSMutableString * mutString = [NSMutableString stringWithCapacity:[self length]];
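      for (NSUInteger i = 0; i < [self length]; i++){
        // characterAtIndex: returns a unichar (UTF-16 unit); storing it in a char would truncate non-ASCII characters
        unichar c = [self characterAtIndex:i];
        if(![charSet characterIsMember:c]) [mutString appendFormat:@"%C", c];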
      }
      return [NSString stringWithString:mutString];
    }
@end
answered 2012-11-19T19:27:36.583
1

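These answers didn't work as expected for me. Specifically, decomposedStringWithCanonicalMapping didn't strip accents and umlauts as I'd expected.

Here's the variation I ended up using, which does what the question asks: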

// replace accents, umlauts etc with equivalent letter i.e 'é' becomes 'e'.
// Always use en_GB (or a locale without the characters you wish to strip) as locale, no matter which language we're taking as input
NSString *processedString = [string stringByFoldingWithOptions: NSDiacriticInsensitiveSearch locale: [NSLocale localeWithLocaleIdentifier: @"en_GB"]];
// remove non-letters
processedString = [[processedString componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
// trim whitespace
processedString = [processedString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceCharacterSet]];
return processedString;
answered 2014-12-02T14:40:26.373
0

Peter's solution in Swift:

let newString = oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")

Example:

let oldString = "Jo_ - h !. nn y"
// "Jo_ - h !. nn y"
oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet)
// ["Jo", "h", "nn", "y"]
oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")
// "Johnny"
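The snippets above use Swift 2 spellings. In current Swift the same call chain is spelled a little differently; a sketch, assuming Swift 3 or later:

```swift
import Foundation

let oldString = "Jo_ - h !. nn y"

// Split on every run of non-letters, then glue the pieces back together.
let newString = oldString
    .components(separatedBy: CharacterSet.letters.inverted)
    .joined()

print(newString)  // Johnny
```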
answered 2016-03-27T12:08:29.587
-1

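I wanted to filter out everything except letters and numbers, so I adapted Lorean's implementation of a Category on NSString to work a little differently. In this example, you specify a string containing only the characters you want to keep, and everything else is filtered out: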

@interface NSString (PraxCategories)
+ (NSString *)lettersAndNumbers;
- (NSString*)stringByKeepingOnlyLettersAndNumbers;
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string;
@end


@implementation NSString (PraxCategories)

+ (NSString *)lettersAndNumbers { return @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; }

- (NSString*)stringByKeepingOnlyLettersAndNumbers {
    return [self stringByKeepingOnlyCharactersInString:[NSString lettersAndNumbers]];
}

- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string {
    NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
    NSMutableString * mutableString = @"".mutableCopy;
    for (NSUInteger i = 0; i < [self length]; i++){
        // use unichar, not char, so non-ASCII UTF-16 units are not truncated
        unichar character = [self characterAtIndex:i];
        if([characterSet characterIsMember:character]) [mutableString appendFormat:@"%C", character];
    }
    return mutableString.copy;
}

@end

Once you've made your Categories, using them is trivial, and you can use them on any NSString:

NSString *string = someStringValueThatYouWantToFilter;

string = [string stringByKeepingOnlyLettersAndNumbers];

Or, for example, if you wanted to get rid of everything but vowels:

string = [string stringByKeepingOnlyCharactersInString:@"aeiouAEIOU"];

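If you're still learning Objective-C and aren't using Categories, I encourage you to try them out. They're the best place to put things like this, because they give more functionality to all objects of the class you categorize.

Categories simplify and encapsulate the code you're adding, making it easy to reuse in all of your projects. It's a great feature of Objective-C!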
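For readers working in Swift, an extension plays roughly the same role as a Category. The sketch below is an assumed translation of the idea, not the code above: the method name `keepingOnly(charactersIn:)` is invented for the example, and it compares whole Characters rather than UTF-16 units.

```swift
extension String {
    // Hypothetical counterpart of stringByKeepingOnlyCharactersInString:.
    // Keeps only the characters that also appear in `allowed`.
    func keepingOnly(charactersIn allowed: String) -> String {
        let allowedSet = Set(allowed)  // constant-time membership checks
        return String(self.filter { allowedSet.contains($0) })
    }
}

print("Hello, World 42!".keepingOnly(charactersIn: "aeiouAEIOU"))  // eoo
```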

answered 2015-01-10T05:03:40.390