I'm trying to tokenize a string into words in a Cocoa app, but I've run into a problem with NLTokenizer.
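When the input string starts with a symbol from the Unicode category "Other Symbol" or from the "Specials" block, such as NSTextAttachment.character, tokenization fails (i.e. it returns an empty list).
The problem only occurs when the symbol is directly followed by a word, with no space in between (see the examples below).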
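To check which Unicode general category each leading character of the failing cases actually belongs to, a small sketch like the following may help. It uses only the Swift standard library; the array name `leadingSymbols` is made up for illustration. As far as I can tell, U+FFFC, © and ® are "Other Symbol", while | is "Math Symbol" and \ is "Other Punctuation", so the trigger might be somewhat broader than a single category.

```swift
// Leading characters taken from the failing test cases in the reproduction.
// `leadingSymbols` is a hypothetical name used only for this sketch.
let leadingSymbols: [Unicode.Scalar] = ["\u{FFFC}", "©", "®", "|", "\\"]

for scalar in leadingSymbols {
    // `generalCategory` is the Unicode general category of the scalar.
    let category = scalar.properties.generalCategory
    let codePoint = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(codePoint): \(category)")
}
```

This only inspects the characters themselves; it says nothing about why NLTokenizer treats them the way it does.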
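Use case:
I have an NSAttributedString that can contain images anywhere in the text. Internally, these are represented by the object replacement character (U+FFFC). If the document starts with an image that is followed by a word rather than a space, tokenization fails.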
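For illustration, here is a minimal sketch of how such a string comes about, assuming AppKit on macOS and an empty NSTextAttachment standing in for a real image. The names `document` and `plain` are made up for this example.

```swift
import AppKit

// Build an attributed string that starts with an attachment, directly
// followed by text. An empty attachment stands in for a real image here.
let document = NSMutableAttributedString()
document.append(NSAttributedString(attachment: NSTextAttachment()))
document.append(NSAttributedString(string: "hello world"))

// The plain string carries the attachment as U+FFFC,
// the object replacement character.
let plain = document.string
let startsWithAttachment =
    plain.unicodeScalars.first?.value == UInt32(NSTextAttachment.character)
print(startsWithAttachment)
```

The resulting `plain` has the same shape as the first failing input in the reproduction: U+FFFC immediately followed by a word.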
Reproduction:
// Requires `import NaturalLanguage` (and `import XCTest` for the assertions).
/// Splits by natural language words.
static let tokenizeByWord: (String) -> [String] = { input in
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = input
    var tokens = [String]()
    tokenizer.enumerateTokens(in: input.startIndex..<input.endIndex) { tokenRange, _ in
        let token = input[tokenRange]
        tokens.append(String(token))
        return true // keep enumerating
    }
    return tokens
}
// ❌ These all fail: (string starts with symbol, followed by word)
XCTAssertEqual(tokenizeByWord("\u{FFFC}hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("©hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("®hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("|hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("\\hello world"), ["hello", "world"])
// ✅ These all pass: (space after symbol)
XCTAssertEqual(tokenizeByWord("\u{FFFC} hello world"), ["\u{FFFC}", "hello", "world"])
XCTAssertEqual(tokenizeByWord("© hello world"), ["©", "hello", "world"])
XCTAssertEqual(tokenizeByWord("® hello world"), ["®", "hello", "world"])
XCTAssertEqual(tokenizeByWord("| hello world"), ["|", "hello", "world"])
XCTAssertEqual(tokenizeByWord("\\ hello world"), ["\\", "hello", "world"])
// ✅ These all pass: (no space, but symbol right before second word)
XCTAssertEqual(tokenizeByWord("hello \u{FFFC}world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ©world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ®world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello |world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello \\world"), ["hello", "world"])
// ✅ Emoji pass with and without space:
XCTAssertEqual(tokenizeByWord("hello world" ), ["", "hello", "world"])
XCTAssertEqual(tokenizeByWord(" hello world"), ["", "hello", "world"])
System:
- macOS Catalina 10.15.7 (19H2)
- Xcode 12.4 (12D4e)