
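I'm trying to tokenize a string into words in a Cocoa application, but I've run into a problem with NLTokenizer.

When the input string starts with a symbol from the Unicode category "Other Symbol" or from the "Specials" block, such as NSTextAttachment.character, tokenization fails (i.e. it returns an empty list).

The problem only occurs when the symbol is directly followed by a word, with no space in between (see the examples below).

Use case:

I have an NSAttributedString that can contain images anywhere in the text. Internally, these are represented by the object replacement character (U+FFFC). If the document starts with an image that is followed by a word rather than a space, tokenization fails.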
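To make the use case concrete, here is a minimal sketch of how such a string can arise. It assumes AppKit on macOS and uses an empty NSTextAttachment as a stand-in for a real image; the names document and plainText are only illustrative.

```swift
import AppKit

// Build an attributed string that starts with an attachment (standing in
// for an image) directly followed by a word, with no space in between.
let document = NSMutableAttributedString()
document.append(NSAttributedString(attachment: NSTextAttachment()))
document.append(NSAttributedString(string: "hello world"))

// The plain-text view of the document is what gets handed to the tokenizer.
// The attachment shows up in it as the object replacement character U+FFFC.
let plainText = document.string
let startsWithAttachment = plainText.hasPrefix("\u{FFFC}")
print(startsWithAttachment)
```

The resulting plainText has the same shape as the first failing input in the reproduction below.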

To reproduce:

/// Splits by natural language words.
static let tokenizeByWord: (String) -> [String] = { input in
    
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = input
    
    var tokens = [String]()
    
    tokenizer.enumerateTokens(in: input.startIndex..<input.endIndex) { tokenRange, _ in
        let token = input[tokenRange]
        tokens.append(String(token))
        return true
    }
    return tokens
}
// ❌ These all fail: (string starts with symbol, followed by word)
XCTAssertEqual(tokenizeByWord("\u{FFFC}hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("©hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("®hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("|hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("\\hello world"), ["hello", "world"])

// ✅ These all pass: (space after symbol)
XCTAssertEqual(tokenizeByWord("\u{FFFC} hello world"), ["\u{FFFC}", "hello", "world"])
XCTAssertEqual(tokenizeByWord("© hello world"), ["©", "hello", "world"])
XCTAssertEqual(tokenizeByWord("® hello world"), ["®", "hello", "world"])
XCTAssertEqual(tokenizeByWord("| hello world"), ["|", "hello", "world"])
XCTAssertEqual(tokenizeByWord("\\ hello world"), ["\\", "hello", "world"])

// ✅ These all pass: (no space, but symbol right before second word)
XCTAssertEqual(tokenizeByWord("hello \u{FFFC}world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ©world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ®world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello |world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello \\world"), ["hello", "world"])

// ✅ Emoji pass with and without space:
XCTAssertEqual(tokenizeByWord("hello world" ), ["", "hello", "world"])
XCTAssertEqual(tokenizeByWord(" hello world"), ["", "hello", "world"])
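Since the inputs with a space after the symbol tokenize correctly, one possible workaround for the attachment case is to mask U+FFFC with a space before tokenizing. The sketch below is only that: maskingAttachmentCharacters(in:) is a hypothetical helper, not part of NaturalLanguage, and I have not verified that it covers every input. It also does nothing for the other symbols (©, ®, | and \).

```swift
/// Hypothetical helper (not an Apple API): returns a copy of `input` in which
/// every object replacement character (U+FFFC) is replaced by a space.
/// Both characters occupy one UTF-16 code unit, so UTF-16 offsets
/// (and therefore NSRange values) are the same in both strings.
func maskingAttachmentCharacters(in input: String) -> String {
    let attachmentScalar: Unicode.Scalar = "\u{FFFC}"
    let space: Unicode.Scalar = " "

    var masked = String.UnicodeScalarView()
    for scalar in input.unicodeScalars {
        // Swap only the attachment placeholder; copy everything else as-is.
        masked.append(scalar == attachmentScalar ? space : scalar)
    }
    return String(masked)
}

// The masked string starts with a space instead of U+FFFC.
let masked = maskingAttachmentCharacters(in: "\u{FFFC}hello world")
```

The masked string would then be passed to tokenizeByWord in place of the original. One trade-off is that the attachment character can no longer come back as a token of its own, unlike in the "\u{FFFC} hello world" case above. The tokens should also be read from the masked string rather than by applying its String.Index ranges to the original, because the two strings differ in their UTF-8 layout.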

System:

  • macOS Catalina 10.15.7 (19H2)
  • Xcode 12.4 (12D4e)