
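Does anyone know what is going on here?

The first block shows what I would normally expect to see: the first character of each string is at index '0'. The 'problem' string is commented out and replaced with one that looks exactly the same but has never been run before.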

public void finderTest(){
    String theDoc = "Hello, I want this to work, and work well! Do you think it will work, and if not, why not?";
    //String wordOne = "‭abc"; // old, pre-used string, used to hold a comma.
    String wordOne = "abc";// new, never run before with a comma
    String wordTwo = "and";
    System.out.println("Type of character at index '0' in theDoc: "+Character.getType(theDoc.charAt(0)));
    System.out.println("Character at index '0' in theDoc: "+theDoc.charAt(0));
    System.out.println();
    System.out.println("All of wordOne: "+"'"+wordOne+"'");
    System.out.println("Type of character at index '0' in wordOne: "+Character.getType(wordOne.charAt(0)));
    System.out.println("Character at index '0' in wordOne: "+wordOne.charAt(0));
    System.out.println();
    System.out.println("Type of Character at index '0' in wordTwo: "+Character.getType(wordTwo.charAt(0)));
    System.out.println("Character at index '0' in wordTwo: "+wordTwo.charAt(0));
}

This gives the output:

/*
Type of character at index '0' in theDoc: 1
Character at index '0' in theDoc: H

All of wordOne: 'abc'
Type of character at index '0' in wordOne: 2 // okay
Character at index '0' in wordOne: a // okay

Type of Character at index '0' in wordTwo: 2
Character at index '0' in wordTwo: a
*/

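In the second block the 'new' string is commented out, and the first character of 'wordOne' is nothing at all. It is not a null character or a newline. I had been using that variable to find a comma in 'theDoc'... but whenever I ran it, index '0' had nothing in it and index 1 held the comma. If I copy and paste the string, the problem persists. Commenting it out or deleting it, however, fixes the problem.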

public void finderTest(){
    String theDoc = "Hello, I want this to work, and work well! Do you think it will work, and if not, why not?";
    String wordOne = "‭abc"; // now running old string, used to hold comma
    //String wordOne = "abc"; 
    String wordTwo = "and";
    System.out.println("Type of character at index '0' in theDoc: "+Character.getType(theDoc.charAt(0)));
    System.out.println("Character at index '0' in theDoc: "+theDoc.charAt(0));
    System.out.println();
    System.out.println("All of wordOne: "+"'"+wordOne+"'");
    System.out.println("Type of character at index '0' in wordOne: "+Character.getType(wordOne.charAt(0)));
    System.out.println("Character at index '0' in wordOne: "+wordOne.charAt(0));
    System.out.println();
    System.out.println("Type of Character at index '0' in wordTwo: "+Character.getType(wordTwo.charAt(0)));
    System.out.println("Character at index '0' in wordTwo: "+wordTwo.charAt(0));
}

This gives the output:

/*  
    Type of character at index '0' in theDoc: 1
    Character at index '0' in theDoc: H

    All of wordOne: '‭abc'
    Type of character at index '0' in wordOne: 16 // What does this mean?
    Character at index '0' in wordOne: ‭   // where is the a? (well, its in wordOne index '1'... but why??)

    Type of Character at index '0' in wordTwo: 2
    Character at index '0' in wordTwo: a
*/

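Can a comma or symbol in Java cause a problem like this? I tried using character arrays and cleaning the workspace to rebuild everything, but nothing changed... This is a huge problem for finding the indices of 'ngrams' in a sentence when some of the grams are things like ', and'. At some point last night it was working, and then it suddenly stopped. I'm very confused.

Any ideas?

Thanks,

Andrew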


3 Answers

2

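I tried pasting your example into Eclipse, and it told me:

Some characters cannot be mapped using "Cp1252" character encoding.

and pointed at the first character in the string: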

String wordOne = "abc";

There seems to be a hidden (non-printable) character between the " and the a.
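If retyping the literal is not an option (for example, the n-grams come from pasted or external text), one possible workaround is to strip invisible format characters before searching. This is only a sketch: the class and method names are made up for illustration, the hidden character is assumed to be U+202B written as an escape, and the document string is shortened from the question.

```java
class StripHidden {
    // Remove every character in the Unicode "Format" general category (Cf),
    // which includes the invisible bidirectional control characters.
    static String stripFormatChars(String s) {
        return s.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        String theDoc = "Hello, I want this to work, and work well!";
        // Assumption: the pasted n-gram starts with a hidden U+202B.
        String pasted = "\u202B, and";

        System.out.println(theDoc.indexOf(pasted));                   // -1
        System.out.println(theDoc.indexOf(stripFormatChars(pasted))); // 26
    }
}
```

Stripping the whole Cf category is a blunt tool; if the text legitimately mixes left-to-right and right-to-left scripts, those controls may be there on purpose.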

answered 2012-02-12T19:35:21.033
1

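Character type 16 is Character.FORMAT (Unicode general category Cf), the category that holds invisible formatting characters such as the bidirectional controls, for example RIGHT-TO-LEFT EMBEDDING (U+202B). (The constant DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING also happens to be 16, but that value belongs to Character.getDirectionality, not Character.getType.) It is a non-printing character; you can print its hex value to confirm.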
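To print the hex values, something along these lines should do. It is a minimal sketch: the class name and helper are hypothetical, and the literal uses an escaped U+202B as a stand-in, since the actual hidden character in the question can only be identified by running this against the real string.

```java
class CodePointDump {
    // Render every code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the hidden character is written as an escape here
        // so that it stays visible in the source file.
        String wordOne = "\u202Babc";
        System.out.println(describe(wordOne)); // U+202B U+0061 U+0062 U+0063
    }
}
```

Anything that shows up before U+0061 in the dump of the real string is the culprit.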

answered 2012-02-12T19:35:32.140
0

Your string contains a character that you cannot see (before the 'a'). The Unicode set has dozens of characters with no meaningful visual representation, and this is probably one of them.

'16' is the character type, which is one of:

COMBINING_SPACING_MARK, CONNECTOR_PUNCTUATION, CONTROL, CURRENCY_SYMBOL, DASH_PUNCTUATION, DECIMAL_DIGIT_NUMBER, ENCLOSING_MARK, END_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, FORMAT, INITIAL_QUOTE_PUNCTUATION, LETTER_NUMBER, LINE_SEPARATOR, LOWERCASE_LETTER, MATH_SYMBOL, MODIFIER_LETTER, MODIFIER_SYMBOL, NON_SPACING_MARK, OTHER_LETTER, OTHER_NUMBER, OTHER_PUNCTUATION, OTHER_SYMBOL, PARAGRAPH_SEPARATOR, PRIVATE_USE, SPACE_SEPARATOR, START_PUNCTUATION, SURROGATE, TITLECASE_LETTER, UNASSIGNED, UPPERCASE_LETTER

All of these are defined in the Character class. I can't tell you which one it is, since that is in theory implementation-dependent; you should check the values. Or, better yet, use Character.getName to look up a human-readable description of the character.
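Rather than reading the number 16 by eye, the check can compare against the named constant and ask for the character's name. A rough sketch follows; the class and method names are invented for illustration, the escaped U+202B is an assumed stand-in for the hidden character, and Character.getName needs Java 7 or later.

```java
class CharTypeCheck {
    // True when the code point is in the FORMAT general category (Cf).
    static boolean isFormat(int codePoint) {
        return Character.getType(codePoint) == Character.FORMAT;
    }

    public static void main(String[] args) {
        // Assumption: escaped stand-in for the hidden character.
        String wordOne = "\u202Babc";
        int first = wordOne.codePointAt(0);

        System.out.println(isFormat(first));          // true
        System.out.println(Character.getName(first)); // RIGHT-TO-LEFT EMBEDDING
        System.out.println(isFormat('a'));            // false
    }
}
```

In the JDK's constant table Character.FORMAT is 16, which matches the output in the question.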

answered 2012-02-12T19:36:25.250