Normally, in order to remove non-word characters from a String the replaceAll method can be used:

String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");

The above returns a cleaned string "somestringwithnonwordssuchas".

However, if the string contains Cyrillic characters they get recognised as non-word, and get removed from the string. It is expected that Cyrillic characters would remain. Hence the question.

What is a proper way to remove non-word characters regardless of the language, assuming the string is UTF-8 encoded?

1 Answer

Try [^\\p{L}]. This should match every Unicode code point that is not a letter.
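As a minimal sketch of that suggestion (the class name `LetterFilter` and method name `keepLetters` are made up for illustration), the pattern can be dropped straight into `replaceAll`. The Cyrillic sample text is written as Unicode escapes only so that the snippet does not depend on the source file's encoding:

```java
class LetterFilter {
    // Hypothetical helper: removes every character that is not a Unicode
    // letter (general category L), so Cyrillic letters are kept.
    static String keepLetters(String input) {
        return input.replaceAll("[^\\p{L}]", "");
    }

    public static void main(String[] args) {
        // Sample text: "some string; <privet>, <mir>!" with the two Russian
        // words spelled in Cyrillic via Unicode escapes.
        String sample = "some string; \u043f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!";
        // Spaces and punctuation are dropped; Latin and Cyrillic letters remain.
        System.out.println(keepLetters(sample));
    }
}
```

Note that, unlike \\W, this pattern also removes digits and underscores, because they are not letters. If digits should survive, something like [^\\p{L}\\p{N}] is probably closer to what you want.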

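The Pattern class documentation gives a very thorough description of the available character classes. Note that the POSIX character classes are ASCII-only by default and will not help you much; you need the Unicode-specific classes.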

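Note that there is a UNICODE_CHARACTER_CLASS flag that changes the behaviour of the POSIX classes to conform to the relevant part of the Unicode standard (basically making them equivalent to their closest Unicode-aware equivalents).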
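A rough sketch of the flag in use, assuming Java 7 or later (the class name `UnicodeWordFilter` and method name `stripNonWord` are invented for this example). According to the Pattern documentation the flag also applies to the predefined classes such as \\w and \\W, so the original \\W pattern can be kept as it is:

```java
import java.util.regex.Pattern;

class UnicodeWordFilter {
    // With UNICODE_CHARACTER_CLASS, \W uses the Unicode notion of a word
    // character, so letters from any script, digits and underscores are kept.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: strips everything that is not a word character.
    static String stripNonWord(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample text: "abc_1 <privet>!" with the Russian word in Cyrillic,
        // written as Unicode escapes to stay independent of file encoding.
        System.out.println(stripNonWord("abc_1 \u043f\u0440\u0438\u0432\u0435\u0442!"));
    }
}
```

The same flag can be switched on inline with the embedded expression (?U), which keeps the one-liner style from the question: "...".replaceAll("(?U)\\W", "").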

Answered 2012-08-23T08:24:08.950