java - Removing all non-word characters in a Cyrillic UTF-8 encoded String

Question

Normally, non-word characters can be removed from a String with the replaceAll method:

String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");

The above returns a cleaned string "somestringwithnonwordssuchas".

However, if the string contains Cyrillic characters, they are treated as non-word characters and removed as well. I expected the Cyrillic characters to remain.

What is the proper way to remove non-word characters regardless of the language, assuming that the string has UTF-8 encoding?

score 7 · Accepted Answer

Try [^\\p{L}]. This should match every Unicode code point that is not a letter.
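A minimal sketch of that suggestion, assuming the goal is to keep letters from any script. The class and method names (LettersOnly, lettersOnly, lettersDigitsUnderscore) are made up for illustration. Unlike \\W, the pattern [^\\p{L}] also removes digits and underscores, so the second method shows one possible variant that keeps them by adding \\p{N} and _ to the negated class.

```java
public class LettersOnly {
    // Removes every code point that is not a Unicode letter (general category L).
    static String lettersOnly(String input) {
        return input.replaceAll("[^\\p{L}]", "");
    }

    // Variant (an assumption about what "word character" should mean here):
    // also keeps Unicode numbers (\p{N}) and the underscore.
    static String lettersDigitsUnderscore(String input) {
        return input.replaceAll("[^\\p{L}\\p{N}_]", "");
    }

    public static void main(String[] args) {
        String sample = "Привет, мир! some_string 42;";
        System.out.println(lettersOnly(sample));             // Приветмирsomestring
        System.out.println(lettersDigitsUnderscore(sample)); // Приветмирsome_string42
    }
}
```

Which variant fits depends on whether digits and underscores count as part of a word for the task at hand.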

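The documentation of the Pattern class has a very comprehensive description of the available character classes. Note that the POSIX character classes are ASCII-only by default and will not help you much; you need to use the Unicode-specific classes.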

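Also note that there is a UNICODE_CHARACTER_CLASS flag that changes the behaviour of the POSIX classes and of the predefined classes such as \\w and \\W to conform to the relevant part of the Unicode standard (basically making them equivalent to their closest Unicode-aware counterparts).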
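As a rough illustration of the flag, the sketch below compiles \\W with Pattern.UNICODE_CHARACTER_CLASS so that Cyrillic letters count as word characters. The class and method names (UnicodeWordClass, stripNonWord, stripNonWordInline) are hypothetical. The second method uses the embedded flag (?U), which should behave the same way and lets the original replaceAll call stay almost unchanged.

```java
import java.util.regex.Pattern;

public class UnicodeWordClass {
    // With UNICODE_CHARACTER_CLASS, \W no longer treats non-ASCII letters as non-word.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    static String stripNonWord(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    // Same idea using the embedded flag expression (?U) inside the regex itself.
    static String stripNonWordInline(String input) {
        return input.replaceAll("(?U)\\W", "");
    }

    public static void main(String[] args) {
        String sample = "Привет, мир! some_string 42;";
        System.out.println(stripNonWord(sample));       // Приветмирsome_string42
        System.out.println(stripNonWordInline(sample)); // Приветмирsome_string42
    }
}
```

Digits and underscores survive here, which matches the behaviour of the plain \\W in the question for ASCII input.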
