Normally, non-word characters can be removed from a String with the replaceAll
method:
String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");
The above returns a cleaned string "somestringwithnonwordssuchas".
However, if the string contains Cyrillic characters, they are treated as non-word characters and removed as well. I would expect the Cyrillic characters to remain, hence the question.
What is the proper way to remove non-word characters regardless of language, assuming the string is UTF-8 encoded?
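To make the problem concrete, here is a minimal sketch that reproduces it. The class name, method names, and sample text (Cyrillic for "hello, world!", written with Unicode escapes so the source stays ASCII-only) are made up for illustration. The second method shows one direction that might work, the embedded `(?U)` flag (`Pattern.UNICODE_CHARACTER_CLASS`, available since Java 7). It is included as an assumption to be confirmed, not as a settled answer.

```java
// Hypothetical demo class; all names here are illustrative only.
class NonWordDemo {

    // Default behaviour: \W is the complement of [a-zA-Z_0-9],
    // so every Cyrillic letter counts as a non-word character.
    static String stripAscii(String s) {
        return s.replaceAll("\\W", "");
    }

    // Candidate approach (assumption): the embedded flag (?U) turns on
    // Pattern.UNICODE_CHARACTER_CLASS, making \w and \W Unicode-aware.
    static String stripUnicode(String s) {
        return s.replaceAll("(?U)\\W", "");
    }

    public static void main(String[] args) {
        // Made-up sample: six Cyrillic letters, ", ", three more letters, "!"
        String sample = "\u043f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!";

        // Everything is removed, leaving an empty string.
        System.out.println(stripAscii(sample).length());

        // Only the comma, space, and exclamation mark are removed.
        System.out.println(stripUnicode(sample).length());
    }
}
```

With the default pattern the whole sample disappears, while the `(?U)` variant keeps the nine letters and drops only the punctuation and the space. An explicit negated class such as `[^\\p{L}\\p{N}_]` looks like another possible option, although it is not exactly equivalent (combining marks, for instance, are handled differently).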