java - Filter out UTF-8 punctuation and symbols from a string

Question

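What is the best and most efficient way to filter out all UTF-8 punctuation characters and symbols (such as ✀ ✁ ✂ ✃ ✄ ✅ ✆ ✇ ✈ etc.) from a string? Simply filtering out every character that is not in a-z, A-Z and 0-9 is not an option, because I want to keep letters from other languages (ą, ę, ó, etc.). Thanks in advance.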

score 3 · Accepted Answer

Try a combination of Unicode binary properties:

String fixed = value.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");
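
Since the question asks about efficiency: String.replaceAll compiles its regex on every call, so when many strings are filtered it may help to compile the pattern once and reuse it. A minimal sketch of that idea, using the same character class as above; the class name SymbolFilter and the method name stripSymbols are made up for illustration:

```java
import java.util.regex.Pattern;

final class SymbolFilter {
    // Hypothetical helper, not part of the original answer.
    // The pattern is compiled once and reused for every call.
    private static final Pattern NOT_LETTER_OR_DIGIT =
            Pattern.compile("[^\\p{IsAlphabetic}\\p{IsDigit}]");

    // Removes every character that is neither alphabetic nor a digit.
    static String stripSymbols(String value) {
        return NOT_LETTER_OR_DIGIT.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // Input: "sd, " + a-ogonek, e-ogonek, o-acute + " " + U+2702 (scissors) + " 42"
        System.out.println(stripSymbols("sd, \u0105\u0119\u00f3 \u2702 42"));
    }
}
```

Note that this character class also removes whitespace; add \\s inside the brackets if spaces should survive.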

score 3 · Accepted Answer

You can use \p{L} to match all Unicode letters. Example:

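public static void main(String[] args) {
    String[] test = {"asdEWR1", "ąęóöòæûùÜ", "sd,", "✀","✁","✂","✃","✄","✅","✆","✇","✈"};
    for (String s : test) {
        // Remove everything that is neither a Unicode letter nor an ASCII digit.
        // (A second ^ inside the brackets would be a literal caret and would be kept.)
        System.out.println(s + " => " + s.replaceAll("[^\\p{L}\\d]", ""));
    }
}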

Output:

asdEWR1 => asdEWR1
ąęóöòæûùÜ => ąęóöòæûùÜ
sd, => sd
✀ => 
✁ => 
✂ => 
✃ => 
✄ => 
✅ => 
✆ => 
✇ => 
✈ =>

score 1 · Accepted Answer

The idea is to strip the accents first.

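import java.text.Normalizer;

public static String onlyASCII(String s) {
    // Decompose any ŝ into s and a combining ^.
    String s2 = Normalizer.normalize(s, Normalizer.Form.NFD);
    // Remove everything that is neither ASCII nor a letter;
    // this drops the combining accents produced by the decomposition.
    return s2.replaceAll("[^\\u0000-\\u007E\\pL]", "");
}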

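The \\pL is there for Greek letters and the like. Note that this pattern keeps ASCII punctuation such as commas, since they fall inside the \\u0000-\\u007E range.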

score 0 · Accepted Answer

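The term "punctuation" is rather vague. The Character class provides a getType() method that maps to at least some of the character categories defined in the Unicode specification, so that is probably the best place to start.

I would also suggest applying "positive" logic (e.g. keep all letters and digits) rather than "negative" logic (no punctuation), since it is likely to be much simpler to test.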
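
A rough sketch of what such a "positive" filter based on Character.getType() might look like. The class name CategoryFilter, the method names, and the exact set of categories kept (letters, combining marks and decimal digits) are assumptions chosen for illustration, not something the answer prescribes:

```java
final class CategoryFilter {
    // Hypothetical whitelist: letters, combining marks and decimal digits.
    // Adjust the list of categories to whatever should count as "keep".
    static boolean isKept(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.DECIMAL_DIGIT_NUMBER:
                return true;
            default:
                return false;
        }
    }

    // Walks the string by code point so that characters outside the BMP
    // (encoded as surrogate pairs) are classified as a whole.
    static String keepLettersAndDigits(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isKept(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this whitelist, CategoryFilter.keepLettersAndDigits("sd,") gives "sd", and a symbol such as ✈ is dropped because its category (OTHER_SYMBOL) is not in the list.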
