19

java - How do I match characters outside the Unicode Basic Multilingual Plane in Java (in order to remove them)?

4

2 Answers

25

To remove all non-BMP characters, the following should work:

String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
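
As a rough, self-contained sketch of the one-liner above: Java's regex engine matches by code point, so a surrogate pair counts as a single character above U+FFFF and is removed as a whole. The class name `BmpSanitizer` and the method `stripNonBmp` are made up for illustration and are not part of the original answer.

```java
class BmpSanitizer {
    // Hypothetical helper: removes every code point above U+FFFF.
    static String stripNonBmp(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    public static void main(String[] args) {
        // "a", then one supplementary character stored as a surrogate pair, then "b"
        String input = "a" + new String(Character.toChars(0x1F600)) + "b";
        String cleaned = stripNonBmp(input);
        System.out.println(input.length());   // prints 4 (the pair takes two chars)
        System.out.println(cleaned);          // prints ab
    }
}
```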
Answered 2010-10-27T17:19:54.627
4

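Are you looking for specific characters outside the BMP, or for all of them?

If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and the regex will work as expected: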

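  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // "test" + U+10300 (outside the BMP, stored as a surrogate pair) + "test"
  String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
  // Build the pattern from the same code point
  Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());

  Matcher matcher = regex.matcher(test);
  if (matcher.find())
     System.out.println(matcher.start());  // prints 4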

If you want to remove all non-BMP characters from a string, I would use a StringBuilder directly rather than a regex:

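  StringBuilder sb = new StringBuilder(test.length());
  for (int ii = 0 ; ii < test.length() ; )
  {
     int codePoint = test.codePointAt(ii);
     if (codePoint > 0xFFFF)
     {
        // Supplementary code point: skip both chars of the surrogate pair
        ii += Character.charCount(codePoint);
     }
     else
     {
        // BMP code point: occupies exactly one char, so keep it
        sb.appendCodePoint(codePoint);
        ii++;
     }
  }
  String sanitized = sb.toString();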
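
On Java 8 or later, the same loop can probably be written more compactly with a code point stream. This is only a sketch of an alternative, not what the answer above proposes; the class name `BmpFilter` and the method `stripNonBmp` are hypothetical. It relies on `String.codePoints()` and `Character.isBmpCodePoint(int)`, and like the loop it keeps unpaired surrogates, since those are themselves BMP code points.

```java
class BmpFilter {
    // Hypothetical helper: keeps only code points in the range U+0000..U+FFFF.
    static String stripNonBmp(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // Same test string as in the answer: "test" + U+10300 + "test"
        String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
        System.out.println(stripNonBmp(test));  // prints testtest
    }
}
```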
Answered 2010-10-27T17:10:57.370