java - Remove non-ANSI characters from a UTF string while keeping the others

Question

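We have a Java library that accepts a UTF-8 string as input. However, the library may crash if the input contains any non-ANSI characters, so we want to strip all non-ANSI characters from the string. How can this be done in Java?

Thanks,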

score 1 · Accepted Answer

Try this. I pulled it from here, so I haven't tested it.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Create an encoder and decoder for the target character encoding
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

// This line is the key to removing "unmappable" characters:
// the encoder silently drops anything it cannot represent in US-ASCII.
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
String result = inString; // fall back to the original input if encoding fails

try {
    // Convert the string to bytes in a ByteBuffer
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));

    // Convert the bytes back to a CharBuffer and then to a string
    CharBuffer cbuf = decoder.decode(bbuf);
    result = cbuf.toString();
} catch (CharacterCodingException cce) {
    System.err.println("Exception during character encoding/decoding: " + cce.getMessage());
    cce.printStackTrace();
}

score 0 · Accepted Answer

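Take a look at String.codePointAt(index). It gives you the Unicode code point at the given index, and from there you can drop the code points that fall outside your allowed range.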
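The code-point approach can be sketched roughly as follows. This is a minimal illustration, not a tested drop-in: the class name `AsciiFilter`, the helper `keepUpTo`, and the choice of 0x7F (plain ASCII) as the upper bound are all assumptions made for the example, and the right bound depends on what "ANSI" means for your library.

```java
public class AsciiFilter {
    // Hypothetical helper: keeps only code points in the range [0, maxCodePoint].
    static String keepUpTo(String input, int maxCodePoint) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt combines a surrogate pair into one code point
            int cp = input.codePointAt(i);
            if (cp <= maxCodePoint) {
                out.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on the code point's width
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x7F is assumed here to mean "plain ASCII only"
        System.out.println(keepUpTo("h\u00e9llo \u4e16\u754c!", 0x7F)); // prints "hllo !"
    }
}
```

Stepping by `Character.charCount(cp)` rather than by one char means characters outside the Basic Multilingual Plane are removed as a whole instead of leaving half a surrogate pair behind.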

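How you deal with the fact that characters have been removed is up to you, but keep in mind that the string you send to the library will not necessarily be the same as the one the client supplied. This may or may not cause problems.

I'm not sure what you mean by ANSI here. Do you mean the Windows-1252 character encoding, which people commonly call ANSI? That is neither ASCII nor ISO-8859-1, so make sure you have the right code page.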
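To see how much the choice of code page matters, here is a small sketch that filters the same input against three charsets using `CharsetEncoder.canEncode`. The class name `CharsetFilter` and the helper `keepEncodable` are hypothetical, and the example assumes your JVM provides the `windows-1252` charset; only US-ASCII, ISO-8859-1 and the UTF encodings are guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharsetFilter {
    // Hypothetical helper: drops every code point the named charset cannot encode.
    static String keepEncodable(String input, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String piece = new String(Character.toChars(cp));
            // canEncode is false for unmappable characters and lone surrogates
            if (encoder.canEncode(piece)) {
                out.append(piece);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "caf\u00e9 \u20ac5 \u4e16"; // e-acute, euro sign, one CJK character
        System.out.println(keepEncodable(s, "US-ASCII"));     // drops e-acute, euro, CJK
        System.out.println(keepEncodable(s, "ISO-8859-1"));   // keeps e-acute only
        System.out.println(keepEncodable(s, "windows-1252")); // keeps e-acute and euro
    }
}
```

The euro sign is the telling case: Windows-1252 can encode it, while ISO-8859-1 and ASCII cannot, so picking the wrong charset either removes characters the library could handle or lets through ones it cannot.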
