java - API or method to replace all non-latin-1 characters

Question

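I'm working with third-party APIs / web services that only allow the latin-1 character set in their XML. Is there an existing API or method that finds and replaces all non-latin-1 characters in a string?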

For example: 凯文

Is there a way to turn that into Kevin?

score 2 · Accepted Answer

Using ICU4J:

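// Requires ICU4J on the classpath: import com.ibm.icu.text.Normalizer;
public String removeAccents(String text) {
    // Canonically decompose (base letter + combining marks), then strip the marks.
    return Normalizer.decompose(text, false, 0)
                 .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}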

I found this example at http://glaforge.appspot.com/article/how-to-remove-accents-from-a-string

From Java 1.6 on, the necessary normalizer is built in as java.text.Normalizer.

score 0 · Accepted Answer

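I came across many posts on how to remove all accents. This (old!) post covers my use case, so I'll share my solution here. In my case, I only want to replace characters that do not exist in the ISO-8859-1 charset. The use case: read a UTF-8 file and write it to an ISO-8859-1 file while preserving as many special characters as possible (but avoiding an UnmappableCharacterException).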

@Test
void proofOfConcept() {
    final String input = "Bełchatöw";
    final String expected = "Belchatöw";
    final String actual = MyMapper.utf8ToLatin1(input);
    Assertions.assertEquals(expected, actual);
}

Normalizer looks interesting, but I only found a way to remove all accents with it.

public static String utf8ToLatin1(final String input) {
    return Normalizer.normalize(input, Normalizer.Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

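Strangely, the code above not only fails, it gets things exactly backwards: the ö loses its diaeresis while the ł survives: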

expected: <Belchatöw> but was: <Bełchatow>
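The result is less strange once you look at what NFD actually produces. A minimal sketch (the class and method names are made up for illustration) that prints the decomposed code points: ö has a canonical decomposition into o plus a combining diaeresis, which the regex then strips, whereas ł has no canonical decomposition at all and passes through untouched.

```java
import java.text.Normalizer;

final class NfdDemo {
    /** Returns the NFD form of the input as space-separated U+XXXX code points. */
    static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nfdCodePoints("\u00f6")); // ö: prints U+006F U+0308
        System.out.println(nfdCodePoints("\u0142")); // ł: prints U+0142
    }
}
```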

CharsetEncoder looks interesting, but it seems I can only set a static "replacement" character (actually: a byte array), so every unmappable character becomes '?' or similar:

public static String utf8ToLatin1(final String input) throws CharacterCodingException {
    final ByteBuffer byteBuffer = StandardCharsets.ISO_8859_1.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
        .replaceWith(new byte[] { (byte) '?' })
        .encode(CharBuffer.wrap(input));
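    // Decode only the bytes actually written; array() exposes the whole backing array.
    return new String(byteBuffer.array(), byteBuffer.arrayOffset(), byteBuffer.remaining(), StandardCharsets.ISO_8859_1);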
}

This fails with:

expected: <Belchatöw> but was: <Be?chatöw>
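Although the encoder cannot pick a per-character replacement, it can at least answer the "find" half of the question. A rough sketch, assuming you only want to know which code points ISO-8859-1 cannot represent; Latin1Check and unmappableCodePoints are hypothetical names, while CharsetEncoder.canEncode is the standard JDK method.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Latin1Check {
    /** Collects the code points of the input that ISO-8859-1 cannot encode. */
    static List<Integer> unmappableCodePoints(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        List<Integer> result = new ArrayList<>();
        input.codePoints().forEach(cp -> {
            // Check one code point at a time so each offender can be reported.
            String single = new String(Character.toChars(cp));
            if (!encoder.canEncode(single)) {
                result.add(cp);
            }
        });
        return result;
    }
}
```

For the test string "Bełchatöw" this reports only U+0142 (ł), since ö is part of Latin-1.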

So my final solution is:

public static String utf8ToLatin1(final String input) {
    final Map<String, String> characterMap = new HashMap<>();
    characterMap.put("ł", "l");
    characterMap.put("Ł", "L");
    characterMap.put("œ", "ö");
    final StringBuffer resultBuffer = new StringBuffer();
    final Matcher matcher = Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]").matcher(input);
    while (matcher.find()) {
        matcher.appendReplacement(resultBuffer,
            characterMap.computeIfAbsent(matcher.group(),
                s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "")));
    }
    matcher.appendTail(resultBuffer);
    return resultBuffer.toString();
}

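A few notes:

characterMap needs to be extended to suit your needs. Normalizer handles accented characters well, but you may have other characters too. Also, extract characterMap into a field (note that computeIfAbsent updates the map, so watch out for concurrency!).
Pattern.compile() should not be called on every invocation; extract the pattern into a static field.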
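One possible way to apply both notes, as a sketch rather than a definitive implementation: the patterns are compiled once, and the map is filled in a static initializer and only read afterwards, so no computeIfAbsent write happens at call time. The class name Latin1Mapper and the constant names are assumptions; the mappings are the ones from the answer. Matcher.quoteReplacement is added so that a replacement containing '$' or '\' is taken literally.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Latin1Mapper {
    // Compiled once: any char outside Basic Latin and Latin-1 Supplement.
    private static final Pattern NON_LATIN1 =
        Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]");
    private static final Pattern DIACRITICS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Only read after class initialisation, so it can be shared between threads.
    private static final Map<String, String> CHARACTER_MAP = new HashMap<>();
    static {
        CHARACTER_MAP.put("\u0142", "l");      // ł
        CHARACTER_MAP.put("\u0141", "L");      // Ł
        CHARACTER_MAP.put("\u0153", "\u00f6"); // œ -> ö, as in the answer
    }

    static String utf8ToLatin1(String input) {
        StringBuffer result = new StringBuffer();
        Matcher matcher = NON_LATIN1.matcher(input);
        while (matcher.find()) {
            String replacement = CHARACTER_MAP.get(matcher.group());
            if (replacement == null) {
                // Fall back to stripping combining marks from the NFD form.
                String nfd = Normalizer.normalize(matcher.group(), Normalizer.Form.NFD);
                replacement = DIACRITICS.matcher(nfd).replaceAll("");
            }
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

Characters that have neither a map entry nor a canonical decomposition (CJK characters, for instance) pass through unchanged here, just as in the original, so they would still need an explicit mapping or a final replacement step before encoding.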
