java - Mapping supplementary Unicode characters to the BMP (if possible)

Question

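I have run into a problem where my XML parser (VTD-XML) does not seem to handle Unicode supplementary characters (please correct me if I am already wrong here). It appears that the parser uses only the low 16 bits of these characters.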

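I cannot switch to another parser in the project I am working on. I am parsing abstracts from Medline (https://www.ncbi.nlm.nih.gov/pubmed), and it seems that documents containing supplementary characters have been added over the past year (for example https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, at the end of the Results section).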

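As a quick and dirty fix, I would remove all characters above 0xFFFF from the documents. Obviously this destroys some expressions in the document text, so I am not happy with this solution.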
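For reference, a minimal sketch of that quick fix, assuming the document text is already available as a Java String. The class and method names are hypothetical and only serve the illustration.

```java
public class StripSupplementary {
    /** Drops every code point above U+FFFF and keeps all BMP characters as they are. */
    public static String stripSupplementary(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // reads a full surrogate pair if one starts at i
            i += Character.charCount(cp);   // advance by 1 or 2 chars
            if (!Character.isSupplementaryCodePoint(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\uD801\uDC00b";    // 'a', U+10400, 'b'
        System.out.println(stripSupplementary(input)); // prints "ab"
    }
}
```

Iterating by code point rather than by char keeps surrogate pairs together, so a pair is never half removed.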

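Since I cannot change the parser, I am wondering whether it is possible to map supplementary characters to BMP characters with similar-looking glyphs, where such characters exist.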
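One partial way to approximate such a mapping might be Unicode compatibility normalization (NFKC) via java.text.Normalizer. This is only a sketch under assumptions: it helps solely for supplementary characters that have a compatibility decomposition into BMP characters (for example the Mathematical Alphanumeric Symbols block), it is not a general "similar glyph" lookup, and the class name, method name and placeholder parameter are hypothetical.

```java
import java.text.Normalizer;

public class BmpFallback {
    /**
     * Replaces each supplementary code point with its NFKC form when that form
     * consists of BMP characters only; otherwise substitutes the given placeholder.
     * BMP characters are copied unchanged (they are NOT normalized).
     */
    public static String toBmp(String text, String placeholder) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isSupplementaryCodePoint(cp)) {
                out.appendCodePoint(cp);
                continue;
            }
            String single = new String(Character.toChars(cp));
            String folded = Normalizer.normalize(single, Normalizer.Form.NFKC);
            boolean allBmp = folded.codePoints()
                    .noneMatch(Character::isSupplementaryCodePoint);
            out.append(allBmp ? folded : placeholder);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D400 MATHEMATICAL BOLD CAPITAL A has a compatibility mapping to 'A'
        System.out.println(toBmp("\uD835\uDC00", "?"));  // prints "A"
        // U+10400 has no decomposition, so the placeholder is used
        System.out.println(toBmp("x\uD801\uDC00", "?")); // prints "x?"
    }
}
```

Characters without a decomposition, such as U+10400 from the test case further down, would still need a different strategy.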

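Of course, I welcome any other ideas. One could even replace the supplementary characters with some kind of placeholder and put the original characters back afterwards, but that seems error-prone. Any better ideas?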
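To make the placeholder idea concrete, here is a rough sketch that swaps each distinct supplementary code point for a character from the BMP Private Use Area (U+E000 to U+F8FF) and restores it afterwards. It assumes the input does not already contain Private Use Area characters and rejects it otherwise; all names are hypothetical. Note that the substitution changes the UTF-8 byte length (4 bytes become 3), so byte offsets computed on the substituted document do not apply to the original bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class PlaceholderRoundTrip {
    static final char PUA_START = '\uE000';
    static final char PUA_END = '\uF8FF';

    /** Replaces supplementary code points with PUA chars and records them in table. */
    public static String encode(String text, List<Integer> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= PUA_START && cp <= PUA_END) {
                throw new IllegalArgumentException("input already uses the Private Use Area");
            }
            if (!Character.isSupplementaryCodePoint(cp)) {
                out.appendCodePoint(cp);
                continue;
            }
            int index = table.indexOf(cp);
            if (index < 0) {
                index = table.size();
                if (PUA_START + index > PUA_END) {
                    throw new IllegalStateException("too many distinct supplementary characters");
                }
                table.add(cp);
            }
            out.append((char) (PUA_START + index));
        }
        return out.toString();
    }

    /** Restores the original code points using the table filled by encode. */
    public static String decode(String text, List<Integer> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int index = c - PUA_START;
            if (index >= 0 && index < table.size()) {
                out.appendCodePoint(table.get(index));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>();
        String original = "<supplementary>\uD801\uDC00</supplementary>";
        String safe = encode(original, table);   // contains U+E000 instead of U+10400
        System.out.println(decode(safe, table).equals(original)); // prints "true"
    }
}
```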

Edit: Here is a (hopefully) minimal example of how the problem occurs with VTD-XML:

@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
    // character code point 0x10400, written as the surrogate pair D801 DC00
    String unicode = "<supplementary>\uD801\uDC00</supplementary>";
    // encode explicitly as UTF-8 instead of relying on the platform default charset
    byte[] unicodeBytes = unicode.getBytes("UTF-8");
    assertEquals(unicode, new String(unicodeBytes, "UTF-8"));

    VTDGen vg = new VTDGen();
    vg.setDoc(unicodeBytes);
    vg.parse(false);
    VTDNav vn = vg.getNav();
    // lower 32 bits: byte offset of the element content, upper 32 bits: its length
    long fragment = vn.getContentFragment();
    int offset = (int) fragment;
    int length = (int) (fragment >> 32);
    String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset+length), "UTF-8");
    String vtdString = vn.toRawString(offset, length);
    // this actually succeeds
    assertEquals("\uD801\uDC00", originalBytePortion);
    // this fails ;-( the returned character is Ѐ, code point 0x400, i.e. only the low 16 bits of 0x10400 survive
    assertEquals("\uD801\uDC00", vtdString);
}
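
The value seen in the failing assertion matches plain 16-bit truncation of the code point. The sketch below only illustrates that arithmetic with standard Java calls; it is an inference from the test output, not a confirmed description of what VTD-XML does internally, and the class and method names are hypothetical.

```java
public class SurrogateArithmetic {
    /** Keeps only the low 16 bits of a code point, as a narrowing cast to char does. */
    public static char truncateTo16Bits(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int codePoint = 0x10400;
        char[] pair = Character.toChars(codePoint);
        // prints "high=D801 low=DC00"
        System.out.printf("high=%04X low=%04X%n", (int) pair[0], (int) pair[1]);
        // prints "truncated=0400"
        System.out.printf("truncated=%04X%n", (int) truncateTo16Bits(codePoint));
    }
}
```

Losing only the high surrogate would leave U+DC00 instead, so the observed U+0400 points to truncation rather than a dropped surrogate.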
