java - PDFBox - Can't encode a String composed of surrogate pairs

Question

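In my PDFBox implementation, I've written methods that write strings in many languages by testing different fonts to find one that can encode each character.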

PDFont currentFont = PDType0Font.load(pdfDocument, new File("path/to/font/font.ttf"));
for (int offset = 0; offset < sValue.length();) {
    int iCodePoint = sValue.codePointAt(offset);
    boolean isEncodable = isCodePointEncodable(currentFont, iCodePoint);
    // Further logic here, etc.

    offset += Character.charCount(iCodePoint);
}

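// Returns true if the font can encode the given code point, false otherwise.
private boolean isCodePointEncodable(PDFont currentFont, int iCodePoint) throws IOException {
    // Build a one-code-point string; this is a surrogate pair for anything above U+FFFF.
    StringBuilder st = new StringBuilder();
    st.appendCodePoint(iCodePoint);
    try {
        currentFont.encode(st.toString());
        return true;
    } catch (IllegalArgumentException iae) {
        // Treated here as "the font cannot encode this text".
        return false;
    }
}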

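While this works for everything in the Basic Multilingual Plane (BMP), it fails for any Unicode code point outside the BMP. I downloaded the fonts involved, inspected them extensively with a glyph chart tool, and logged every code point. For example, when trying to encode U+1F681 (decimal 128641), I traced the logging and found that the code specifically tried to encode this character with NotoEmoji-Regular.ttf, which is the correct font for it and does contain a glyph for this character. Unfortunately, the check still returned false.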
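
To make the BMP boundary concrete, here is a minimal sketch that uses only `java.lang` and no PDFBox calls. It shows how the code point from the question, U+1F681, is stored in a Java `String` as two `char` units (a surrogate pair). The class and method names are illustrative and not part of the original code.

```java
// Sketch: how a supplementary code point such as U+1F681 is represented in a Java String.
final class SurrogatePairDemo {

    /** Returns the UTF-16 code units of a code point as upper-case hex, separated by spaces. */
    static String utf16Units(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Cast to int: %X does not accept a char argument.
            hex.append(String.format("%04X", (int) unit));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x1F681; // decimal 128641, the code point from the question
        String s = new StringBuilder().appendCodePoint(codePoint).toString();

        System.out.println(s.length());                                    // char units: 2
        System.out.println(s.codePointCount(0, s.length()));               // code points: 1
        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        System.out.println(utf16Units(codePoint));                         // D83D DE81
    }
}
```

This is why the loop in the question advances by `Character.charCount(iCodePoint)` rather than by 1: a single code point outside the BMP occupies two positions in the string.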

Specifically, my log output showed this:

Code Point 128641 () cannot be encoded in font NotoEmoji

Is there any workaround or solution? Thank you.

score 1 · Accepted Answer

I have created and resolved issue PDFBOX-3997. The cause was that we weren't using the best cmap subtable.
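
The fix itself is not shown here. As a rough illustration of why the choice of cmap subtable matters, the sketch below models a font with two Windows-platform subtables: encoding 1 (Unicode BMP only) and encoding 10 (full Unicode repertoire). All types, method names, and glyph IDs are hypothetical. This is not PDFBox code, and the assumption that the font carries exactly these two subtables is made only for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: these types are hypothetical and are not PDFBox classes.
final class CmapSelectionSketch {

    static final int PLATFORM_WINDOWS = 3;
    static final int ENCODING_UNICODE_BMP = 1;   // 16-bit code points only
    static final int ENCODING_UNICODE_FULL = 10; // includes code points above U+FFFF

    /** A simplified cmap subtable: platform/encoding IDs plus a code point to glyph ID map. */
    static final class Subtable {
        final int platformId;
        final int encodingId;
        final Map<Integer, Integer> glyphIds = new HashMap<>();

        Subtable(int platformId, int encodingId) {
            this.platformId = platformId;
            this.encodingId = encodingId;
        }

        /** Returns the glyph ID for a code point, or 0 (the missing glyph) if unmapped. */
        int glyphId(int codePoint) {
            return glyphIds.getOrDefault(codePoint, 0);
        }
    }

    /** Returns the first subtable with the given IDs, or null if there is none. */
    static Subtable find(List<Subtable> tables, int platformId, int encodingId) {
        for (Subtable t : tables) {
            if (t.platformId == platformId && t.encodingId == encodingId) {
                return t;
            }
        }
        return null;
    }

    /** Picks the BMP subtable and never considers the full-repertoire one. */
    static Subtable pickBmpOnly(List<Subtable> tables) {
        return find(tables, PLATFORM_WINDOWS, ENCODING_UNICODE_BMP);
    }

    /** Prefers the full-repertoire subtable and falls back to the BMP one. */
    static Subtable pickPreferFull(List<Subtable> tables) {
        Subtable full = find(tables, PLATFORM_WINDOWS, ENCODING_UNICODE_FULL);
        return full != null ? full : find(tables, PLATFORM_WINDOWS, ENCODING_UNICODE_BMP);
    }

    /** Builds a toy font; the glyph IDs 5 and 42 are made up for the example. */
    static List<Subtable> sampleFont() {
        Subtable bmp = new Subtable(PLATFORM_WINDOWS, ENCODING_UNICODE_BMP);
        bmp.glyphIds.put(0x263A, 5);

        Subtable full = new Subtable(PLATFORM_WINDOWS, ENCODING_UNICODE_FULL);
        full.glyphIds.put(0x263A, 5);
        full.glyphIds.put(0x1F681, 42);

        List<Subtable> tables = new ArrayList<>();
        tables.add(bmp);
        tables.add(full);
        return tables;
    }

    public static void main(String[] args) {
        List<Subtable> font = sampleFont();
        System.out.println(pickBmpOnly(font).glyphId(0x1F681));    // 0: looks unencodable
        System.out.println(pickPreferFull(font).glyphId(0x1F681)); // 42: glyph found
    }
}
```

In this model the glyph exists in the font in both cases; only the lookup table consulted differs, which matches the symptom in the question where the font contains the character but encoding still fails.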

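There is no workaround, but the bug will be fixed in version 2.0.9, which is a few months away. You don't have to wait that long, though: you can test with a snapshot build.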
