c++ - PoDoFo Polish characters and PdfContentsTokenizer error

Question

1.

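How can I get Polish characters from a PDF file? Can I somehow tell PdfVariant::getString() that it will be dealing with Polish characters? I get \200 instead of ł, for example, and the funny thing is that this happens only when ł is the first "non-basic" character to occur. If the PDF file starts with aaaałęąaaaa, then ł is encoded as \200, ę as \201 and ą as \202; but if the PDF file starts with aaaaąęłaaaa, then ł is encoded as \202, ę as \201 and ą as \200. How can I get these characters on any system?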

2.

When I try to extract text from a PDF file, I do the following:

string input_name = "example.pdf";
PdfMemDocument pdf(input_name.c_str());
for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PdfPage* page = pdf.GetPage(pn);
    PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PdfVariant var;
    EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        // etc.

However, PdfContentsTokenizer tok(page); does not work reliably. For some PDF files it runs smoothly, while for others it throws an "Access violation reading location" error in the file inffas32.asm, line 669:

L_get_length_code_mmx:
pand mm4,mm0
movd eax,mm4
movq mm4,mm3
mov  eax, [ebx+eax*4]//this is the error line

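By the way, I noticed that not every PDF file is encoded the same way. For example, using podofobrowser I cannot see the Hello World! text in the official PoDoFo helloworld example. For other PDF files, podofobrowser displays the text differently or not at all.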

score 0 · Accepted Answer

Ad 1. Link to a patch file that allows extracting Polish text from a PDF using TextExtractor.

This is the most important line when extracting non-Unicode text from a PDF:

PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode( rString, pCurFont );
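
To illustrate why the raw bytes alone are not enough: the codes in the question (\200, \201, \202, i.e. 0x80, 0x81, 0x82) look as if they are assigned per document, in order of first use, so the same byte can stand for a different letter in another file. Below is a minimal sketch of that idea, independent of PoDoFo. The first-use assignment starting at 0x80 is an assumption based on the behaviour described in the question, and buildTable and decode are hypothetical helper names, not PoDoFo API.

```cpp
#include <map>
#include <string>

// Hypothetical model of a per-document font encoding: every non-ASCII
// character receives the next free code (0x80, 0x81, ...) in order of
// first use. This mimics what the question observes; it is not PoDoFo code.
std::map<unsigned char, char32_t> buildTable(const std::u32string& text) {
    std::map<unsigned char, char32_t> codeToChar;
    std::map<char32_t, unsigned char> charToCode;
    int next = 0x80;
    for (char32_t ch : text) {
        if (ch < 0x80 || charToCode.count(ch)) continue;  // ASCII or already mapped
        if (next > 0xFF) break;                           // no single-byte codes left
        charToCode[ch] = static_cast<unsigned char>(next);
        codeToChar[static_cast<unsigned char>(next)] = ch;
        ++next;
    }
    return codeToChar;
}

// Decode raw string bytes using the table that belongs to this document.
// Bytes without a table entry are passed through unchanged.
std::u32string decode(const std::string& raw,
                      const std::map<unsigned char, char32_t>& table) {
    std::u32string out;
    for (unsigned char b : raw) {
        auto it = table.find(b);
        out.push_back(it != table.end() ? it->second
                                        : static_cast<char32_t>(b));
    }
    return out;
}
```

With a table built from aaaałęąaaaa, the byte 0x80 decodes to ł; with a table built from aaaaąęłaaaa, the same byte decodes to ą. That is why the conversion has to go through the encoding of the current font, as in the ConvertToUnicode line above, rather than through a fixed code page.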

Ad 2. The problem was a bad build of the zlib library. I rebuilt it, rebuilt PoDoFo, and the problem disappeared.
