poppler - pdf2htmlEX - text in the HTML differs from the source PDF

Question

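I am using pdf2htmlEX to convert PDF files to HTML. Afterwards, I also extract the text from the resulting file.

Problem:

I ran into a file whose text is unreadable in the converted HTML: https://dspace.mit.edu/openaccess-disseminate/1721.1/101159

The command I used: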

pdf2htmlEX --tounicode 1 ./file.pdf

The text in the HTML contains many stray spaces and many quotation marks:

[2]"M."Ha hn,"O."Bar bie ri,"FP."C ampa na,"R."K öt z,"R."G alla y,"A pp l."Ph ys ."A :"M a ter."S ci."Pro ce ss."8 2 "(2 00 6 )"

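Setting other values for the --tounicode argument makes the text garbled.

There is an online tool that uses this library, and the HTML it generates is fine, which suggests that this is not a pdf2htmlEX bug but a configuration or version issue. It may be related to poppler or fontforge.

Versions: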

pdf2htmlEX version 0.14.6
Copyright 2012-2015 Lu Wang <coolwanglu@gmail.com> and other contributors
Libraries: 
  poppler 0.54.0
  libfontforge 20180906
  cairo 1.14.6
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg

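I also tried the newer repository that now maintains the project and got the same result; see the issue: https://github.com/pdf2htmlEX/pdf2htmlEX/issues/92

As you may know, pdf2htmlEX uses a wide range of characters in place of spaces, such as " ' ( ) +. So I cannot simply replace them all.

Is there any way to make pdf2htmlEX not use these characters?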

score -1 · Accepted Answer

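I think the following two steps will work:

Use a regular expression to remove the unnecessary spaces and quotation marks.
Wrap each reference in paragraph tags, like this: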

<div>
::before
<p>[2] something </p>
::after
</div>
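
The two steps above could be sketched in Python roughly as follows. This is only a post-processing workaround under a strong assumption taken from the sample line in the question: every literal space is spurious and every double quote stands in for a real space. Other files may use a different stand-in character (the question mentions ' ( ) + as well), so the result needs checking per document. The function names `clean_reference` and `to_paragraph` are hypothetical helpers, not part of pdf2htmlEX.

```python
import html
import re


def clean_reference(raw: str) -> str:
    """Undo the substitution seen in the sample line from the question.

    Assumption: in this output every literal space is spurious and every
    double quote stands in for a real space. This is a heuristic for this
    one sample, not a general rule for pdf2htmlEX output.
    """
    no_spaces = re.sub(r"\s+", "", raw)       # drop the spurious spaces
    restored = re.sub(r'"+', " ", no_spaces)  # quotes become real spaces
    return restored.strip()


def to_paragraph(reference: str) -> str:
    """Wrap one cleaned reference in a <p> element, escaping HTML specials."""
    return f"<p>{html.escape(reference)}</p>"


sample = (
    '[2]"M."Ha hn,"O."Bar bie ri,"FP."C ampa na,"R."K öt z,"R."G alla y,'
    '"A pp l."Ph ys ."A :"M a ter."S ci."Pro ce ss."8 2 "(2 00 6 )"'
)
print(to_paragraph(clean_reference(sample)))
```

Because the heuristic also deletes any genuine quotation marks and spaces, it does not address the underlying cause, which the question suspects is a configuration or version issue.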
