python - Installing Scraperwiki for Python generates the error "pdftohtml not found"

Question

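I have been trying to install the Scraperwiki module for Python. However, it produces this error:

"UserWarning: Local Scraperlibs requires pdftohtml, but pdftohtml was not found in the PATH. You probably need to install it."

I looked into poppler, since it includes pdftohtml, but I can't work out how it is meant to be used: do I need to install a Python library or an .exe file, and how do I install it? I'm running on Windows.

Many thanks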

score 0 · Accepted Answer

If you don't plan to use scraperwiki.pdftoxml(), the warning doesn't apply to you. In any case, it won't prevent you from installing the scraperwiki package.
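To see whether the warning would apply on your machine, you can check if pdftohtml is visible on PATH. This is a minimal sketch using only the standard library; the helper name find_pdftohtml is made up for illustration and is not part of scraperwiki:

```python
import shutil


def find_pdftohtml():
    """Return the full path to pdftohtml if it is on PATH, else None."""
    # shutil.which honours PATHEXT on Windows, so "pdftohtml"
    # also matches pdftohtml.exe there.
    return shutil.which("pdftohtml")


if __name__ == "__main__":
    path = find_pdftohtml()
    if path:
        print("pdftohtml found at", path)
    else:
        print("pdftohtml is not on PATH")
```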

In addition, that function doesn't work on Windows at all. It uses NamedTemporaryFiles, which behave differently on Windows and Linux.
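The temporary-file difference is easy to run into. On Windows, a file created by tempfile.NamedTemporaryFile generally can't be opened a second time by name while it is still open, whereas on Linux it can. One common workaround, shown here as a sketch (the helper name is hypothetical, and this is not what scraperwiki itself does), is to pass delete=False, close the file, and clean it up yourself:

```python
import os
import tempfile


def roundtrip_via_named_file(data):
    """Write bytes to a named temp file, reopen it by name, return the contents."""
    # delete=False keeps the file on disk after close(), so a second
    # open() (or an external program) can use the name on Windows too.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        tmp.close()  # harmless if the file is already closed
        os.remove(tmp.name)  # we are responsible for cleanup now
```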

If you do want to use that function, the easiest way to get a recent version of pdftohtml on Windows is to download Calibre Portable. (The version on Sourceforge is older.)

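Install it anywhere; you only need a few files from it. From the folder containing calibre.exe in your install location, copy pdftohtml.exe into your working folder. From the DLLs folder in the Calibre install, copy freetype.dll, jpeg.dll, libpng12.dll, and zlib1.dll.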
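The copy step can be scripted. The sketch below assumes the DLLs folder sits directly inside the folder that contains calibre.exe, which may not match your install; the function name and both folder arguments are placeholders:

```python
import os
import shutil

# File names as listed above.
DLL_NAMES = ["freetype.dll", "jpeg.dll", "libpng12.dll", "zlib1.dll"]


def copy_pdftohtml_files(calibre_dir, work_dir):
    """Copy pdftohtml.exe and the listed DLLs into work_dir."""
    # Assumption: <calibre_dir>/DLLs holds the DLLs; adjust if yours differs.
    sources = [os.path.join(calibre_dir, "pdftohtml.exe")]
    sources += [os.path.join(calibre_dir, "DLLs", name) for name in DLL_NAMES]
    copied = []
    for src in sources:
        if not os.path.isfile(src):
            raise FileNotFoundError(src)
        # copy2 returns the path of the newly created file
        copied.append(shutil.copy2(src, work_dir))
    return copied
```

Call it with the folder containing calibre.exe and your working folder; it raises FileNotFoundError naming the first file it cannot find.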

You will also need code based on scraperwiki.pdftoxml(), for example:

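import os

def pdftoxml(pdfdata, options):
    """converts pdf file to xml file"""
    # lots of hacky Windows fixes c.f. original
    # Write the PDF bytes to a fixed file name in the working folder.
    with open('input.pdf', 'wb') as f:
        f.write(pdfdata)
    cmd = 'pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes '
    if options:
        cmd += options + ' '  # keep a space before the file names
    cmd += 'input.pdf output.xml'
    # Discard pdftohtml's console output (NUL is the Windows null device).
    cmd = cmd + " > NUL 2>&1"
    os.system(cmd)
    with open('output.xml', 'r') as f:
        return f.read()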

(I was recently trying to get this working for a user on Windows; I'll keep the gist containing this code updated.)
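As an alternative sketch, the same command can be built as an argument list and run with subprocess.run, which avoids the shell redirection and quoting. The flags are the ones used in the code above; the function names and the file-name defaults are illustrative, and this is not the code from scraperwiki or from the gist:

```python
import subprocess


def build_pdftohtml_cmd(options=None, pdf_path="input.pdf", xml_path="output.xml"):
    """Return the pdftohtml argument list for converting pdf_path to XML."""
    cmd = ["pdftohtml", "-xml", "-nodrm", "-zoom", "1.5",
           "-enc", "UTF-8", "-noframes"]
    if options:
        cmd += options.split()  # extra flags given as a space-separated string
    return cmd + [pdf_path, xml_path]


def run_pdftohtml(options=None):
    """Run pdftohtml quietly; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftohtml_cmd(options),
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                   check=True)
```

Building the list separately from running it makes the command easy to inspect or log before pdftohtml is actually on the machine.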
