python - .doc to pdf using python

Question

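I've been tasked with converting a large number of .doc files to .pdf, and the only way my supervisor wants me to do it is through MSWord 2010. I know I should be able to automate this with Python COM automation. The only problem is that I don't know how or where to start. I tried searching for tutorials but couldn't find any (maybe I did, but I don't know what I'm looking for).

Right now I'm reading through this. I don't know how useful it is going to be.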

score 91 · Accepted Answer

A simple example using comtypes that converts a single file, with the input and output filenames given as command-line arguments:

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

You could also use pywin32, which would be the same except for:

import win32com.client

and then:

word = win32com.client.Dispatch('Word.Application')
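A slightly more defensive variant of the comtypes script above, offered as a sketch rather than as part of the original answer: it derives the output name from the input and wraps the COM calls in try/finally so that Word is closed even when the conversion fails. The helper names `pdf_path_for` and `convert_one` are made up for illustration, and comtypes is imported inside the function so the module can still be loaded on a machine without it.

```python
import os

WD_FORMAT_PDF = 17  # WdSaveFormat.wdFormatPDF


def pdf_path_for(in_file):
    """Return the absolute path of in_file with its extension swapped for .pdf."""
    base, _ext = os.path.splitext(os.path.abspath(in_file))
    return base + ".pdf"


def convert_one(in_file, out_file=None):
    """Convert one document with Word (needs Windows, Word and comtypes)."""
    import comtypes.client  # lazy import: only present on Windows setups

    in_file = os.path.abspath(in_file)
    out_file = os.path.abspath(out_file) if out_file else pdf_path_for(in_file)

    word = comtypes.client.CreateObject('Word.Application')
    try:
        doc = word.Documents.Open(in_file)
        try:
            doc.SaveAs(out_file, FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close()  # always release the document
    finally:
        word.Quit()  # always shut Word down, even after an error
    return out_file
```

Calling `convert_one(sys.argv[1])` would then write the PDF next to the input file.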

score 30 · Accepted Answer

You can use the docx2pdf Python package to bulk convert docx to pdf. It can be used both as a CLI and as a Python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

pip install docx2pdf
docx2pdf input.docx output.pdf

Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

score 17 · Accepted Answer

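I worked on this problem for half a day, so I thought I should share some of my experience. Steven's answer is right, but it fails on my computer. Two key points fixed it:

(1). After creating the 'Word.Application' object, I have to make it (the Word app) visible before opening any document. (I can't actually explain why this works. If I don't do it on my computer, the program crashes when I try to open a document in invisible mode, and the 'Word.Application' object is then deleted by the OS.)

(2). After doing (1), the program sometimes works but still fails often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond that quickly. So I added a delay before trying to open a document.

With these two steps in place the program runs without failures. The demo code is below. If you run into the same problems, try following these two steps. Hope it helps.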

    import os
    import comtypes.client
    import time


    wdFormatPDF = 17


    # absolute path is needed
    # be careful about the slash '\', use '\\' or '/' or raw string r"..."
    in_file=r'absolute path of input docx file 1'
    out_file=r'absolute path of output pdf file 1'

    in_file2=r'absolute path of input docx file 2'
    out_file2=r'absolute path of outputpdf file 2'

    # print out filenames
    print(in_file)
    print(out_file)
    print(in_file2)
    print(out_file2)


    # create COM object
    word = comtypes.client.CreateObject('Word.Application')
    # key point 1: make word visible before open a new document
    word.Visible = True
    # key point 2: wait for the COM Server to prepare well.
    time.sleep(3)

    # convert docx file 1 to pdf file 1
    doc=word.Documents.Open(in_file) # open docx file 1
    doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 1
    word.Visible = False
    # convert docx file 2 to pdf file 2
    doc = word.Documents.Open(in_file2) # open docx file 2
    doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 2   
    word.Quit() # close Word Application
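As an alternative to the fixed three-second sleep, the rejected call itself could be retried. The following is a minimal sketch of that idea and is not part of the original answer: `retry` is a hypothetical helper, and which exception types to retry on is left to the caller (with comtypes that would presumably be `comtypes.COMError`).

```python
import time


def retry(func, attempts=5, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds, sleeping `delay` seconds between tries.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
```

With this in place, the open call could be written as `doc = retry(lambda: word.Documents.Open(in_file))`, so the script only waits when Word is actually busy.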

score 13 · Accepted Answer

I have tested many solutions, but none of them works efficiently on Linux distributions.

I recommend this solution:

import sys
import subprocess
import re


def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]

    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    filename = re.search('-> (.*?) using filter', process.stdout.decode())

    return filename.group(1)


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'

You then call the function:

result = convert_to('TEMP Directory',  'Your File', timeout=15)

Source:

https://michalzalecki.com/converting-docx-to-pdf-using-python/
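One caveat with `convert_to` above: if LibreOffice fails, `re.search` returns `None` and `filename.group(1)` raises an `AttributeError`. A small sketch of a more forgiving parser follows; `parse_converted_path` is a hypothetical name, and the sample output line in the usage note is an assumption inferred from the regular expression in the answer, not captured from a real run.

```python
import re

# Same pattern as in convert_to, compiled once
_CONVERTED = re.compile(r'-> (.*?) using filter')


def parse_converted_path(stdout_text):
    """Return the output path reported by LibreOffice, or None if absent."""
    match = _CONVERTED.search(stdout_text)
    return match.group(1) if match else None
```

For a line such as `convert /tmp/in.docx -> /tmp/out/in.pdf using filter : writer_pdf_Export` this returns `/tmp/out/in.pdf`; the caller can then treat `None` as a failed conversion and report it instead of crashing.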

score 7 · Accepted Answer

unoconv (written in Python) with OpenOffice running as a headless daemon. http://dag.wiee.rs/home-made/unoconv/

It works very nicely for doc, docx, ppt, pptx, xls and xlsx. Very useful if you need to convert docs or save/convert to certain formats on a server.

score 7 · Accepted Answer

As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.

doc.ExportAsFixedFormat(OutputFileName=pdf_file,
    ExportFormat=17, #17 = PDF output, 18=XPS output
    OpenAfterExport=False,
    OptimizeFor=0,  #0=Print (higher res), 1=Screen (lower res)
    CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
    DocStructureTags=True
    )

The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
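To avoid the bare numbers in the call above, the enumeration values could be given names. This is only a sketch: the constants mirror the values listed in the comments of the answer (with Word's own enumeration names noted beside them), while `export_kwargs` is a hypothetical helper that just builds the keyword arguments.

```python
# Values of the Word enumerations used by ExportAsFixedFormat
WD_EXPORT_FORMAT_PDF = 17               # wdExportFormatPDF
WD_EXPORT_OPTIMIZE_FOR_PRINT = 0        # wdExportOptimizeForPrint
WD_EXPORT_OPTIMIZE_FOR_ONSCREEN = 1     # wdExportOptimizeForOnScreen
WD_EXPORT_CREATE_NO_BOOKMARKS = 0       # wdExportCreateNoBookmarks
WD_EXPORT_CREATE_HEADING_BOOKMARKS = 1  # wdExportCreateHeadingBookmarks
WD_EXPORT_CREATE_WORD_BOOKMARKS = 2     # wdExportCreateWordBookmarks


def export_kwargs(pdf_file, for_screen=False,
                  bookmarks=WD_EXPORT_CREATE_HEADING_BOOKMARKS):
    """Build the keyword arguments for doc.ExportAsFixedFormat."""
    return {
        "OutputFileName": pdf_file,
        "ExportFormat": WD_EXPORT_FORMAT_PDF,
        "OpenAfterExport": False,
        "OptimizeFor": (WD_EXPORT_OPTIMIZE_FOR_ONSCREEN if for_screen
                        else WD_EXPORT_OPTIMIZE_FOR_PRINT),
        "CreateBookmarks": bookmarks,
        "DocStructureTags": True,
    }
```

The export then becomes `doc.ExportAsFixedFormat(**export_kwargs(pdf_file))`.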

score 4 · Accepted Answer

It's worth noting that Steven's answer works, but if you use a for loop to export multiple files, make sure to place the CreateObject or Dispatch statement before the loop, since the application object only needs to be created once. See my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
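A sketch of what that looks like in practice: the Word application object is created once by the caller and passed into the loop. `convert_all` is a hypothetical helper, not code from the linked question; it takes the application object as a parameter so that one failing document does not stop the rest.

```python
WD_FORMAT_PDF = 17  # WdSaveFormat.wdFormatPDF


def convert_all(word, pairs):
    """Convert each (in_file, out_file) pair with one already-created Word app.

    Returns a list of (in_file, error) for the documents that failed.
    """
    failures = []
    for in_file, out_file in pairs:
        try:
            doc = word.Documents.Open(in_file)
            try:
                doc.SaveAs(out_file, FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close()
        except Exception as exc:  # keep going; report the failure afterwards
            failures.append((in_file, exc))
    return failures


# Intended use (Windows with Word installed):
#   word = win32com.client.Dispatch('Word.Application')  # once, before the loop
#   try:
#       failures = convert_all(word, pairs)
#   finally:
#       word.Quit()                                       # once, after the loop
```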

score 2 · Accepted Answer

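If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented there could be adapted to use the wdFormatPDF enumeration value of WdSaveFormat (see here). This blog article presents a different implementation of the same idea.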

score 2 · Accepted Answer

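I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing, which were usually an order of magnitude bigger than expected. After looking into how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and have been rather impressed with its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.

The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file, which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another, so that the settings are applied correctly to each file.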

import os, re, time, datetime, win32com.client

def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName      # make sure we're controlling the right PDF printer

    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
    statusFile = re.sub(r"\.[^.]+$", ".status", file)

    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no")     # disable balloon tip
    settings.SetValue("StatusFile", statusFile)         # created after print job
    settings.WriteSettings(True)                        # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName)       # send to Bullzip virtual printer

    # wait until print job completes before continuing
    # otherwise settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while( (datetime.datetime.now() - timestamp).seconds < 10):
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")

score 1 · Accepted Answer

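You should start by investigating so-called virtual PDF print drivers. As soon as you find one, you should be able to write a batch file that prints your DOC files to PDF files. You can probably do this in Python too (set up the printer driver output and issue the document/print command in MSWord; the latter can be done from the command line, AFAIR).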

score 0 · Accepted Answer

I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...

import os
import win32com.client

wdFormatPDF = 17

# Extensions Word is asked to open; matched case-insensitively
extensions = ('.doc', '.odt', '.rtf', '.docx', '.dotm', '.docm')

# Create the Word application once, before the loop
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
try:
    for root, dirs, files in os.walk(r'your directory here'):
        for f in files:
            if not f.lower().endswith(extensions):
                continue
            print(f)
            # Word needs absolute paths
            in_file = os.path.abspath(os.path.join(root, f))
            # splitext drops the extension whatever its length
            out_file = os.path.splitext(in_file)[0] + '.pdf'
            try:
                doc = word.Documents.Open(in_file)
                try:
                    doc.SaveAs(out_file, FileFormat=wdFormatPDF)
                finally:
                    doc.Close()
            except Exception:
                print('could not open')
                continue
            print('done')
            os.remove(in_file)  # deletes the source document after conversion
finally:
    word.Quit()

The try/except is there for documents that could not be read, so that the script keeps going until the last document instead of exiting.

score 0 · Accepted Answer

I have modified it to support ppt as well. My solution supports all of the extensions specified below.

word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]

My solution: Github link

I modified the code from Docx2PDF.
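The linked code is not reproduced here, but the two extension lists could drive a small dispatcher along these lines. This is a guess at the general shape, not the author's implementation: `office_app_for` is a hypothetical helper that maps a filename to the COM ProgID of the application expected to open it.

```python
import os

word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]


def office_app_for(filename):
    """Return the COM ProgID suited to filename, or None if unsupported."""
    ext = os.path.splitext(filename)[1].lower()  # compare case-insensitively
    if ext in word_extensions:
        return "Word.Application"
    if ext in ppt_extensions:
        return "PowerPoint.Application"
    return None
```

Files for which this returns `None` can simply be skipped by the caller.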

score -8 · Accepted Answer

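I would suggest ignoring your supervisor and using OpenOffice, which has a Python API. OpenOffice has built-in support for Python, and someone has created a library specifically for this purpose (PyODConverter).

If he isn't happy with the output, tell him it could take you weeks to get it done.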
