python - How to use python-docx to replace text in a Word document and save

Question

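The oodocx module mentioned on the same page points the user to an /examples folder that does not seem to exist.
I have read the python-docx 0.7.2 documentation, as well as everything I could find on Stack Overflow on the subject, so please believe that I have done my "homework".

Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, XML, etc.

Task: Open an MS Word 2007+ document containing a single line of text (to keep things simple) and replace any "key" word from Dictionary that occurs in that line of text with its dictionary value. Then close the document, keeping everything else the same.

Line of text (for example): "We shall linger in the chambers of the sea."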

from docx import Document

document = Document('/Users/umityalcin/Desktop/Test.docx')

Dictionary = {'sea': 'ocean'}

sections = document.sections
for section in sections:
    print(section.start_type)

#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.

document.save('/Users/umityalcin/Desktop/Test.docx')

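I do not see anything in the documentation that allows me to do this. Maybe it is there but I do not get it, because not everything is spelled out at my level.

I have followed other suggestions on this site and tried to use an earlier version of the module (https://github.com/mikemaccana/python-docx), which is supposed to have "methods like replace, advReplace", as follows: I open the source code in the Python interpreter and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):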

document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None)

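Running this produces the following error message:

NameError: name 'coreprops' is not defined

Perhaps I am trying to do something that cannot be done, but I would appreciate your help if I am missing something simple.

In case it matters, I am using the 64-bit version of Enthought's Canopy on OSX 10.9.3.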

score 65 · Accepted Answer

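UPDATE: There are a couple of paragraph-level functions that do a good job of this; they can be found on the python-docx GitHub site.

One replaces a regex match with a replacement str. The replacement string is formatted the same as the first character of the matched string.
The other isolates a run so that some formatting can be applied to that word or phrase, such as highlighting each occurrence of "foobar" in the text, or perhaps making it bold or showing it in a larger font.

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky, and it has not yet risen to the top of the backlog.

Several people have had success, though, getting done what they need with the facilities already present. Here is an example. It has nothing to do with sections, by the way :)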

for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'

To search in tables as well, you would need to use something like:

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    paragraph.text = paragraph.text.replace("sea", "ocean")

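If you pursue this path, you will probably discover quite quickly what the complexities are. If you replace the entire text of a paragraph, that removes any character-level formatting, such as a word or phrase in bold or italic.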
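One partial mitigation can be sketched as follows, assuming it is acceptable for the whole paragraph to take on the formatting of its first run. Instead of assigning to paragraph.text, the new text is written into the first run and the remaining runs are emptied. The helper name overwrite_keeping_first_run is hypothetical, and the SimpleNamespace objects only stand in for python-docx runs (anything with a writable .text attribute, such as the items of paragraph.runs, has the same shape).

```python
from types import SimpleNamespace


def overwrite_keeping_first_run(runs, new_text):
    """Write new_text into the first run and blank the others.

    Hypothetical helper: `runs` is any sequence of objects with a
    writable `.text` attribute. Returns False when there are no runs.
    """
    if not runs:
        return False
    runs[0].text = new_text
    for run in runs[1:]:
        run.text = ""
    return True


# Stand-ins for the runs of a paragraph where "sea" sits in its own run.
runs = [
    SimpleNamespace(text="We shall linger in the chambers of the "),
    SimpleNamespace(text="sea"),
    SimpleNamespace(text="."),
]
full_text = "".join(run.text for run in runs)
overwrite_keeping_first_run(runs, full_text.replace("sea", "ocean"))
```

Mixed formatting inside the paragraph is still lost with this approach; only the first run's character formatting and the paragraph style survive.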

By the way, the code in @wnnmaw's answer is for the legacy version of python-docx and will not work at all with versions after 0.3.0.

score 31 · Accepted Answer

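I needed something to replace regular expressions in docx files. I started from scanny's answer. To handle styles, I used the answer to "Python docx Replace string in paragraph while keeping style", added a recursive call to handle nested tables, and came up with something like this: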

import re
from docx import Document

def docx_replace_regex(doc_obj, regex , replace):

    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text

    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex , replace)



regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')

To iterate over a dictionary:

for word, replacement in dictionary.items():
    word_re=re.compile(word)
    docx_replace_regex(doc, word_re , replacement)
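When the dictionary keys are literal words rather than patterns, compiling them directly can misfire on characters such as $ or {. A minimal sketch, assuming literal keys, is to escape each key with re.escape and combine all keys into one alternation so the text is scanned once. The name build_replacer is hypothetical and the example works on plain strings; the returned function could be applied to each run's text.

```python
import re


def build_replacer(mapping):
    """Return a function that applies every literal replacement in one pass.

    Hypothetical helper: keys are treated as literal text, not as regexes.
    """
    if not mapping:
        return lambda text: text
    # Longest keys first, so "seaside" is preferred over "sea" at the same spot.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def replace_all(text):
        # A function as the replacement avoids backslash/group interpretation.
        return pattern.sub(lambda match: mapping[match.group(0)], text)

    return replace_all


replace_all = build_replacer({"sea": "ocean", "${NAME}": "Ada", "seaside": "coast"})
result = replace_all("${NAME} left the seaside for the sea.")
```

Because the text is scanned only once, replaced text is not scanned again, so a mapping such as {"a": "b", "b": "c"} does not cascade.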

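Note that docx_replace_regex replaces a match only if the entire match sits in a single run, that is, the whole matched text has the same style in the document.

Also, if text is edited after it was first saved, text with the same style may still be split into separate runs. For example, if you open a document containing the string "testabcd", change it to "test1abcd", and save, the text is stored as three separate runs, "test", "1", and "abcd", even though the style is identical. In that case, replacing "test1" will not work.

This splitting exists to track changes in the document. To have the text stored as one run, in Word go to "Options", "Trust Center", and under "Privacy Options" uncheck "Store random numbers to improve combine accuracy", then save the document.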
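The split-run case can also be handled in code. The sketch below is one possible approach, not part of python-docx: it joins the run texts, finds the first match, writes the replacement into the run where the match starts, and trims the matched characters from the runs that follow. The name replace_across_runs is hypothetical, and SimpleNamespace objects stand in for python-docx runs (any objects with a writable .text attribute).

```python
from types import SimpleNamespace


def replace_across_runs(runs, old, new):
    """Replace the first occurrence of `old`, even if it spans several runs.

    Hypothetical helper. The replacement takes the formatting of the run
    in which the match starts. Returns True if a replacement was made.
    """
    if not old:
        return False
    full = "".join(run.text for run in runs)
    start = full.find(old)
    if start < 0:
        return False
    end = start + len(old)

    pos = 0          # offset of the current run inside `full`
    placed = False   # True once `new` has been written somewhere
    for run in runs:
        run_start, run_end = pos, pos + len(run.text)
        pos = run_end
        if run_end <= start or run_start >= end:
            continue  # this run does not overlap the match
        before = run.text[:max(start - run_start, 0)]
        after = run.text[end - run_start:]
        if placed:
            run.text = after
        else:
            run.text = before + new + after
            placed = True
    return True


# The "test" / "1" / "abcd" split described above.
runs = [SimpleNamespace(text="test"), SimpleNamespace(text="1"), SimpleNamespace(text="abcd")]
replaced = replace_across_runs(runs, "test1", "demo")
texts = [run.text for run in runs]
```

Only the first occurrence is replaced per call; calling it in a loop is possible but would not terminate if the replacement text itself contains the search text.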

score 18 · Accepted Answer

Sharing a small script I wrote. It helps me generate legal .docx contracts with variables while preserving the original styling.

pip install python-docx

Example:

from docx import Document
import os


def main():
    template_file_path = 'employment_agreement_template.docx'
    output_file_path = 'result.docx'

    variables = {
        "${EMPLOEE_NAME}": "Example Name",
        "${EMPLOEE_TITLE}": "Software Engineer",
        "${EMPLOEE_ID}": "302929393",
        "${EMPLOEE_ADDRESS}": "דרך השלום מנחם בגין דוגמא",
        "${EMPLOEE_PHONE}": "+972-5056000000",
        "${EMPLOEE_EMAIL}": "example@example.com",
        "${START_DATE}": "03 Jan, 2021",
        "${SALARY}": "10,000",
        "${SALARY_30}": "3,000",
        "${SALARY_70}": "7,000",
    }

    template_document = Document(template_file_path)

    for variable_key, variable_value in variables.items():
        for paragraph in template_document.paragraphs:
            replace_text_in_paragraph(paragraph, variable_key, variable_value)

        for table in template_document.tables:
            for col in table.columns:
                for cell in col.cells:
                    for paragraph in cell.paragraphs:
                        replace_text_in_paragraph(paragraph, variable_key, variable_value)

    template_document.save(output_file_path)


def replace_text_in_paragraph(paragraph, key, value):
    if key in paragraph.text:
        inline = paragraph.runs
        for item in inline:
            if key in item.text:
                item.text = item.text.replace(key, value)


if __name__ == '__main__':
    main()

score 17 · Accepted Answer

I got a lot of help from the earlier answers, but for me the code below works like the simple find-and-replace feature in Word. Hope this helps.

#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i)>=0:
            p.text=p.text.replace(i,Dictionary[i])
#save changed document
doc.save('./test.docx')

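The above solution has limitations: 1) a paragraph containing "find_this_text" becomes plain text without any formatting, 2) content controls in the same paragraph as "find_this_text" are deleted, and 3) "find_this_text" inside content controls or tables is not changed.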
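The table limitation can be addressed by separating traversal from replacement. The generator below is a sketch that assumes the container exposes .paragraphs and .tables, tables expose .rows, and rows expose .cells, which is the shape python-docx documents and table cells have. The name iter_paragraphs is hypothetical, and SimpleNamespace objects are used as stand-ins so the example runs without a .docx file. It does not reach content controls.

```python
from types import SimpleNamespace


def iter_paragraphs(container):
    """Yield the container's paragraphs, then those in its tables, recursively."""
    for paragraph in container.paragraphs:
        yield paragraph
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell can itself hold paragraphs and nested tables.
                yield from iter_paragraphs(cell)


def _para(text):
    return SimpleNamespace(text=text)


# Stand-in document: one body paragraph, one table cell, one nested table cell.
inner_cell = SimpleNamespace(paragraphs=[_para("nested sea")], tables=[])
inner_table = SimpleNamespace(rows=[SimpleNamespace(cells=[inner_cell])])
outer_cell = SimpleNamespace(paragraphs=[_para("cell sea")], tables=[inner_table])
outer_table = SimpleNamespace(rows=[SimpleNamespace(cells=[outer_cell])])
doc = SimpleNamespace(paragraphs=[_para("body sea")], tables=[outer_table])

for paragraph in iter_paragraphs(doc):
    if "sea" in paragraph.text:
        paragraph.text = paragraph.text.replace("sea", "ocean")

texts = [paragraph.text for paragraph in iter_paragraphs(doc)]
```

With a real document, the same loop could be run over iter_paragraphs(doc) where doc comes from Document('./test.docx'); the formatting caveat in limitation 1 still applies because paragraph.text is assigned.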

score 2 · Accepted Answer

For the table case, I had to modify @scanny's answer to:

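for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:
                # Placeholder body: the original snippet stopped at the loop header.
                if 'sea' in p.text:
                    p.text = p.text.replace('sea', 'ocean')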

to make it work. Indeed, this does not seem to work with the current state of the API:

for table in document.tables:
    for cell in table.cells:

The code here has the same problem: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149

score 1 · Accepted Answer

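The Office Dev Center has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, and requiring porting): "MS Dev Center posting".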

score 0 · Accepted Answer

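The problem with your second attempt is that you have not defined the parameters that savedocx needs. You need to do something like this before you save: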

relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []

coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                       keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"

score 0 · Accepted Answer

The API in docx py has changed again...

For the sanity of everyone coming here:

import datetime
import os
from decimal import Decimal
from typing import NamedTuple

from docx import Document
from docx.document import Document as nDocument


class DocxInvoiceArg(NamedTuple):
  invoice_to: str
  date_from: str
  date_to: str
  project_name: str
  quantity: float
  hourly: int
  currency: str
  bank_details: str


class DocxService():
  tokens = [
    '@INVOICE_TO@',
    '@IDATE_FROM@',
    '@IDATE_TO@',
    '@INVOICE_NR@',
    '@PROJECTNAME@',
    '@QUANTITY@',
    '@HOURLY@',
    '@CURRENCY@',
    '@TOTAL@',
    '@BANK_DETAILS@',
  ]

  def __init__(self, replace_vals: DocxInvoiceArg):
    total = replace_vals.quantity * replace_vals.hourly
    invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
    self.replace_vals = [
      {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
      {'search': self.tokens[1], 'replace': replace_vals.date_from },
      {'search': self.tokens[2], 'replace': replace_vals.date_to },
      {'search': self.tokens[3], 'replace': invoice_nr },
      {'search': self.tokens[4], 'replace': replace_vals.project_name },
      {'search': self.tokens[5], 'replace': replace_vals.quantity },
      {'search': self.tokens[6], 'replace': replace_vals.hourly },
      {'search': self.tokens[7], 'replace': replace_vals.currency },
      {'search': self.tokens[8], 'replace': total },
      {'search': self.tokens[9], 'replace': replace_vals.bank_details },
    ]
    self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
    self.doc_path_output = self.doc_path_template + 'output/'
    self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')


  def save(self):
    for p in self.document.paragraphs:
      self._docx_replace_text(p)
    tables = self.document.tables
    self._loop_tables(tables)
    self.document.save(self.doc_path_output + 'testiboi3.docx')

  def _loop_tables(self, tables):
    for table in tables:
      for index, row in enumerate(table.rows):
        for cell in table.row_cells(index):
          if cell.tables:
            self._loop_tables(cell.tables)
          for p in cell.paragraphs:
            self._docx_replace_text(p)

        # for cells in column.
        # for cell in table.columns:

  def _docx_replace_text(self, p):
    print(p.text)
    for el in self.replace_vals:
      if (el['search'] in p.text):
        inline = p.runs
        # Loop added to work with runs (strings with same style)
        for i in range(len(inline)):
          print(inline[i].text)
          if el['search'] in inline[i].text:
            text = inline[i].text.replace(el['search'], str(el['replace']))
            inline[i].text = text
        print(p.text)

Test case:

from django.test import SimpleTestCase
from docx.table import Table, _Rows

from toggleapi.services.DocxService import DocxService, DocxInvoiceArg


class TestDocxService(SimpleTestCase):

  def test_document_read(self):
    ds = DocxService(DocxInvoiceArg(invoice_to="""
    WAW test1
    Multi myfriend
    """,date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
    Paypal to:
    bippo@bippsi.com"""))

    ds.save()

The docs and docs/output/ folders must sit in the same folder as DocxService.py.

Be sure to parameterize and replace the hard-coded values.

score 0 · Accepted Answer

The python-docx-template library is very useful for this. It is well suited to editing Word documents and saving them back in .docx format.
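A minimal sketch of how this could look, assuming the package is installed (pip install docxtpl) and that the template contains Jinja2-style tags such as {{ employee_name }}. The function name render_contract, the file names, and the context keys are hypothetical.

```python
def render_contract(template_path, context, output_path):
    """Fill the template's tags from `context` and save the result."""
    # Imported here so the module still loads when docxtpl is not installed.
    from docxtpl import DocxTemplate

    template = DocxTemplate(template_path)
    template.render(context)
    template.save(output_path)


# Keys must match the tag names used inside the template document.
context = {"employee_name": "Example Name", "start_date": "03 Jan, 2021"}

# Uncomment once a matching template file exists:
# render_contract("employment_agreement_template.docx", context, "result.docx")
```

Because the tags are written in Word itself, the replacement text takes on whatever formatting the tag has in the template.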

score 0 · Accepted Answer

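As some other users have shared above, one of the challenges in finding and replacing text in a Word document is retaining styles when a word spans multiple runs. This can happen if the word has multiple styles, or if the word was edited several times while the document was being created. Simple code that assumes a word is found entirely within a single run is therefore often incorrect, so the python-docx-based code shared above may not work in many scenarios.

You can try the following API: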

https://rapidapi.com/more.sense.tech@gmail.com/api/document-filter1

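It has generic code to handle these scenarios. The API currently handles only paragraph text; table text is not supported at the moment, and I will try to add it soon.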
