python - Clear new lines in docx

Question

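I have a docx file that contains many new lines between sections, and I need to remove a new line whenever it appears more than once in a row. I unzip the file using: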

z = zipfile.ZipFile('File.docx','a')
z.extractall()

Inside the directory word there is a file document.xml that contains all the data, but I can't figure out how to tell where the new lines are in the XML.

I know that extracting it is not the solution (I only used it here to show where the file is). I think I can use:

z.write('Document.xml')

Can anyone help me?

score 1 · Accepted Answer

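tlewis's code finds specific text in a docx and replaces it. In your case there is something else to do: detect the new lines and check whether several of them occur in a row. A new line is simply a paragraph (a <w:p> tag) with no text inside it.

I added some comments to show you how to work with the zip file.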

import zipfile  # Read and write the zip container that a docx file is
from lxml import etree  # Parse the XML bytes into a tree and serialize it back

templateDocx = zipfile.ZipFile("C:/Template.docx")  # Path to the file you want to read
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")  # Name of the output file

# Open document.xml, the file that holds the content.
# Binary mode: lxml rejects a str that carries an XML encoding declaration.
with open(templateDocx.extract("word/document.xml", "C:/"), "rb") as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

tempXmlXml = etree.fromstring(tempXmlStr)  # Convert the bytes into XML
############
# Algorithm detailed at the bottom.
# Write the code here that selects all <w:p> tags and checks whether each one contains a <w:t> tag.
############

tempXmlStr = etree.tostring(tempXmlXml, pretty_print=True)  # Convert the changed XML back into bytes

with open("C:/temp.xml", "wb") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)  # Write the changed file

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))  # Copy every file except the changed one into the new archive

newDocx.write("C:/temp.xml", "word/document.xml")  # Add the changed document.xml

templateDocx.close()  # Close both the template and the new docx
newDocx.close()

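How to write the algorithm to remove multiple new lines

Here is a sample document I created:

[Image: multi-line docx]

And here is the corresponding markup in document.xml: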

 <w:p w:rsidR="006C517B" w:rsidRDefault="00761A87">
         <w:bookmarkStart w:id="0" w:name="_GoBack" />
         <w:bookmarkEnd w:id="0" />
         <w:r>
            <w:t>First Line</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
         <w:proofErr w:type="spellStart" />
         <w:r>
            <w:t>Third</w:t>
         </w:r>
         <w:proofErr w:type="spellEnd" />
         <w:r>
            <w:t xml:space="preserve"> Line</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
         <w:r>
            <w:t>Six Line</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
         <w:proofErr w:type="spellStart" />
         <w:r>
            <w:t>Ten</w:t>
         </w:r>
         <w:proofErr w:type="spellEnd" />
         <w:r>
            <w:t xml:space="preserve"> Line</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
         <w:proofErr w:type="spellStart" />
         <w:r>
            <w:t>Eleven</w:t>
         </w:r>
         <w:proofErr w:type="spellEnd" />
         <w:r>
            <w:t xml:space="preserve"> Line</w:t>
         </w:r>
      </w:p>

As you can see, the new lines are empty <w:p> elements, like this:

<w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />

To remove multiple new lines, check whether several empty <w:p> elements appear in a row and delete all of them except the first.
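One possible way to fill in the placeholder in the code above is sketched here. It is only an illustration under a few assumptions: the helper names `is_empty_paragraph` and `collapse_empty_paragraphs` are made up for this example, a paragraph counts as empty when it contains no <w:r> run at all (slightly stricter than looking only for <w:t>, so paragraphs holding an image are left alone), and the standard library's xml.etree.ElementTree is used so the snippet runs without extra packages. The calls used here (`fromstring`, `find`, `remove`, `list(element)`) also exist in lxml.etree.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix is bound to in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = "{%s}p" % W_NS
R_TAG = "{%s}r" % W_NS


def is_empty_paragraph(elem):
    """Hypothetical helper: True for a <w:p> with no <w:r> run at any depth."""
    return elem.tag == P_TAG and elem.find(".//" + R_TAG) is None


def collapse_empty_paragraphs(parent):
    """Hypothetical helper: keep only the first of each run of empty <w:p> children."""
    removed = 0
    previous_was_empty = False
    for child in list(parent):  # iterate over a copy because children are removed
        empty = is_empty_paragraph(child)
        if empty and previous_was_empty:
            parent.remove(child)
            removed += 1
        previous_was_empty = empty
    return removed


# A cut-down stand-in for the <w:body> of the sample document
sample = (
    '<w:body xmlns:w="%s">'
    "<w:p><w:r><w:t>First Line</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p><w:r><w:t>Third Line</w:t></w:r></w:p>"
    "<w:p/><w:p/>"
    "<w:p><w:r><w:t>Six Line</w:t></w:r></w:p>"
    "<w:p/><w:p/><w:p/>"
    "<w:p><w:r><w:t>Ten Line</w:t></w:r></w:p>"
    "</w:body>" % W_NS
)
body = ET.fromstring(sample)
removed_count = collapse_empty_paragraphs(body)
remaining = len(body)
```

In a real document the function would be called on the <w:body> element of the parsed document.xml. Only direct children are examined, so empty paragraphs inside table cells are not touched by this sketch.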

Hope that helps!

score -2 · Accepted Answer

From here:

import zipfile

replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")

# Read the extracted XML as UTF-8 text (the closing parenthesis of open() was missing)
with open(templateDocx.extract("word/document.xml", "C:/"), encoding="utf-8") as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

for key in replaceText.keys():
    tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))

with open("C:/temp.xml", "w+", encoding="utf-8") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))

newDocx.write("C:/temp.xml", "word/document.xml")

templateDocx.close()
newDocx.close()

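Explanation:

Step 1) Prepare a Python dictionary with the text strings to replace as keys and the new text as values (e.g. {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}).

Step 2) Open the template docx file using the zipfile module.

Step 3) Open a new docx file in append mode.

Step 4) Extract document.xml (where all the text lives) from the template docx file and read the XML into a string variable.

Step 5) Use a for loop to replace every piece of text defined in the dictionary with its new text in the XML string.

Step 6) Write the XML string to a new temporary XML file.

Step 7) Use a for loop and the zipfile module to copy every file in the template docx archive to the new docx archive, except word/document.xml.

Step 8) Write the temporary XML file with the replaced text into the new docx archive as the new word/document.xml.

Step 9) Close your template and new docx archives.

Step 10) Open your new docx document and enjoy your replaced text!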
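The same steps can also be carried out without a temporary file on disk. The sketch below is not the answer's code but a variation on it: the function name `replace_in_docx` is hypothetical, it assumes document.xml is encoded as UTF-8, and the demonstration uses a tiny stand-in archive built in memory instead of a real docx.

```python
import io
import zipfile


def replace_in_docx(src, dst, replacements):
    """Hypothetical helper: copy archive src to dst, substituting text in word/document.xml."""
    with zipfile.ZipFile(src) as template, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as output:
        for info in template.infolist():
            data = template.read(info.filename)
            if info.filename == "word/document.xml":
                text = data.decode("utf-8")  # assumption: the part is UTF-8
                for old, new in replacements.items():
                    text = text.replace(old, new)
                data = text.encode("utf-8")
            output.writestr(info.filename, data)  # every other file is copied unchanged


# Build a small stand-in archive in memory (not a valid Word document)
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as z:
    z.writestr("word/document.xml", "<doc>Dear XXXCLIENTNAMEXXX</doc>")
    z.writestr("word/styles.xml", "<styles/>")
source.seek(0)

target = io.BytesIO()
replace_in_docx(source, target, {"XXXCLIENTNAMEXXX": "Joe Bob"})

with zipfile.ZipFile(target) as z:
    result = z.read("word/document.xml").decode("utf-8")
    names = z.namelist()
```

`src` and `dst` may be file paths as well as file-like objects, so the same call works with paths such as the ones used in the answer.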
