python - OpenOffice odt documents, regular expressions and arrays

Question

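I'm trying to work with an odt document of about 300 pages. I know how to load a document in Python, at least in a basic way, but that doesn't work for odt (it isn't a txt file). I looked into it and installed the odfpy library, although it doesn't seem to be well documented. I got as far as having the document's paragraphs in an array, but I don't know how to run a regex across multiple array entries. So I tried converting the array to a string with "str()", and all I got was a long list of addresses.

I want to be able to load an odt document and run a regex to remove a certain pattern from it. How do I do that? What I've tried so far doesn't work. I want to keep the odt's structure intact. I'm more used to txt.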

import sys
import re
from odf.opendocument import load
from odf import text, teletype
infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
allparas = infile.getElementsByType(text.P)
stringallparas = str(allparas)

This is what I have so far, and I believe it loads successfully. But some things that work for .txt don't work here.

score 0 · Accepted Answer

Something like the following might work. Replace 'Your pattern here' with the regex pattern you want to remove.

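import re
from odf.opendocument import load
from odf import text, teletype
infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
for item in infile.getElementsByType(text.P):
    # Flatten the paragraph (including nested elements) to a plain string
    s = teletype.extractText(item)
    m = re.sub(r'Your pattern here', '', s)
    if m != s:
        # Only paragraphs the pattern actually changed are rebuilt
        new_item = text.P()
        style = item.getAttribute('stylename')
        if style is not None:
            # Carry over the paragraph style when the original names one
            new_item.setAttribute('stylename', style)
        new_item.addText(m)
        # Put the new paragraph in the old one's place
        item.parentNode.insertBefore(new_item, item)
        item.parentNode.removeChild(item)

infile.save('result.odt')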
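
Before running a pattern over a 300-page document, it can help to try the regex step on plain strings first. The sketch below is only an illustration: the pattern, the function name strip_pattern, and the sample strings are hypothetical and not part of the original answer. It mirrors the "only touch paragraphs that changed" check from the loop, using re.subn to count substitutions.

```python
import re

# Hypothetical pattern for illustration: bracketed numeric markers such as "[12]".
PATTERN = re.compile(r'\[\d+\]')

def strip_pattern(paragraphs, pattern=PATTERN):
    """Return (index, new_text) pairs for the paragraphs the pattern changes."""
    changed = []
    for i, s in enumerate(paragraphs):
        # subn returns the new string and the number of substitutions made
        new_s, count = pattern.subn('', s)
        if count and new_s != s:
            changed.append((i, new_s))
    return changed

# Plain strings standing in for the output of teletype.extractText
sample = [
    'First paragraph [1] with a marker.',
    'Nothing to remove here.',
    'Two [2] markers [33].',
]
for index, new_text in strip_pattern(sample):
    print(index, repr(new_text))
```

Note that removing a match leaves the surrounding whitespace as it was, so the pattern may need to absorb an adjacent space if doubled spaces are unwanted.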

The for loop over the text.P elements is taken from ReplaceOneTextToAnother on the odfpy wiki, lightly modified to use re.sub instead of str.replace and text.P instead of text.Span.
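
As for the "long list of addresses" in the question: calling str() on a Python list formats each element with repr(), and objects that define no __repr__ show up as their type plus a memory address. A minimal sketch of that behaviour follows, using a hypothetical FakeParagraph class as a stand-in for the elements returned by getElementsByType (it is not an odfpy class). The text has to be pulled out of each element individually, which is what teletype.extractText does in the answer.

```python
import re

class FakeParagraph:
    """Hypothetical stand-in for a document element; it defines no __repr__."""
    def __init__(self, content):
        self.content = content

paras = [FakeParagraph('alpha 123'), FakeParagraph('beta')]

# str() on a list uses repr() of each element, so this holds addresses, not text
as_string = str(paras)

# Extract the text of each element first, then apply the regex per entry
texts = [p.content for p in paras]
cleaned = [re.sub(r'\d+', '', t) for t in texts]

print(as_string)
print(cleaned)
```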
