python - Extracting headings from an MS Word document in Python

Question

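I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for win32, but I don't know which method to use; the Python for Windows help documentation doesn't seem to list the functions of the Word object. Take the following code as an example: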

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

How can I find out all the functions of the Word object? I didn't find anything useful in the help documentation.

score 4 · Accepted Answer

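The Word object model can be found here. Your doc object exposes the properties described there, and you can use them to do what you need (note that I haven't used this feature in Word, so my knowledge of the object model is limited). For example, if you want to read all the words in the document, you can do this: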

for word in doc.Words:
    print(word)

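You will get all the words. Each of these word items is a Range object representing one word (the Word object model has no separate Word object; see the reference here), so you can access its properties while iterating. In your case, this is how to get the style: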

for word in doc.Words:
    print(word.Style)

On a sample document with a single Heading 1 and some normal text, this prints:

Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
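The listing above repeats the style name once per word. Since a style in this sample applies to a whole paragraph, one possible shortcut (not part of the original answer) is to walk paragraphs instead of words. The sketch below is duck-typed: it assumes only that each paragraph exposes Style and Range.Text, as doc.Paragraphs items do in the Word object model. The names headings_from_paragraphs and fake_paragraph are hypothetical, and the stand-in objects exist only so the example runs without Word.

```python
from types import SimpleNamespace


def headings_from_paragraphs(paragraphs, prefix="Heading"):
    """Yield (style_name, text) for paragraphs whose style name starts with prefix."""
    for para in paragraphs:
        # As in the answer, str() is applied to the style to get its name.
        style_name = str(para.Style)
        if style_name.startswith(prefix):
            # Word terminates each paragraph with '\r'; strip it from the text.
            yield style_name, para.Range.Text.rstrip("\r")


def fake_paragraph(style, text):
    """Hypothetical stand-in for a Word Paragraph COM object."""
    return SimpleNamespace(Style=style, Range=SimpleNamespace(Text=text))


# Stand-in data shaped like the sample document (one Heading 1, then normal text).
sample = [
    fake_paragraph("Heading 1", "Here is some text\r"),
    fake_paragraph("Normal", "\r"),
    fake_paragraph("Normal", "No header\r"),
]

found = list(headings_from_paragraphs(sample))
print(found)
```

With a real document, the call would presumably be headings_from_paragraphs(doc.Paragraphs). On a non-English Word installation the style names may be localized, so the prefix may need to change.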

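To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to group on the str() of the style, because the instances returned by word.Style do not group correctly with other instances of the same style: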

from itertools import groupby
import win32com.client as win32

# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument

# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str(). 
# There was some other interesting behavior, but I have zero 
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(word) for word in grp_wrds))

This outputs:

Heading 1 Here is some text

Normal 
No header

If you replace the join with a list comprehension, you get the following (you can see the newline characters):

Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
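To see the grouping step in isolation, here is a minimal sketch that runs without Word. It assumes the words have already been reduced to plain (style name, text) pairs, which mirror the sample output above; the tuple layout and the whitespace cleanup are illustrative choices, not part of the original answer.

```python
from itertools import groupby

# (style name, word text) pairs mirroring the sample output above.
words = [
    ("Heading 1", "Here "), ("Heading 1", "is "), ("Heading 1", "some "),
    ("Heading 1", "text"), ("Heading 1", "\r"),
    ("Normal", "\r"), ("Normal", "No "), ("Normal", "header"),
    ("Normal", "\r"), ("Normal", "\r"),
]

# groupby merges only consecutive items with equal keys, so document order
# is preserved and no sorting is needed.
groups = [
    (style, "".join(text for _, text in run).replace("\r", " ").strip())
    for style, run in groupby(words, key=lambda pair: pair[0])
]

for style, text in groups:
    print(style, "|", text)
```

Because the key here is a plain string, equal style names compare equal, which is the same reason the answer groups on str(x.Style) rather than on the Style object.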

score 3 · Accepted Answer

Convert the Word file to docx and use the python-docx module:

from docx import Document

file = 'test.docx'
document = Document(file)

for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
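The snippet above matches only 'Heading 1'. A possible extension, sketched below, collects every heading level by matching style names of the form 'Heading N'. It relies only on the paragraph.style.name and paragraph.text attributes used above; extract_headings and stub are hypothetical names, and the stub objects stand in for python-docx paragraphs so the example runs without a .docx file.

```python
import re
from types import SimpleNamespace

# Assumes heading styles are named "Heading 1", "Heading 2", and so on.
HEADING_RE = re.compile(r"^Heading (\d+)$")


def extract_headings(paragraphs):
    """Return a list of (level, text) for paragraphs styled 'Heading N'."""
    result = []
    for paragraph in paragraphs:
        match = HEADING_RE.match(paragraph.style.name)
        if match:
            result.append((int(match.group(1)), paragraph.text))
    return result


def stub(style_name, text):
    """Hypothetical stand-in exposing the same attributes as a paragraph."""
    return SimpleNamespace(style=SimpleNamespace(name=style_name), text=text)


paragraphs = [
    stub("Heading 1", "Intro"),
    stub("Normal", "Body text"),
    stub("Heading 2", "Details"),
    stub("Title", "Not a numbered heading"),
]

outline = extract_headings(paragraphs)
for level, text in outline:
    # Indent by heading level to show the outline structure.
    print("  " * (level - 1) + text)
```

With python-docx installed, the same function could be called as extract_headings(Document('test.docx').paragraphs).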

score 2 · Accepted Answer

You can also use the Google Drive SDK to convert the Word document into something more useful, such as HTML, from which you can easily extract the headings.

https://developers.google.com/drive/manage-uploads
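Once the document is available as HTML, pulling out the headings needs only the standard library. The sketch below covers just that last step and assumes the converted HTML marks headings with h1 to h6 tags, which is not guaranteed by the answer; the HeadingCollector class and the sample HTML string are illustrative.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every h1..h6 element fed to the parser."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # collected (level, text) pairs
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a heading, including nested inline tags.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


# Illustrative input; real converted HTML will be more verbose.
html = "<html><body><h1>Here is some <span>text</span></h1><p>No header</p></body></html>"

collector = HeadingCollector()
collector.feed(html)
collector.close()
print(collector.headings)
```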
