python - Extracting text from slide titles with python-pptx

Question

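I am building a document retrieval engine in Python that returns documents ranked by their relevance to a user-submitted query. My document collection also includes PowerPoint files. For a PPT, I want the results page to show the titles of the first few slides so the user gets a clearer picture of its contents (somewhat like the snippets in Google search results).

So basically, I want to extract the text of the slide titles from a PPT file using Python. I am using the python-pptx package for this. Currently my implementation looks like this: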

from pptx import Presentation

prs = Presentation(filepath)  # load the presentation
slide_titles = []  # container for slide titles
PUNCT = """ !@#$%^&*)(_-+=}{][:;<,>.?"'/"""  # characters stripped from both ends
for slide in prs.slides:  # iterate over each slide
    title_shape = slide.shapes[0]  # treat the zeroth-indexed shape as the title
    if title_shape.has_text_frame:  # only shapes with a text frame carry text
        title = title_shape.text.strip(PUNCT) + '. '
        # skip titles that already exist in the slide_titles container
        if title not in slide_titles:
            slide_titles.append(title)

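But as you can see, I am assuming that the zeroth-indexed shape on each slide is the slide title, which is obviously not always the case. Any ideas on how to do this properly?

Thanks in advance.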

score 2 · Accepted Answer

from pptx import Presentation

local_pptxFileList = ["abc.pptx"]

for path in local_pptxFileList:
    ppt = Presentation(path)
    for slide in ppt.slides:
        for shape in slide.shapes:
            # print the text of every shape that has a text frame
            if shape.has_text_frame:
                print(shape.text)

score 2 · Accepted Answer

Slide.shapes (a SlideShapes object) has a .title property that returns the title shape when there is one (which there usually is), or None if the slide has no title.
http://python-pptx.readthedocs.io/en/latest/api/shapes.html#slideshapes-objects

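This is the preferred way to access the title shape.

Note that not all slides have a title shape, so you must test for a None result to avoid an error in that case.

Also note that users sometimes use a different shape for the title, such as a separate new text box they added. So you are not guaranteed to get the text that "appears" to be the title on the slide. You will, however, get the text that PowerPoint considers to be the title, for example the text it shows as that slide's title in the outline view.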

from pptx import Presentation

prs = Presentation(path)
for slide in prs.slides:
    title_shape = slide.shapes.title  # None when the slide has no title placeholder
    if title_shape is None:
        continue
    print(title_shape.text)
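
For the use case in the question (showing the first few distinct titles on a results page), the `slide.shapes.title` check could be wrapped up roughly as below. This is only a sketch: the function names `title_text`, `first_titles` and `titles_from_file`, the `limit` and `fallback` parameters, and the "topmost shape with text" fallback are my own assumptions, not part of python-pptx. The fallback is a guess for slides where the author used a plain text box instead of the title placeholder, and it can pick the wrong shape.

```python
def title_text(slide, fallback=True):
    """Return the slide's title text, or None if nothing usable is found."""
    title_shape = slide.shapes.title  # None when the slide has no title placeholder
    if title_shape is not None:
        # collapse line breaks and repeated whitespace into single spaces
        text = " ".join(title_shape.text.split())
        if text:
            return text
    if not fallback:
        return None
    # Heuristic (an assumption, not python-pptx behaviour): fall back to the
    # topmost shape on the slide that carries non-blank text.
    candidates = [s for s in slide.shapes if s.has_text_frame and s.text.strip()]
    if not candidates:
        return None
    # a shape without its own position reports top as None; treat that as 0
    topmost = min(candidates, key=lambda s: s.top if s.top is not None else 0)
    return " ".join(topmost.text.split())


def first_titles(slides, limit=3, fallback=True):
    """Return up to `limit` distinct titles, in slide order."""
    titles = []
    for slide in slides:
        if len(titles) >= limit:
            break
        text = title_text(slide, fallback)
        if text and text not in titles:  # skip blanks and duplicates
            titles.append(text)
    return titles


def titles_from_file(path, limit=3):
    """Hypothetical convenience wrapper: open a .pptx and return its first titles."""
    # imported here so the helpers above work on any slide-like objects
    from pptx import Presentation
    return first_titles(Presentation(path).slides, limit)
```

Calling `titles_from_file("abc.pptx")` would then give a short list of strings to display next to the search hit. Pass `fallback=False` to `first_titles` if you only want what PowerPoint itself treats as the title.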

score 0 · Accepted Answer

How to extract all text from the pptx files in a directory (from this blog):

from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
