
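I'm trying to loop through each line of a page with the PyMuPDF library to check the length of each sentence. If a line has fewer than 10 words, I want to add a period. The pseudocode would be: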

#loop through the lines of the PDF
#check number of words in line
#if line has less than 10 words 
#add period 

The actual code is as follows:

import fitz  # PyMuPDF

myfile = "my.pdf"
doc = fitz.open(myfile)
for page in doc:
    # Newer PyMuPDF releases name this method get_text().
    text = page.getText("text")
    print(text)

When I add another for loop, e.g. for line in page:

I get an error saying that the page is not iterable. Is there another way to do this?

Thanks


1 Answer


To iterate over the lines of a page, you can use getDisplayList:

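# Newer PyMuPDF releases use snake_case names: get_displaylist(), get_textpage().
page_display = page.getDisplayList()
dictionary_elements = page_display.getTextPage().extractDICT()
for block in dictionary_elements['blocks']:
    for line in block['lines']:
        # Each line is split into spans; join them to rebuild the line's text.
        line_text = ''
        for span in line['spans']:
            line_text += ' ' + span['text']
        print(line_text)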
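
To get from there to the goal in the question (add a period to lines with fewer than 10 words), something like the following might work. It is only a sketch: the helper names iter_line_texts and add_period_if_short are made up for illustration and are not part of PyMuPDF, and the sample dictionary is hand-built to mimic the blocks / lines / spans layout used above, so the snippet runs without a PDF.

```python
def iter_line_texts(page_dict):
    """Yield the text of each line in a blocks/lines/spans dictionary."""
    for block in page_dict.get("blocks", []):
        # Blocks without a "lines" key (e.g. image blocks) are skipped.
        for line in block.get("lines", []):
            text = " ".join(span["text"] for span in line["spans"])
            # Collapse repeated whitespace left over from joining spans.
            yield " ".join(text.split())


def add_period_if_short(line_text, min_words=10):
    """Append a period when the line has fewer than min_words words."""
    stripped = line_text.rstrip()
    if not stripped:
        return stripped  # leave empty lines alone
    if len(stripped.split()) < min_words and not stripped.endswith("."):
        return stripped + "."
    return stripped


# Hand-built sample (an assumption standing in for real extractDICT() output).
sample = {
    "blocks": [
        {"lines": [
            {"spans": [{"text": "A short"}, {"text": "line"}]},
            {"spans": [{"text": "This line has more than ten words in it, "
                                "so it stays unchanged"}]},
        ]},
        {"type": 1},  # image-style block with no "lines" key
    ]
}

for text in iter_line_texts(sample):
    print(add_period_if_short(text))
```

With a real document, you would pass the dictionary returned by extractDICT() in place of sample. Note that this only changes the extracted strings; it does not write anything back into the PDF.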
Answered 2021-03-02T16:03:39.633