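Goal: if a line of a PDF contains a substring, copy the whole sentence (which may span multiple lines).
I am able to print() the line in which the phrase appears.
Now, once I find this line, I want to iterate backwards until I find a sentence terminator (. ! ?) from the previous sentence, and then iterate forwards again until the next sentence terminator.
That way I can print() the whole sentence that the phrase belongs to.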
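To make the intended behaviour concrete, here is a minimal sketch of the backward-then-forward scan. It is not my notebook code: it assumes the page's lines can simply be joined with spaces, and it scans characters rather than whole lines. The names sentence_containing, TERMINATORS and sample_lines are made up for illustration, and the sample lines are hypothetical, modelled on the expected sentence.

```python
TERMINATORS = ".!?"


def sentence_containing(lines, phrase):
    """Return the sentence around the first occurrence of phrase, or None."""
    # Join the extracted lines so a sentence split across lines becomes contiguous.
    text = " ".join(line.strip() for line in lines)
    pos = text.find(phrase)
    if pos == -1:
        return None

    # Step backwards until the character just before `start` ends the previous sentence.
    start = pos
    while start > 0 and text[start - 1] not in TERMINATORS:
        start -= 1

    # Step forwards from the end of the phrase to the next terminator.
    end = pos + len(phrase)
    while end < len(text) and text[end] not in TERMINATORS:
        end += 1

    # Include the terminator itself, then drop surrounding whitespace.
    return text[start:end + 1].strip()


# Hypothetical lines, modelled on the expected sentence in this post.
sample_lines = [
    "linkage to the relevant SDGs.",
    "GPIC is a Responsible Care Company certified for RC 14001",
    "since July 2010. We have long realized",
]
print(sentence_containing(sample_lines, "Responsible Care Company"))
```

This treats every . ! or ? as a sentence boundary, so abbreviations and decimal numbers would cut the sentence short.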
Jupyter Notebook:
# pip install PyPDF2
# pip install pdfplumber
# ---
# import re
import glob
import PyPDF2
import pdfplumber
# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.pattern('^[A-Z][^?!.]*[?.!]$')
def scrape_sentence(sentence, lines, index):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1) # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1) # following line
    return sentence
# ---
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i) # !
                print(sentence) # !
            i += 1
Output:
connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there
Phrase:
Responsible Care Company
Sentence (spanning multiple lines):
"GPIC is a Responsible Care Company certified for RC 14001
since July 2010."
I have been working on iterating "backwards" based on this solution. I did try a for-loop, but it won't let you step back in the iteration.
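For reference, a for-loop can walk backwards if it runs over indices instead of over the lines themselves. The sketch below is only an illustration of that idea, not a tested solution for the PDF above: sentence_line_span and has_terminator are hypothetical helpers, the sample lines are made up, and it assumes the phrase sits entirely on one line.

```python
TERMINATORS = ".!?"


def has_terminator(line):
    """True if the line contains at least one sentence terminator."""
    return any(ch in line for ch in TERMINATORS)


def sentence_line_span(lines, hit):
    """Return (first, last) line indices that cover the sentence around lines[hit]."""
    first = 0
    # range() with a step of -1 walks backwards from the line before the hit.
    for j in range(hit - 1, -1, -1):
        if has_terminator(lines[j]):
            first = j
            break
    last = len(lines) - 1
    # Walk forwards from the hit line to the first line holding a terminator.
    for j in range(hit, len(lines)):
        if has_terminator(lines[j]):
            last = j
            break
    return first, last


# Hypothetical lines, modelled on the expected sentence in this post.
sample_lines = [
    "linkage to the relevant SDGs.",
    "GPIC is a Responsible Care Company certified for RC 14001",
    "since July 2010. We have long realized",
]
phrase = "Responsible Care Company"
hit = next(i for i, line in enumerate(sample_lines) if phrase in line)
print(sentence_line_span(sample_lines, hit))
```

The returned span still includes the tail of the previous sentence and the head of the next one, so the joined lines would need trimming at the terminators afterwards.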
Please let me know if there is anything else I can add to the post.