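My attempted solution is at the bottom of the post.
I have nearly working code that extracts the sentences containing a phrase, even when a sentence spans multiple lines.
However, some pages have columns, so the output for those pages is incorrect: separate pieces of text are wrongly merged into one bad sentence.
This problem has been addressed in the following posts: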
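Question:
How do I write an "if condition" that checks whether a page has columns?
- A page may have no columns.
- A page may have more than 2 columns.
- A page may also have headers and footers (these can be omitted).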
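To make the condition I am after more concrete, here is a rough sketch of the kind of check I imagine, not a working solution. It assumes the word boxes come as dicts with `x0` and `x1` keys (the shape `page.extract_words()` returns in pdfplumber). The function name `find_column_gutters` and the `min_gap` threshold are hypothetical and would need tuning per document. It looks for vertical strips that no word crosses on any line of the page.

```python
def find_column_gutters(words, min_gap=15.0):
    """Return the x positions of vertical gaps that no word crosses.

    `words` is a list of dicts with 'x0' and 'x1' keys.
    `min_gap` is an assumed threshold in PDF points.
    """
    # Horizontal extent of every word, sorted by left edge
    spans = sorted((float(w['x0']), float(w['x1'])) for w in words)
    gutters = []
    if not spans:
        return gutters
    # Sweep left to right, tracking the right edge of the covered region
    covered_end = spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - covered_end >= min_gap:
            # Nothing on the page covers this strip: treat its middle as a gutter
            gutters.append((covered_end + x0) / 2)
        covered_end = max(covered_end, x1)
    return gutters
```

An empty result would mean "no columns", and `len(gutters) + 1` would be the column count. A full-width header or footer crosses every gutter and would hide the columns, so those would have to be cropped away before calling `page.extract_words()`.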
.pdf
Example of a dynamic text layout: PDF (page 2).
Jupyter Notebook:
# pip install PyPDF2
# pip install pdfplumber
# ---
import pdfplumber
# ---
def scrape_sentence(phrase, lines, index):
    # -- Gather the sentence in which 'phrase' occurs --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))
    terminators = ('.', '!', '?', ' • ')
    # Previous lines: prepend until a line contains a sentence terminator
    pre_i = index
    while True:
        pre_i -= 1
        if pre_i < 0:  # ran past the first line of the page
            break
        sentence = lines[pre_i] + sentence
        if any(t in lines[pre_i] for t in terminators):
            break
    print("\n", sentence)
    # Following lines: append until a line contains a sentence terminator
    post_i = index
    while True:
        post_i += 1
        if post_i >= len(lines):  # ran past the last line of the page
            break
        sentence = sentence + lines[post_i]
        if any(t in lines[post_i] for t in terminators):
            break
    print("\n", sentence)
    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurrence
    print(sentence)
    return sentence
# ---
phrase = 'Gulf Petrochemical Industries Company'
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text is None:  # page has no extractable text
            continue
        lines = text.split('\n')
        for i, line in enumerate(lines):
            if phrase in line:
                sentence = scrape_sentence(phrase, lines, i)
Sample incorrect output:
-- sentence -- being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of
len(lines) 47
Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of
Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report, Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is GPIC is a Responsible Care Company certified for RC 14001 since July 2010. 
We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[' being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption']
being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption
...
Attempted minimal solution: this splits the text into 2 columns, regardless of whether the page actually has 2.
# pip install PyPDF2
# pip install pdfplumber
# ---
import pdfplumber
# ---
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        # float() accepts page dimensions whether they are Decimal or plain numbers
        width, height = float(page.width), float(page.height)
        # Left half, stopping at 90% of the page height
        left = page.crop((0, 0, 0.5 * width, 0.9 * height))
        # Right half, full page height
        right = page.crop((0.5 * width, 0, width, height))
        l_text = left.extract_text()
        r_text = right.extract_text()
        print("\n -- l_text --", l_text)
        print("\n -- r_text --", r_text)
        text = str(l_text) + " " + str(r_text)
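What I would like instead of the hard-coded halves is something along these lines. This is only a sketch: it assumes the x positions of the gaps between columns are already known (finding them is the open question), and the names `column_bboxes`, `header_frac` and `footer_frac` are hypothetical. It builds one crop box per column in the `(x0, top, x1, bottom)` order that `page.crop()` takes, optionally skipping a header and footer band.

```python
def column_bboxes(page_width, page_height, gutters,
                  header_frac=0.0, footer_frac=0.0):
    """Build one (x0, top, x1, bottom) box per column.

    `gutters` holds the x positions separating the columns; an empty
    list yields a single full-width box. `header_frac` and `footer_frac`
    are assumed fractions of the page height to leave out.
    """
    width, height = float(page_width), float(page_height)
    top = header_frac * height
    bottom = (1.0 - footer_frac) * height
    # Column edges: left page edge, each gutter, right page edge
    edges = [0.0] + [float(g) for g in sorted(gutters)] + [width]
    return [(x0, top, x1, bottom) for x0, x1 in zip(edges, edges[1:])]
```

Each box could then be passed to `page.crop(box).extract_text()`, and the resulting texts joined in left-to-right order, so a page without columns and a page with three columns would go through the same code path.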
Please let me know if anything else needs clarification.