python - pdf管道工| 从动态列布局中提取文本

Question

在帖子底部尝试解决方案。

我有近乎工作的代码，可以跨多行提取包含短语的句子。

但是，有些页面有列。所以各自的输出不正确；其中单独的文本被错误地合并为一个坏句子。

此问题已在以下帖子中得到解决：

问题：

我如何“如果条件”是否有列？

页面可能没有列，
页面可能有超过 2 列。
页面也可能有页眉和页脚（可以省略）。

.pdf动态文本布局示例： PDF（第 2 页）。

Jupyter 笔记本：

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber

# ---

def scrape_sentence(phrase, lines, index):
    # -- Gather sentence 'phrase' occurs in --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))
    
    # Previous lines
    pre_i, flag = index, 0
    while flag == 0:
        pre_i -= 1
        if pre_i <= 0:
            break
            
        sentence = lines[pre_i] + sentence
        
        if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or '  •  ' in lines[pre_i]:
            flag == 1
    
    print("\n", sentence)
    
    # Following lines
    post_i, flag = index, 0
    while flag == 0:
        post_i += 1
        if post_i >= len(lines):
            break
            
        sentence = sentence + lines[post_i] 
        
        if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or '  •  ' in lines[pre_i]:
            flag == 1 
    
    print("\n", sentence)
    
    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurance
    print(sentence)
    
    return sentence

# ---

phrase = 'Gulf Petrochemical Industries Company'

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text == None:
            continue
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if phrase in lines[i]:
                sentence = scrape_sentence(phrase, lines, i)
            i += 1

示例错误输出：

-- sentence -- being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of 
len(lines) 47

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of 

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report,  Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf  Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is  GPIC is a Responsible Care Company certified for RC 14001 since July 2010. We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain  and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum  always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[' being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption']
being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption

...

尝试的最小解决方案： 这会将文本分成 2 列；不管有没有2。

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber
import decimal

# ---

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        left = page.crop((0, 0, decimal.Decimal(0.5) * page.width, decimal.Decimal(0.9) * page.height))
        right = page.crop((decimal.Decimal(0.5) * page.width, 0, page.width, page.height))
        
        l_text = left.extract_text()
        r_text = right.extract_text()
        print("\n -- l_text --", l_text)
        print("\n -- r_text --", r_text)
        text = str(l_text) + " " + str(r_text)

请让我知道是否还有其他需要澄清的地方。

score 1 · Accepted Answer

此答案使您能够按预期顺序抓取文本。

迈向数据科学文章Python 中的 PDF 文本提取：

与 PyPDF2 相比，PDFMiner 的范围要有限得多，它只专注于从 pdf 文件的源信息中提取文本。

from io import StringIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_string(file_path):
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    return(output_string.getvalue())

file_path = ''  # !
text = convert_pdf_to_string(file_path)
print(text)

之后可以进行清洁。

python - pdf管道工| 从动态列布局中提取文本

1 回答 1

Related

Reference