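I am trying to get cropped boxes from a PDF that contains text. I need this because it would be very useful for collecting training data for one of my models. Here is a sample PDF: https://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf. For example, I would like to get the first text box as a cropped image (JPG or any other format).
What I have tried so far is the code below, but I am open to solving this another way, so if you have a different approach, that would be great. The first part of my code is a modified version of a solution (the first answer) I found in "How to extract text and text coordinates from a PDF file?". The second part is what I tried myself, and it has not worked so far. I also tried reading the image with pymupdf, but that changed nothing (I won't post that attempt because the post is long enough already).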
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
# pdf path
pdf_path ="example - Git From Bottom Up.pdf"
# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# this is where I store the data
boxes_data = []
page_sizes = []
def parse_obj(lt_objs, verbose=0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose > 0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX": round(obj.bbox[0]), "startY": round(obj.bbox[1]), "endX": round(obj.bbox[2]), "endY": round(obj.bbox[3]), "text": obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)
# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height": mediabox[-1], "width": mediabox[-2]}
    page_sizes.append(mediabox_data)
The second part of the code, which is meant to get the cropped box as an image:
# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()
#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
# print the text; it does not match the cropped image, which means there is an error
print(text)
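As you can see, the box coordinates do not match the image. Perhaps the problem is that pdf2image does something to the image size or similar, but I specified the image size correctly, so I don't know. Any solution or suggestion is very welcome. Thanks in advance.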
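One possible cause worth checking, offered as a sketch and not a confirmed diagnosis: pdfminer reports each `bbox` as `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page, whereas an image array counts rows from the top-left. Slicing the array with the raw values would then select the wrong rows. pdf2image's `size` argument also follows the Pillow `(width, height)` order, so passing `(height, width)` may distort the page. The helper below is hypothetical (the name `pdf_bbox_to_pixel_box` and its parameters are not part of any library) and assumes the page's mediabox starts at `(0, 0)`.

```python
from typing import Tuple


def pdf_bbox_to_pixel_box(
    bbox: Tuple[float, float, float, float],
    page_width: float,
    page_height: float,
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Map a PDF-space bbox (x0, y0, x1, y1), origin bottom-left, to a
    pixel-space box (left, top, right, bottom), origin top-left.

    Assumption: the mediabox origin is (0, 0), so page_width and
    page_height are simply mediabox[2] and mediabox[3].
    """
    x0, y0, x1, y1 = bbox
    # Scale factors from PDF points to image pixels.
    scale_x = image_width / page_width
    scale_y = image_height / page_height
    left = round(x0 * scale_x)
    right = round(x1 * scale_x)
    # Flip the vertical axis: the upper edge of the box in PDF space (y1)
    # becomes the smaller row index in the image.
    top = round((page_height - y1) * scale_y)
    bottom = round((page_height - y0) * scale_y)
    return left, top, right, bottom


# Made-up numbers for illustration: a 612 x 792 pt page rendered at 2x.
example_box = pdf_bbox_to_pixel_box((72, 700, 300, 720), 612, 792, 1224, 1584)
print(example_box)
```

With a box computed this way, the crop could be taken either with Pillow, as `firstpage_image.crop((left, top, right, bottom))`, or from the array, as `image_array[top:bottom, left:right, :]`. The image width and height passed in should be the actual size of the rendered page (for a Pillow image, `firstpage_image.size`).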