python - 如何从pdf中提取文本框并将其转换为图像

Question

我正在尝试从包含文本的 pdf 中获取裁剪框，这对于为我的一个模型收集训练数据非常有用，这就是我需要它的原因。这是一个pdf样本： https ://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf ；例如，我想将第一个 boxtext 作为图像（jpg 或其他）获取，如下所示：

到目前为止我尝试的是以下代码，但我愿意以其他方式解决这个问题，所以如果你有其他方式，那就太好了。此代码是我在此处找到的解决方案（第一个答案）的修改版本如何从 PDF 文件中提取文本和文本坐标？; （只有我的代码的第一部分）；第二部分是我尝试过的，但到目前为止还没有工作，我也尝试用 pymupdf 读取图像，但根本没有改变任何东西（我不会发布这个尝试，因为帖子足够大）。

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io

# pdf path 
pdf_path ="example - Git From Bottom Up.pdf"

# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),"endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),"text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

代码的第二部分，获取图像格式的裁剪框。

# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()

#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
#print this does not match with the text, means there's an error
print(text)

如您所见，框的坐标与图像不匹配，也许问题是因为 pdf2image 对图像大小或类似的东西做了一些技巧，但我正确指定了图像的大小，所以我不知道。任何解决方案/建议都非常受欢迎。提前致谢。

score 2 · Accepted Answer

我已经检查了代码第一部分中前两个框的坐标，它们或多或少适合页面上的文本：

但是您是否知道 PDF 中的零点位于左下角？也许这是问题的原因。

不幸的是，我没有设法测试代码的第二部分。pdf2image给我一些错误。

但我几乎可以肯定PIL.Image左上角的零点不像PDF。您可以使用公式将 pdf_Y 转换为 pil_Y：

pil_Y = page_height - pdf_Y

在您的情况下，页面高度为 792 pt。您也可以使用脚本获取页面高度。

坐标

更新

尽管如此，在我花了几个小时安装所有模块（这是最难的部分！）之后，我让你的脚本在某种程度上可以工作。

基本上我是对的：坐标是倒置的y => h - y，因为 PIL 和 PDF 的零点位置不同。

还有一件事。PIL 制作分辨率为 200 dpi 的图像（可能可以在某处更改）。PDF 以点为单位测量所有内容（1 pt = 1/72 dpi）。因此，如果您想在 PIL 中使用 PDF 大小，则需要以这种方式更改 PDF 大小：x => x * 200 / 72.

这是固定代码：

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
from pathlib import Path # it's just my favorite way to handle files

# pdf path
# pdf_path ="test.pdf"
pdf_path = Path.cwd()/"Git From Bottom Up.pdf"


# PART 1: GET LTBOXES COORDINATES IN THE IMAGE ----------------------
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {
                "startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),
                "endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),
                "text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

# PART 2: NOW GET PAGE TO IMAGE -------------------------------------
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path)[0] # without 'size=...'
#show first page with the right size (at least the one that pdfminer says)
# firstpage_image.show()
firstpage_image.save("firstpage.png")

# the magic numbers
dpi = 200/72
vertical_shift = 5 # I don't know, but it's need to shift a bit
page_height = int(firstpage_size["height"] * dpi)

# loop through boxes (we'll process only first page for now)
for i, _ in enumerate(boxes_data):

    #first box data
    startX, startY, endX, endY, text = boxes_data[i].values()

    # correction PDF --> PIL
    startY = page_height - int(startY * dpi) - vertical_shift
    endY   = page_height - int(endY   * dpi) - vertical_shift
    startX = int(startX * dpi)
    endX   = int(endX   * dpi)
    startY, endY = endY, startY 

    # turn image to array
    image_array = np.array(firstpage_image)
    # get cropped box
    box = image_array[startY:endY,startX:endX,:]
    convert2pil_image = PIL.Image.fromarray(box)
    #show cropped box image
    # convert2pil_image.show()
    png = "crop_" + str(i) + ".png"
    convert2pil_image.save(png)
    #print this does not match with the text, means there's an error
    print(text)

代码几乎和你的一样。我只是添加了坐标校正并保存 PNG 文件而不是显示它们。

输出：

Gi from the bottom up

Wed,  Dec 9

by John Wiegley

In my pursuit to understand Git, it’s been helpful for me to understand it from the bottom
up — rather than look at it only in terms of its high-level commands. And since Git is so beauti-
fully simple when viewed this way, I thought others might be interested to read what I’ve found,
and perhaps avoid the pain I went through nding it.

I used Git version 1.5.4.5 for each of the examples found in this document.

1.  License
2.  Introduction
3.  Repository: Directory content tracking

Introducing the blob
Blobs are stored in trees
How trees are made
e beauty of commits
A commit by any other name…
Branching and the power of rebase
4.  e Index: Meet the middle man

Taking the index farther
5.  To reset, or not to reset

Doing a mixed reset
Doing a so reset
Doing a hard reset

6.  Last links in the chain: Stashing and the reog
7.  Conclusion
8.  Further reading

2
3
5
6
7
8
10
12
15
20
22
24
24
24
25
27
30
31

当然，固定代码更像是原型。不出售。)

python - 如何从pdf中提取文本框并将其转换为图像

1 回答 1

更新

Related

Reference