python - List index out of range when using pdfplumber

Question

Hello, I am using pdfplumber to extract text from PDFs and write it to a text file, but I get an index out of range error.

import glob
import pdfplumber



for filename in glob.glob('*.pdf'):
    pdf = pdfplumber.open(filename)
    OutputFile = filename.replace('.pdf','.txt')
    fx2=open(OutputFile, "a+")
    for i in range(0,10000,1):
        try:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)
            fx2.write(text)
        except Exception as e:
            print(e)
    fx2.close()
    pdf.close()

score 0 · Accepted Answer

Try this code:

import pdfplumber

filename = 'path/to/your/PDF'
# Fractions (0 to 1) of the page width/height; replace the placeholders with your own values
crop_coords = [x0, top, x1, bottom]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        my_width = float(page.width)
        my_height = float(page.height)
        # Convert the fractions to absolute coordinates and crop the page
        my_bbox = (crop_coords[0]*my_width, crop_coords[1]*my_height, crop_coords[2]*my_width, crop_coords[3]*my_height)
        page_crop = page.crop(bbox=my_bbox)
        # extract_text() can return None in some pdfplumber versions when a page has no text
        text = text + (page_crop.extract_text() or '')
        pages.append(page_crop)

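crop_coords is the list used to crop each page. Its coordinates are fractions of the page size, measured from the left and top edges: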

x0 = Fraction of the page width from the left edge of the page to the left vertical cut.
top = Fraction of the page height from the top edge of the page to the upper horizontal cut.
x1 = Fraction of the page width from the left edge of the page to the right vertical cut.
bottom = Fraction of the page height from the top edge of the page to the lower horizontal cut.
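
The conversion from fractions to absolute coordinates can be tried on its own. The following is a minimal sketch, not part of pdfplumber: the helper name `fractional_bbox` and the 612 x 792 point page size are assumptions made for illustration.

```python
def fractional_bbox(crop_coords, page_width, page_height):
    """Convert [x0, top, x1, bottom] fractions (0 to 1) to absolute coordinates."""
    x0, top, x1, bottom = crop_coords
    if not all(0.0 <= value <= 1.0 for value in crop_coords):
        raise ValueError("each crop fraction must lie between 0 and 1")
    if x0 >= x1 or top >= bottom:
        raise ValueError("expected x0 < x1 and top < bottom")
    width = float(page_width)
    height = float(page_height)
    return (x0 * width, top * height, x1 * width, bottom * height)


# Keep the middle half of the width and the lower half of the height
# of a hypothetical 612 x 792 point page.
bbox = fractional_bbox([0.25, 0.5, 0.75, 1.0], 612, 792)
print(bbox)
```

Validating the fractions up front catches a swapped or out-of-range value before the tuple is passed to `page.crop`.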

If you do not want to crop the pages, just use the following code:

import pdfplumber

filename = 'path/to/your/PDF'
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        # extract_text() can return None in some pdfplumber versions when a page has no text
        text = text + (page.extract_text() or '')
        pages.append(page)

In both cases, the result is:

text: a string containing all of the PDF's text
pages: a list in which each element is a page object, whose attributes you can access
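
Applied to the original task (one text file per PDF in the current folder), the same idea might look like the sketch below. Looping over `pdf.pages` directly, rather than over a fixed `range`, means no index can run past the last page. The helper names `collect_text`, `output_path`, and `convert_all` are made up for this example, and `pdfplumber` is imported inside `convert_all` only so that the two small helpers can be used without it.

```python
import glob
from pathlib import Path


def collect_text(pages):
    """Join the text of every page; pages without text contribute an empty string."""
    return "\n".join(page.extract_text() or "" for page in pages)


def output_path(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def convert_all(pattern="*.pdf"):
    """Write one text file per PDF matching the glob pattern."""
    import pdfplumber  # third-party dependency, needed only for the conversion itself

    for pdf_path in glob.glob(pattern):
        with pdfplumber.open(pdf_path) as pdf:
            text = collect_text(pdf.pages)
        output_path(pdf_path).write_text(text, encoding="utf-8")
```

Calling `convert_all()` processes every `*.pdf` in the working directory. Unlike the `"a+"` mode in the question, `write_text` overwrites an existing output file instead of appending to it.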
