Try this code:
import pdfplumber

filename = 'path/to/your/PDF'
# Fractions of the page size (0 to 1); replace the placeholders with your values
crop_coords = [x0, top, x1, bottom]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        my_width = float(page.width)
        my_height = float(page.height)
        # Scale the fractional coordinates to this page's size
        my_bbox = (crop_coords[0] * my_width, crop_coords[1] * my_height,
                   crop_coords[2] * my_width, crop_coords[3] * my_height)
        # Crop the page
        page_crop = page.crop(my_bbox)
        # extract_text() can return None in some pdfplumber versions, so fall back to ''
        text += page_crop.extract_text() or ''
        pages.append(page_crop)
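crop_coords is the list used to crop each page. Its values are fractions of the page size (between 0 and 1), because the code multiplies them by the page width and height. The coordinates are: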
x0 = fraction of the page width from the left side of the page to the left vertical cut.
top = fraction of the page height from the top of the page to the upper horizontal cut.
x1 = fraction of the page width from the left side of the page to the right vertical cut.
bottom = fraction of the page height from the top of the page to the lower horizontal cut.
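To make the scaling concrete, here is a minimal sketch of how fractional coordinates turn into the absolute bounding box passed to page.crop. The helper name fractional_bbox and the 600 x 800 page size are assumptions for illustration only; neither is part of pdfplumber.

```python
def fractional_bbox(crop_coords, page_width, page_height):
    """Convert [x0, top, x1, bottom] fractions (0 to 1) into absolute units.

    Hypothetical helper: it only mirrors the multiplication done in the
    loop above and adds a basic sanity check on the fractions.
    """
    x0, top, x1, bottom = crop_coords
    if not (0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1):
        raise ValueError("expected 0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1")
    width = float(page_width)
    height = float(page_height)
    return (x0 * width, top * height, x1 * width, bottom * height)


# Keep the middle half of the width and the lower half of the height
# of an assumed 600 x 800 page.
bbox = fractional_bbox([0.25, 0.5, 0.75, 1.0], 600, 800)
print(bbox)  # (150.0, 400.0, 450.0, 800.0)
```

Note that x1 and bottom are measured from the left and top edges of the page, just like x0 and top, so x1 must be larger than x0 and bottom larger than top.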
If you don't want to crop the pages, just use the following code:
import pdfplumber

filename = 'path/to/your/PDF'
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        # extract_text() can return None in some pdfplumber versions, so fall back to ''
        text += page.extract_text() or ''
        pages.append(page)
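In both cases, the result is:
text: a string containing all of the PDF's text.
pages: a list in which each element is a page object. You can access its attributes; see the pdfplumber documentation for the full list.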