python - Extracting the text inside a rectangle from a PDF - Python

Question

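I need to extract the text inside rectangles in a PDF. I tried several approaches but could not get the specific text. For example, I tested the PyMuPDF, pdfplumber, tabula, camelot, and pdftables packages. With PyMuPDF, the approach I tried requires supplying the starting and ending words to extract the text. As far as I can tell, the remaining packages only extract line and curve information, not text.

I want to get the text from a rectangle in the PDF without supplying any starting or ending text.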

https://drive.google.com/file/d/1wCvik7VbEvDwbT-mapgXc8fwlq7Ao3BP/view?usp=sharing

score 0 · Accepted Answer

You can use the code below, which reads every page of the PDF and returns all of its text as a single string.

import PyPDF2


def convert_pdf_to_text(document):
    # PyPDF2 3.0 removed PdfFileReader, getNumPages, getPage and extractText;
    # PdfReader, .pages and extract_text() are their replacements.
    reader = PyPDF2.PdfReader(document, strict=False)

    all_text = ""
    for page in reader.pages:
        # Append the text of each page in document order
        all_text += page.extract_text()
    # Drop line breaks so the result is one continuous string
    return all_text.replace("\n", "")


convert_pdf_to_text("pdf_test.pdf")

Output

'A Simple PDF File  This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...  Details  State: State_name     City: City_name    Country: Country_name     Rig No: 4455555  Source Id: k4-3k44 '
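
The code above returns the text of the whole document, not only what sits inside a rectangle. One possible way to narrow it down is to keep only the words whose bounding boxes fall inside the rectangle. The sketch below is a library-independent illustration: the function name `text_in_rect`, the sample words, and all coordinates are hypothetical, and the word dictionaries are assumed to carry `text`, `x0`, `top`, `x1`, and `bottom` keys (the shape pdfplumber uses for its word and rectangle objects, as far as I know).

```python
def text_in_rect(words, rect):
    """Return the text of the words whose boxes lie fully inside rect.

    words: iterable of dicts with keys "text", "x0", "top", "x1", "bottom"
    rect:  (x0, top, x1, bottom), in the same coordinate system as the words
    """
    rx0, rtop, rx1, rbottom = rect
    inside = [
        w for w in words
        if w["x0"] >= rx0 and w["x1"] <= rx1
        and w["top"] >= rtop and w["bottom"] <= rbottom
    ]
    # Rough reading order: top to bottom, then left to right
    inside.sort(key=lambda w: (round(w["top"]), w["x0"]))
    return " ".join(w["text"] for w in inside)


# Hypothetical word boxes; real ones would come from a PDF library
sample_words = [
    {"text": "State:", "x0": 60, "top": 100, "x1": 90, "bottom": 112},
    {"text": "State_name", "x0": 95, "top": 100, "x1": 150, "bottom": 112},
    {"text": "Boring,", "x0": 60, "top": 300, "x1": 100, "bottom": 312},
]
sample_rect = (50, 90, 200, 120)  # hypothetical rectangle around the first row

print(text_in_rect(sample_words, sample_rect))
```

With pdfplumber, the words would presumably come from `page.extract_words()` and the rectangles from `page.rects`; PyMuPDF also accepts a `clip` rectangle in `page.get_text()`, which may avoid the need for start and end words. Neither was tested against the linked file, so treat both as starting points.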
