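I am trying to extract text from a PDF URL. If I download the PDF first, I can easily extract the text with slate. However, when I load the PDF into memory with io and try to extract the text, the returned output is empty. My code is below.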

import io

import PyPDF2
import requests

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'

response = requests.get(url)
f = io.BytesIO(response.content)

with f as data:
    read_pdf = PyPDF2.PdfFileReader(data)
    # Page indices are zero-based, so getPage(1) is the second page.
    page = read_pdf.getPage(1)
    print(page.extractText())

I have tried many other functions, but none of them worked. Am I doing something wrong?
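
One check that may help narrow this down is confirming that the downloaded bytes are really a PDF and not, say, an HTML error page. The sketch below is only a heuristic: `looks_like_pdf` and `describe_payload` are hypothetical helper names, and scanning the first 1024 bytes for the `%PDF-` header is an assumption chosen to be lenient, not a full validation.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Heuristic: a PDF file carries a '%PDF-' header at or near its start."""
    # Lenient on purpose: scan the first 1024 bytes rather than only offset 0,
    # since some files have a few stray bytes before the header.
    return b"%PDF-" in data[:1024]


def describe_payload(data: bytes) -> str:
    """Summarize downloaded bytes so an HTML error page is easy to spot."""
    kind = "PDF" if looks_like_pdf(data) else "not a PDF"
    return f"{len(data)} bytes, {kind}, starts with {data[:8]!r}"
```

Calling `describe_payload(response.content)` right after `requests.get(url)` would show whether the problem is in the download or in the text extraction step.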

1 Answer

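It gave me blank output too, and I am not sure why. Have you tried pdfminer3? It gave me the correct text output. The following code produced the correct output for this file.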

import io

import requests
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'

response = requests.get(url)
f = io.BytesIO(response.content)

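with f as fh:
    # Run every page through the interpreter; the converter writes
    # the extracted text into fake_file_handle.
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    # Read the accumulated text before the handles are closed.
    text = fake_file_handle.getvalue()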

# close open handles
converter.close()
fake_file_handle.close()

print(text)

Also have a look at this post: How to use PDFminer.six with python 3?
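
If you go with pdfminer.six, its high-level `extract_text` function accepts a file-like object, so the same in-memory approach can be much shorter. This is a minimal sketch, not something I have run against this particular file: `pdf_bytes_to_text` is a hypothetical helper name, and the `extractor` parameter is an assumption added only so the parser can be swapped out or stubbed.

```python
import io
from typing import BinaryIO, Callable, Optional


def pdf_bytes_to_text(
    data: bytes,
    extractor: Optional[Callable[[BinaryIO], str]] = None,
) -> str:
    """Extract text from PDF bytes that are already held in memory."""
    if extractor is None:
        # Imported lazily so this module still loads where pdfminer.six
        # is not installed (pip install pdfminer.six).
        from pdfminer.high_level import extract_text
        extractor = extract_text
    # Wrap the bytes in a seekable buffer, which is what PDF parsers expect.
    with io.BytesIO(data) as buffer:
        return extractor(buffer)
```

With the response from the code above, usage would be `print(pdf_bytes_to_text(response.content))`.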

Answered 2021-02-27T21:22:23.370