python - Parsing a PDF from the web in Python 3

Question

I'm trying to fetch a PDF from a web page, parse it, and print the result to the screen using PyPDF2. I got it working with the following code:

import requests
import PyPDF2

# buildurl, jornal, date and page come from the rest of the script (not shown).
with open("foo.pdf", "wb") as f:
    f.write(requests.get(buildurl(jornal, date, page)).content)
with open("foo.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page_obj = pdf_reader.getPage(0)
    print(page_obj.extractText())

Writing a file just so I can read it back sounded wasteful, though, so I figured I'd cut out the middleman with this:

pdf_reader = PyPDF2.PdfFileReader(requests.get(buildurl(jornal, date, page)).content)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())

However, this gives me AttributeError: 'bytes' object has no attribute 'seek'. How can I feed the PDF from requests directly into PyPDF2?

score 8 · Accepted Answer

You have to convert the returned content into a file-like object using BytesIO:

import io

pdf_content = io.BytesIO(requests.get(buildurl(jornal, date, page)).content)
pdf_reader = PyPDF2.PdfFileReader(pdf_content)
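
To see why the wrapper helps, here is a minimal standard-library sketch. It assumes only that the reader needs a seekable binary stream, which is what the error message suggests. The name fake_pdf_bytes is a hypothetical stand-in for requests.get(...).content, so nothing here touches the network or PyPDF2.

```python
import io

# Hypothetical stand-in for requests.get(...).content, which is a bytes object.
fake_pdf_bytes = b"%PDF-1.4 example payload"

# Plain bytes has no file API; this is what the AttributeError complains about.
bytes_has_seek = hasattr(fake_pdf_bytes, "seek")

# BytesIO wraps the same data in an in-memory, seekable binary stream.
stream = io.BytesIO(fake_pdf_bytes)
stream_has_seek = hasattr(stream, "seek")

# A file-like consumer can jump to the end and back again.
end_position = stream.seek(0, io.SEEK_END)
stream.seek(0)
header = stream.read(5)

print(bytes_has_seek, stream_has_seek, end_position, header)
```

The data stays in memory the whole time, so no temporary file is created.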

score 3 · Accepted Answer

Use io to fake a file (Python 3):

import io

output = io.BytesIO()
output.write(requests.get(buildurl(jornal, date, page)).content)
output.seek(0)
pdf_reader = PyPDF2.PdfFileReader(output)

I haven't tested this in your context, but I did test this simple example and it works:

import io

output = io.BytesIO()
output.write(bytes("hello world","ascii"))
output.seek(0)
print(output.read())

Output:

b'hello world'
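
As a side note, the write-then-seek pattern above should be equivalent to passing the bytes straight to the BytesIO constructor, which leaves the stream positioned at the start. A small sketch comparing the two, using only the standard library (the variable names are illustrative):

```python
import io

data = b"hello world"

# Pattern from this answer: empty buffer, write, then rewind.
written = io.BytesIO()
written.write(data)
position_after_write = written.tell()  # the position is now at the end of the data
written.seek(0)

# Passing the bytes to the constructor starts the stream at position 0.
direct = io.BytesIO(data)
position_direct = direct.tell()

# Both streams now yield the same bytes when read from their current position.
same_content = written.read() == direct.read()
print(position_after_write, position_direct, same_content)
```

The explicit seek(0) matters only in the first form, because write() moves the position to the end of what was written.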
