pdf - pyPDF2 TypeError when trying to extract text

Question

I installed pyPDF successfully, but its extractText method did not work well, so I decided to try pyPDF2. The problem is that extracting text raises an exception:

Traceback (most recent call last):
  File "C:\Users\Asus\Desktop\pfdtest.py", line 44, in <module>
    test2()
  File "C:\Users\Asus\Desktop\pfdtest.py", line 41, in test2
    print(mypdf.getPage(0).extractText())
  File "C:\Python32\lib\site-packages\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python32\lib\site-packages\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())
TypeError: initial_value must be str or None, not bytes

Here is my sample code:

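from PyPDF2 import PdfFileReader

filename = "myfile.pdf"
f = open(filename, 'rb')  # PDFs must be opened in binary mode
mypdf = PdfFileReader(f)
print(f, mypdf, mypdf.getNumPages())
print(mypdf.getPage(0).extractText())  # this call raises the TypeError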

It correctly determines the number of pages in the PDF, but it fails when reading the content stream.

score 1 · Accepted Answer

This is a compatibility problem between PyPDF2 and Python 3.

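In my case, I solved it by replacing pdf.py and utils.py with the versions you will find here. They basically check whether you are running Python 3 and, if you are, receive the data as bytes instead of as a string.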
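To illustrate why the traceback ends in a TypeError, here is a minimal sketch that uses only the standard library. It is not the code from the replacement pdf.py; the helper name open_stream and the sample bytes are hypothetical stand-ins for what stream.getData() returns on Python 3. The idea is simply to pick the file-like wrapper according to the type of the data:

```python
import io

# Hypothetical stand-in for the bytes that stream.getData() returns on Python 3.
data = b"BT /F1 12 Tf (Hello) Tj ET"

def open_stream(data):
    """Wrap stream data in a file-like object that matches its type."""
    if isinstance(data, bytes):
        return io.BytesIO(data)   # Python 3: raw bytes need BytesIO
    return io.StringIO(data)      # text data can still use StringIO

# Reproduce the failure from the traceback: StringIO rejects bytes.
error_message = None
try:
    io.StringIO(data)
except TypeError as exc:
    error_message = str(exc)

# The type-aware wrapper accepts the same data without raising.
stream = open_stream(data)
first_token = stream.read(2)
print(error_message)
print(first_token)
```

On Python 3 the first call fails in the same way as line 1783 of pdf.py in the traceback, while the type-aware wrapper reads the data as bytes.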
