python - pyPdf: illegal UTF-16 surrogate

Question

I have a PDF file that breaks pyPdf: http://tovotu.de/tests/test.pdf

Here is the sample script:

from pyPdf import PdfFileWriter, PdfFileReader

outputPdf = PdfFileWriter()

inpdf = open("test.pdf","rb")
inputPdf = PdfFileReader(inpdf)
for page in inputPdf.pages:
    outputPdf.addPage(page)

with open("output.pdf","wb") as outpdf:
    outputPdf.write(outpdf)

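The error output is here: http://pastebin.com/0m38zhjQ

The error is the same with PyPDF2 from GitHub. pdftk handles this PDF just like any other. Note that writing fails, but reading seems to work fine!

Can you at least point out the exact part of the PDF that causes the error? A workaround would be even better :)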

score 1 · Accepted Answer

This looks like a bug in PyPDF2. In this section:

if string.startswith(codecs.BOM_UTF16_BE):
    retval = TextStringObject(string.decode("utf-16"))
    retval.autodetect_utf16 = True

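It assumes that any string starting with the bytes (0xFE, 0xFF) can be decoded as UTF-16. Your file contains a byte string that starts that way but then continues with invalid UTF-16.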
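
The failure can be reproduced outside PyPDF2. The following is a minimal Python 3 sketch; the byte strings are made up for illustration and are not taken from your file, and decodes_as_utf16 is a hypothetical helper name. It shows that a leading BOM does not guarantee the rest of the string is valid UTF-16:

```python
import codecs


def decodes_as_utf16(data):
    """Return True if data decodes cleanly as UTF-16, False otherwise."""
    try:
        data.decode("utf-16")
        return True
    except UnicodeDecodeError:
        return False


# A well-formed string: big-endian BOM followed by "hi" in UTF-16 BE.
good = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# An ill-formed string (illustrative only): the same BOM followed by a
# lone low surrogate (0xDC00), which is not allowed on its own in UTF-16.
bad = codecs.BOM_UTF16_BE + b"\xdc\x00"

print(good.startswith(codecs.BOM_UTF16_BE), decodes_as_utf16(good))
print(bad.startswith(codecs.BOM_UTF16_BE), decodes_as_utf16(bad))
```

Both strings pass the startswith check, but only the first one survives the decode call.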

The simplest workaround is to comment out that if and unconditionally use the branch marked "# This is probably a big performance hit here".
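
A less invasive variant, which is my own suggestion rather than anything PyPDF2 does, would be to keep the BOM check but fall back when decoding fails. The sketch below is standalone Python 3; decode_text_string is a hypothetical helper, and returning the raw bytes on failure is an assumption standing in for whatever the library's fallback branch does:

```python
import codecs


def decode_text_string(raw):
    """Hypothetical helper, not part of PyPDF2.

    Decode raw as UTF-16 when it starts with a big-endian BOM and is
    actually valid; otherwise hand the bytes back unchanged so the
    caller can treat them as an opaque byte string.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        try:
            return raw.decode("utf-16")
        except UnicodeDecodeError:
            # The BOM was misleading: fall through to the byte-string path.
            pass
    return raw


valid = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
invalid = codecs.BOM_UTF16_BE + b"\xdc\x00"  # illustrative lone surrogate

print(repr(decode_text_string(valid)))
print(repr(decode_text_string(invalid)))
```

With this shape, well-formed UTF-16 strings are still decoded as before, and only the malformed ones take the slower path.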
