python - Preserving data types in Colaboratory

Question

I am trying to read a PDF document with PyPDF2 and output a plain-text string. However, when I upload my PDF file to Colaboratory using this code:

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

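the file is automatically converted to type str instead of being kept as an encoded string. This causes an error in PyPDF2.PdfFileReader(), but if you print the string, it still contains all of the encoded characters: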

gsutilCheatSheet.pdf => %PDF-1.5 %�� 1 0 obj <>/Metadata 117 0 R/ViewerPreferences 118 0 R>> endobj

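and so on.

Is there a way to keep the imported document in its original encoded form, or, once it is already a str, is there another way to strip the encoding?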

score 0 · Accepted Answer

I suspect you need to wrap the uploaded file in io.BytesIO.

Here is a complete example showing how to open an uploaded PDF with PyPDF2: https://colab.research.google.com/notebook#fileId=1XlmXcp4xnrUGMUArevxiGNlrbMOMECO1

The key part is:

pdf = PdfFileReader(io.BytesIO(uploaded['abc123.pdf']))
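
As a rough illustration of why the wrapper helps: a PDF reader expects a seekable file-like object, and io.BytesIO provides one on top of the raw uploaded bytes. The sketch below uses only the standard library. The helper name as_pdf_stream and the sample bytes are hypothetical stand-ins for uploaded['abc123.pdf'], and the header check is an extra sanity test, not something PyPDF2 requires.

```python
import io


def as_pdf_stream(data):
    """Wrap raw uploaded bytes in a seekable in-memory file object.

    Hypothetical helper for illustration; not part of PyPDF2 or google.colab.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected raw bytes, got " + type(data).__name__)
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not start with the %PDF- header")
    return io.BytesIO(data)


# Stand-in for uploaded['abc123.pdf']; a real upload holds the whole file.
sample = b"%PDF-1.5\n1 0 obj\n<<>>\nendobj\n"
stream = as_pdf_stream(sample)

# The stream supports the file operations a reader relies on.
stream.seek(0, io.SEEK_END)   # jump to the end, as when locating the trailer
size = stream.tell()          # total number of bytes in the stream
stream.seek(0)                # rewind to the start
header = stream.read(8)       # first eight bytes: the PDF version header
stream.seek(0)                # rewind again before handing the stream over
```

In Colab, the rewound stream would then be passed to the reader in place of the raw upload, as in the PdfFileReader call above.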
