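I am writing a script that uploads PDF files and parses them in the process. For the parsing I use PDFMiner.
To turn the file into a PDFMiner document I use the following function, which follows the instructions linked above exactly: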
    def load_document(self, _file = None):
        """turn the file into a PDFMiner document"""
        if _file == None:
            _file = self.options['file']
        parser = PDFParser(_file)
        doc = PDFDocument()
        doc.set_parser(parser)
        if self.options['password']:
            password = self.options['password']
        else:
            password = ""
        doc.initialize(password)
        if not doc.is_extractable:
            raise ValueError("PDF text extraction not allowed")
        return doc
The expected result is of course a nice PDFDocument instance, but instead I get an error:
    Traceback (most recent call last):
      File "bzk_pdf.py", line 45, in <module>
        cli.run_cli(BZKPDFScraper)
      File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
        instance = cls(options)
      File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
        self.doc = self.load_document()
      File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
        doc.set_parser(parser)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
        self.info.append(dict_value(trailer['Info']))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
        x = resolve1(x)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
        x = x.resolve()
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
        return self.doc.getobj(self.objid)
    AttributeError: 'NoneType' object has no attribute 'getobj'
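I have no idea where to look, and I haven't found anyone else with the same problem.
Some extra information that might help:
- This is my test file: http://www.2shared.com/document/kM_wrI3J/testpdf.html
- _file is a Django File object, but using a plain file gives the same result
- pdfminer version: 'pdfminer-20110515'
- Django: 1.4.3 (I don't think this matters)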
- Python 2.7.3
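
As far as I can read the last frame of the traceback, the object being resolved has a `doc` attribute that is `None` at the moment `self.doc.getobj(self.objid)` runs. Below is a minimal sketch of that failure mode. `ObjRef`, `FakeDocument` and `try_resolve` are hypothetical stand-ins I made up for illustration, not PDFMiner's actual classes, and the sketch does not show why `doc` ends up as `None` in my case.

```python
class FakeDocument(object):
    """Hypothetical document that can look objects up by id."""

    def __init__(self, objects):
        self.objects = objects

    def getobj(self, objid):
        return self.objects[objid]


class ObjRef(object):
    """Hypothetical stand-in for a reference to an object in a document."""

    def __init__(self, doc, objid):
        self.doc = doc      # document used for the lookup; may be None
        self.objid = objid

    def resolve(self):
        # Same shape as the failing line in the traceback.
        return self.doc.getobj(self.objid)


def try_resolve(ref):
    """Return the resolved object, or the error message if resolving fails."""
    try:
        return ref.resolve()
    except AttributeError as exc:
        return str(exc)


good = ObjRef(FakeDocument({1: {"Title": "test"}}), 1)
bad = ObjRef(None, 1)  # reference created without a document attached

print(try_resolve(good))
print(try_resolve(bad))
```

The second call fails the same way my script does, so my guess is that the references built while parsing never get a document attached to them.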