python - Using PDFMiner as a library: "AttributeError: 'NoneType' object has no attribute 'getobj'"

Question

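I'm writing a script that uploads PDF files and parses them in the process. For the parsing I use PDFMiner.

To turn the file into a PDFMiner document, I use the following function, written exactly according to the instructions on the PDFMiner page: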

def load_document(self, _file = None):
    """turn the file into a PDFMiner document"""
    if _file == None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    doc.set_parser(parser)
    if self.options['password']:
        password = self.options['password']
    else:
        password = ""
    doc.initialize(password)
    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc

The expected result is, of course, a nice PDFDocument instance, but instead I get an error:

Traceback (most recent call last):
  File "bzk_pdf.py", line 45, in <module>
    cli.run_cli(BZKPDFScraper)
  File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
    instance = cls(options)
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
    self.doc = self.load_document()
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
    doc.set_parser(parser)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
AttributeError: 'NoneType' object has no attribute 'getobj'

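I don't know where to look, and I haven't found anyone else with the same problem.

Some extra information that might help:

This is my test file: http://www.2shared.com/document/kM_wrI3J/testpdf.html
_file is a Django File object, but using a plain file gives the same result
pdfminer version: 'pdfminer-20110515'
Django: 1.4.3 (I don't think this matters)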
Python 2.7.3

score 2 · Accepted Answer

After some experimenting, I found that I was missing one line:

parser.set_document(doc)

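With that line added, the function now works.

To me this looks like poor library design, but I may have missed something, in which case this only patches over the error.

In any case, I now have a PDF document containing the data I need.

Here is the final result: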

from pdfminer.pdfparser import PDFParser, PDFDocument  # import path for pdfminer-20110515

def load_document(self, _file=None):
    """turn the file into a PDFMiner document"""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    # link parser and document in both directions before anything is parsed
    parser.set_document(doc)
    doc.set_parser(parser)

    # fall back to an empty password when none was supplied
    if 'password' in self.options:
        password = self.options['password']
    else:
        password = ""

    doc.initialize(password)

    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc
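
For anyone wondering why the order matters: the traceback ends in `self.doc.getobj(...)`, which suggests that the object references handed out by the parser carry a pointer to the document, and that pointer is still None if `parser.set_document(doc)` was never called. The sketch below is a toy model of that situation only. `ToyParser`, `ToyDocument`, `ToyObjRef` and `load` are made-up names for illustration and are not PDFMiner's actual classes or code.

```python
class ToyObjRef(object):
    """Stand-in for an indirect object reference (hypothetical)."""

    def __init__(self, doc, objid):
        self.doc = doc          # None if the parser was never given a document
        self.objid = objid

    def resolve(self):
        # Fails with AttributeError when self.doc is None
        return self.doc.getobj(self.objid)


class ToyParser(object):
    """Stand-in for the parser (hypothetical)."""

    def __init__(self):
        self.doc = None

    def set_document(self, doc):
        self.doc = doc

    def read_trailer(self):
        # References are created with whatever document the parser knows about
        return {"Info": ToyObjRef(self.doc, 1)}


class ToyDocument(object):
    """Stand-in for the document (hypothetical)."""

    def __init__(self):
        self.objects = {1: {"Title": "test"}}
        self.info = []

    def getobj(self, objid):
        return self.objects[objid]

    def set_parser(self, parser):
        # Resolving the trailer entry needs the reference to point back here
        trailer = parser.read_trailer()
        self.info.append(trailer["Info"].resolve())


def load(link_first):
    parser = ToyParser()
    doc = ToyDocument()
    if link_first:
        parser.set_document(doc)   # the line that was missing
    doc.set_parser(parser)
    return doc
```

In this toy model, `load(True)` returns a document whose `info` list is populated, while `load(False)` raises the same kind of AttributeError as in the question.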

score 0 · Answer

Try opening the file and sending it to the parser, like this:

with open(_file, 'rb') as f:
    parser = PDFParser(f)
    # your normal code here

The way you are doing it now, I suspect you are sending the filename as a string.
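
If the caller might pass either a filename or an already open file, a small wrapper can normalise both cases before the parser sees them. This is only a sketch using the standard library; `open_pdf_source` is a hypothetical helper name and is not part of PDFMiner.

```python
import contextlib


@contextlib.contextmanager
def open_pdf_source(source):
    """Yield a binary file object for a path string or an open file."""
    if isinstance(source, str):
        # A path was given: open it in binary mode and close it afterwards
        with open(source, "rb") as handle:
            yield handle
    else:
        # Assume a file-like object; the caller stays responsible for closing it
        yield source
```

It would be used as `with open_pdf_source(_file) as f:` followed by `parser = PDFParser(f)` and the rest of the normal code inside the block.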
