I am writing a script that uploads PDF files and parses them in the process. For parsing, I use PDFMiner.

To turn the file into a PDFMiner document, I use the following function, which follows the instructions linked above exactly:

def load_document(self, _file = None):
    """turn the file into a PDFMiner document"""
    if _file == None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    doc.set_parser(parser)
    if self.options['password']:
        password = self.options['password']
    else:
        password = ""
    doc.initialize(password)
    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc

The expected result is, of course, a nice PDFDocument instance, but instead I get an error:

Traceback (most recent call last):
  File "bzk_pdf.py", line 45, in <module>
    cli.run_cli(BZKPDFScraper)
  File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
    instance = cls(options)
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
    self.doc = self.load_document()
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
    doc.set_parser(parser)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
AttributeError: 'NoneType' object has no attribute 'getobj'

I don't know where to look, and I haven't found anyone else with the same problem.

Some extra information that might help:


2 Answers

After some experimenting, I found that I was missing one line:

parser.set_document(doc)

With that line added, the function now works.

To me this looks like poor library design, but it may be that I missed something and this line merely papers over the error.
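For what it's worth, the traceback fits a two-way link between parser and document: the failing call is `self.doc.getobj(...)` on an object reference whose `doc` is `None`, which is what you would see if the parser hands its current document to every reference it creates. Below is a toy sketch of that failure mode. All the `Toy*` classes and the `link` helper are hypothetical stand-ins I made up for illustration; they are not PDFMiner's actual code.

```python
class ToyObjRef:
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, doc, objid):
        self.doc = doc        # whatever document the parser knew about
        self.objid = objid

    def resolve(self):
        # Raises AttributeError when self.doc is None
        return self.doc.getobj(self.objid)


class ToyParser:
    """Hypothetical parser: passes its document to each reference it creates."""

    def __init__(self):
        self.doc = None

    def set_document(self, doc):
        self.doc = doc

    def read_trailer(self):
        return {'Info': ToyObjRef(self.doc, 1)}


class ToyDocument:
    """Hypothetical document: resolves the trailer as soon as it gets a parser."""

    def __init__(self):
        self.objects = {1: {'Title': 'example'}}
        self.info = []

    def getobj(self, objid):
        return self.objects[objid]

    def set_parser(self, parser):
        trailer = parser.read_trailer()
        self.info.append(trailer['Info'].resolve())


def link(parser, doc, both_ways=True):
    """Connect parser and document; skip the parser side to reproduce the error."""
    if both_ways:
        parser.set_document(doc)
    doc.set_parser(parser)
    return doc
```

Calling `link(ToyParser(), ToyDocument(), both_ways=False)` raises the same kind of `AttributeError` as in the question, while the default call fills `doc.info`.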

Either way, I now have a PDF document containing the data I need.

Here is the final result:

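# In the PDFMiner release shown in the traceback, both classes live in pdfminer.pdfparser
from pdfminer.pdfparser import PDFParser, PDFDocument

def load_document(self, _file=None):
    """Turn the file into a PDFMiner document."""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    # Link parser and document in both directions; set_document was the missing call
    parser.set_document(doc)
    doc.set_parser(parser)

    # Fall back to an empty password when the option is missing or empty
    password = self.options.get('password') or ""

    doc.initialize(password)

    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc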
answered 2013-02-17T12:33:26.960

Try opening the file and sending it to the parser, like this:

with open(_file,'rb') as f:
    parser = PDFParser(f)
    # your normal code here

The way you are doing it now, I suspect you are sending the filename as a string.
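If the `file` option can hold either a path or an already-open file, a small helper can normalize it before it reaches the parser. This is only a sketch: `open_pdf_stream` is a hypothetical name, not part of PDFMiner, and the `str` check is written for Python 3 (on Python 2.7 you would test against `basestring` instead).

```python
def open_pdf_stream(source):
    """Return a binary file object for a path or an already-open file.

    Hypothetical helper for illustration; the caller is responsible
    for closing the returned stream.
    """
    if isinstance(source, str):
        # A filename was passed: open it in binary mode, as PDFs are binary data
        return open(source, 'rb')
    # Anything else is assumed to be a file-like object and is returned unchanged
    return source
```

You would then pass the result to the parser, e.g. `parser = PDFParser(open_pdf_stream(_file))`.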

answered 2013-02-17T09:28:31.550