python - Parsing a PDF with no /Root object using PDFMiner

Question

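I'm trying to extract text from a large number of PDFs using the PDFMiner Python bindings. The module I wrote works for many PDFs, but for a subset of them I get this somewhat cryptic error:

ipython stack trace: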

/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331                 break
    332         else:
--> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334         if self.catalog.get('Type') is not LITERAL_CATALOG:
    335             if STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

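Of course, I immediately checked whether these PDFs were corrupted, but they open and read just fine.

Is there any way to read these PDFs despite the missing root object? I'm not really sure where to go from here.

Many thanks!

Edit:

I tried PyPDF in an attempt to get some differential diagnostics. The stack trace is below: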

In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
    372         self.flattenedPages = None
    373         self.resolvedObjects = {}
--> 374         self.read(stream)
    375         self.stream = stream
    376         self._override_encryption = False

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
    708             line = self.readNextEndLine(stream)
    709         if line[:5] != "%%EOF":
--> 710             raise utils.PdfReadError, "EOF marker not found"
    711 
    712         # find startxref entry - the location of the xref table


PdfReadError: EOF marker not found

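Quonux suggested that PDFMiner might stop parsing after reaching the first EOF marker. The PyPDF trace seems to suggest otherwise, but I'm pretty much clueless. Any thoughts?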

score 6 · Accepted Answer

The solution with slate pdf is to open the file with 'rb', i.e. read-binary mode.

slate pdf depends on PDFMiner and I had the same problem, so this should solve yours as well.

import slate

# Open in binary mode ('rb'); the raw string keeps the Windows backslashes literal.
fp = open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb')
doc = slate.PDF(fp)
print doc
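To see why binary mode can matter, here is a minimal sketch (Python 3, standard library only). The file contents are made up for illustration and are not a valid PDF. A PDF's cross-reference table stores byte offsets, so any newline translation done by text mode shifts everything that follows.

```python
import os
import tempfile

# Made-up bytes with Windows-style line endings; not a real PDF.
data = b"%PDF-1.4\r\n1 0 obj\r\n<< /Type /Catalog >>\r\nendobj\r\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

# Binary mode returns the bytes exactly as stored on disk.
with open(path, "rb") as fp:
    raw = fp.read()

# Text mode decodes the bytes and translates "\r\n" to "\n".
with open(path, "r", encoding="latin-1") as fp:
    text = fp.read()

os.remove(path)

# The same keyword is found at different positions in the two reads.
binary_offset = raw.find(b"endobj")
text_offset = text.find("endobj")
print(binary_offset, text_offset)
```

The two offsets differ by the number of carriage returns dropped before the keyword, which is enough to make a stored xref offset point at the wrong place.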

score 5 · Accepted Answer

Interesting problem. I did some research.

The function that parses the PDF (from the miner source code):

def set_parser(self, parser):
        "Set the document to use a given PDFParser object."
        if self._parser: return
        self._parser = parser
        # Retrieve the information of each header that was appended
        # (maybe multiple times) at the end of the document.
        self.xrefs = parser.read_xref()
        for xref in self.xrefs:
            trailer = xref.get_trailer()
            if not trailer: continue
            # If there's an encryption info, remember it.
            if 'Encrypt' in trailer:
                #assert not self.encryption
                self.encryption = (list_value(trailer['ID']),
                                   dict_value(trailer['Encrypt']))
            if 'Info' in trailer:
                self.info.append(dict_value(trailer['Info']))
            if 'Root' in trailer:
                #  Every PDF file must have exactly one /Root dictionary.
                self.catalog = dict_value(trailer['Root'])
                break
        else:
            raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        if self.catalog.get('Type') is not LITERAL_CATALOG:
            if STRICT:
                raise PDFSyntaxError('Catalog not found!')
        return

If you had a problem with EOF, a different exception would be raised. Here is another function from the source:

def load(self, parser, debug=0):
        while 1:
            try:
                (pos, line) = parser.nextline()
                if not line.strip(): continue
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            if not line:
                raise PDFNoValidXRef('Premature eof: %r' % parser)
            if line.startswith('trailer'):
                parser.seek(pos)
                break
            f = line.strip().split(' ')
            if len(f) != 2:
                raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
            try:
                (start, nobjs) = map(long, f)
            except ValueError:
                raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
            for objid in xrange(start, start+nobjs):
                try:
                    (_, line) = parser.nextline()
                except PSEOF:
                    raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
                f = line.strip().split(' ')
                if len(f) != 3:
                    raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
                (pos, genno, use) = f
                if use != 'n': continue
                self.offsets[objid] = (int(genno), long(pos))
        if 1 <= debug:
            print >>sys.stderr, 'xref objects:', self.offsets
        self.load_trailer(parser)
        return

From the wiki (PDF specs): a PDF file consists primarily of objects, of which there are eight types:

Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object

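Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows efficient random access to the objects in the file, and also allows small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that contain large numbers of small indirect objects and is especially useful for Tagged PDF.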
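The layout described above can be inspected without any PDF library. The sketch below (Python 3, standard library only) finds the last `startxref` keyword, reads the byte offset that follows it, and checks whether the classic `xref` keyword sits at that offset. The helper name `find_startxref` and the tiny hand-built sample are hypothetical. The sample is not a complete PDF, and files that use cross-reference streams (PDF 1.5 and later) will not have the `xref` keyword at that offset.

```python
import re

def find_startxref(data):
    """Return the byte offset following the last 'startxref', or None."""
    matches = re.findall(rb"startxref\s+(\d+)", data)
    return int(matches[-1]) if matches else None

# Hand-built sample: one object, a two-entry xref table, and a trailer.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)
sample = (
    body
    + b"xref\n0 2\n"
    + b"0000000000 65535 f \n"   # entry for object 0 (free)
    + b"0000000009 00000 n \n"   # entry for object 1 (in use, at byte 9)
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

offset = find_startxref(sample)
# The offset after startxref should land exactly on the xref keyword.
points_at_xref = sample[offset:offset + 4] == b"xref"
# The xref entry for object 1 says byte 9; check that the object starts there.
object_one_ok = sample[9:9 + 7] == b"1 0 obj"
print(offset, points_at_xref, object_one_ok)
```

On a real file, `find_startxref(open(path, "rb").read())` returning `None`, or an offset that does not land on a cross-reference section, would be one hint as to why a parser never reaches a trailer with `/Root`.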

I think the problem is that your "damaged PDF" has several "root elements" on the page.

Possible solution:

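You can download the source and add print statements at every place where xref objects are retrieved and where the parser tries to parse those objects. That should let you see the full sequence of events leading up to this error.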
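Short of patching the PDFMiner source, a rough first check is to scan the raw bytes for the markers the parser depends on. The sketch below is Python 3 with only the standard library. `summarize_pdf_markers` is a hypothetical helper, and it uses plain byte searches rather than real PDF parsing, so it will miss trailers stored in compressed cross-reference streams.

```python
def summarize_pdf_markers(data):
    """Count the structural markers found in the raw bytes of a PDF file."""
    last_eof = data.rfind(b"%%EOF")
    return {
        "eof_markers": data.count(b"%%EOF"),
        "startxref_keywords": data.count(b"startxref"),
        "trailer_keywords": data.count(b"trailer"),
        "root_keys": data.count(b"/Root"),
        # Bytes after the last %%EOF; large values suggest appended junk.
        "bytes_after_last_eof": (
            len(data) - (last_eof + len(b"%%EOF")) if last_eof != -1 else None
        ),
    }

# Made-up fragments for illustration; neither is a complete PDF.
good = b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n45\n%%EOF\n"
bad = b"trailer\n<< /Size 2 >>\nstartxref\n45\n%%EOF\n<html>junk</html>"

good_summary = summarize_pdf_markers(good)
bad_summary = summarize_pdf_markers(bad)
print(good_summary)
print(bad_summary)
```

For a real file, pass `open(path, "rb").read()`. Trailing bytes after the last `%%EOF` would be consistent with the PyPDF "EOF marker not found" error in the question, though that is only a guess about those particular files.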

PS: I think this is some kind of bug in the product.

score 1 · Accepted Answer

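I had the same problem on Ubuntu, and I have a very simple solution: just print the PDF file to a PDF. If you are on Ubuntu:

Open the PDF file with the (Ubuntu) document viewer.
Go to File.
Go to Print.
Choose Print to File and tick "pdf".

If you want to automate the process, you can use a script that prints all your PDF files automatically. A Linux script like this also works: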

for f in *.pdfx
do
lowriter --headless --convert-to pdf "$f"
done

Note that I renamed the original (problematic) PDF files to use the .pdfx extension.

score 0 · Accepted Answer

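I ran into this error too and kept trying fp = open('example','rb').

However, I still got the error the OP describes. It turned out there was a bug in my code: the PDF was still open in another function.
So make sure you don't have the PDF open somewhere else in memory.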
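One way to avoid leaving a handle open is to read the file inside a `with` block, so it is closed as soon as the block exits. A minimal sketch (Python 3, standard library only; `read_pdf_bytes` is a hypothetical helper, and the temporary file stands in for a real PDF):

```python
import os
import tempfile

def read_pdf_bytes(path):
    """Read the whole file in binary mode and close it before returning."""
    with open(path, "rb") as fp:
        data = fp.read()
    # The with block has exited, so the handle is closed here.
    return data, fp.closed

# Stand-in file for illustration; not a complete PDF.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\n%%EOF\n")
    path = tmp.name

data, closed = read_pdf_bytes(path)
os.remove(path)
print(len(data), closed)
```

Because the bytes are fully in memory, they can be wrapped in `io.BytesIO` for a library that expects a file object, and the file on disk does not need to stay open while the document is parsed.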

score -1 · Accepted Answer

One of the answers above is right. This error only appears on Windows, and the workaround is to replace with open(path, 'rb') with fp = open(path,'rb').
