python - Split a PDF based on its outline

Question

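I want to use pyPdf to split a PDF file based on its outline, where each destination in the outline refers to a different page in the PDF.

Example outline:

main --> points to page 1
  sect1 --> points to page 1
  sect2 --> points to page 15
  sect3 --> points to page 22

In pyPdf it is easy to iterate over every page of the document, or over every destination in the document's outline; however, I can't figure out how to get the page number a destination points to.

Does anyone know how to find the page number referenced by each destination in the outline?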

score 9 · Accepted Answer

I figured it out:

class Darrell(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):
        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result

pdf = Darrell(open('PATH-TO-PDF', 'rb'))  # replace PATH-TO-PDF with the path to your file
template = '%-5s  %s'
print template % ('page', 'title')
for p,t in sorted([(v,k) for k,v in pdf.getDestinationPageNumbers().iteritems()]):
    print template % (p+1,t)
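
Once the title-to-page mapping is available, the split boundaries follow from sorting the destinations by page. The sketch below is not part of the original answer: `outline_page_ranges` is a hypothetical helper that assumes the mapping holds 0-based page numbers (as `getDestinationPageNumbers()` above returns) and that the total page count is known. The page count of 30 in the example is made up. It is written for Python 3 and does not depend on pyPdf.

```python
def outline_page_ranges(dest_pages, num_pages):
    """Turn {title: 0-based start page} into a list of (title, start, end).

    `end` is exclusive. Destinations that share a start page (such as
    'main' and 'sect1' in the question) are grouped under the first title
    in sorted order, so no empty range is produced.
    """
    # Drop unresolved destinations (the '???' placeholder used above).
    resolved = [(p, t) for t, p in dest_pages.items() if isinstance(p, int)]
    resolved.sort()

    starts = []
    for page, title in resolved:
        if starts and starts[-1][0] == page:
            continue  # same start page as the previous entry: skip it
        starts.append((page, title))

    ranges = []
    for i, (page, title) in enumerate(starts):
        # A section ends where the next one starts, or at the last page.
        end = starts[i + 1][0] if i + 1 < len(starts) else num_pages
        ranges.append((title, page, end))
    return ranges


# Outline from the question, converted to 0-based page numbers.
# num_pages=30 is an assumed value for illustration.
example = {'main': 0, 'sect1': 0, 'sect2': 14, 'sect3': 21}
ranges = outline_page_ranges(example, num_pages=30)
```

Each range can then be copied into its own `PdfFileWriter` by calling `addPage(pdf.getPage(n))` for every `n` from `start` up to, but not including, `end`.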

score 1 · Accepted Answer

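A small update to @darrell's class so that it can parse UTF-8 outlines. I'm posting it as an answer because code in a comment is hard to read.

The problem is that pyPdf.pdf.Destination.title may be returned as either of two types:

pyPdf.generic.TextStringObject
pyPdf.generic.ByteStringObject

As a result, the output of the _setup_outline_page_ids() function also contains two different types of title object, and the code fails with a UnicodeDecodeError if an outline title contains anything other than ASCII.

I added this code to solve the problem: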

if isinstance(title, pyPdf.generic.TextStringObject):
    title = title.encode('utf-8')

The whole class:

class PdfOutline(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):

        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            if isinstance(title, pyPdf.generic.TextStringObject):
                title = title.encode('utf-8')
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result
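
The same normalisation can also go the other way, decoding byte titles so that every key ends up as text. Below is a minimal Python 3 sketch, not taken from the answer. It assumes byte titles are UTF-8 and falls back to Latin-1; that encoding choice is an assumption, not something pyPdf guarantees. `normalize_title` is a hypothetical helper.

```python
def normalize_title(title):
    """Return `title` as text, whether it arrived as bytes or as str."""
    if isinstance(title, bytes):
        try:
            return title.decode('utf-8')
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a character, so this cannot fail.
            return title.decode('latin-1')
    return str(title)


# Sample titles: plain ASCII bytes, text, UTF-8 bytes, and Latin-1 bytes.
titles = [b'Introduction', 'Einf\u00fchrung', b'R\xc3\xa9sum\xc3\xa9', b'Caf\xe9']
normalized = [normalize_title(t) for t in titles]
```

With a single key type, dictionary lookups and sorting by title no longer mix byte strings and text.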

score 0 · Accepted Answer

Darrell's class can be modified slightly to produce a multi-level table of contents for a PDF (in the manner of pdftoc in the pdftk toolkit).

My modification adds one more parameter to _setup_page_id_to_num, an integer "level" which defaults to 1. Each invocation increments the level. Instead of storing just the page number in the result, we store the pair of page number and level. Appropriate modifications should be applied when using the returned result.
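
One way to picture the idea, as a sketch rather than this answer's actual code: pyPdf returns the outline as nested lists, so the recursion depth can serve as the level. The version below tracks the level while walking the outline; which function carries the `level` parameter is a choice made here for illustration. `Dest` is a stand-in for `pyPdf.pdf.Destination` so the example runs without pyPdf, and `outline_levels` is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for pyPdf.pdf.Destination: just a title and a 0-based page number.
Dest = namedtuple('Dest', ['title', 'page'])


def outline_levels(outline, level=1, _result=None):
    """Flatten a nested outline into a list of (title, page, level)."""
    if _result is None:
        _result = []
    for obj in outline:
        if isinstance(obj, Dest):
            _result.append((obj.title, obj.page, level))
        elif isinstance(obj, list):
            # A nested list holds the children of the preceding entry.
            outline_levels(obj, level + 1, _result)
    return _result


# The outline from the question, with 0-based page numbers.
outline = [
    Dest('main', 0),
    [Dest('sect1', 0), Dest('sect2', 14), Dest('sect3', 21)],
]
toc = outline_levels(outline)
for title, page, level in toc:
    # Indent each entry by its level to show the hierarchy.
    print('%s%s (page %d)' % ('  ' * (level - 1), title, page + 1))
```

Keeping the result as an ordered list rather than a dictionary preserves the outline order, which a sidebar table of contents needs.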

I am using this to implement the "PDF Hacks" browser-based page-at-a-time document viewer, with a sidebar table of contents that reflects LaTeX section, subsection, etc. bookmarks. I am working on a shared system where pdftk cannot be installed but where Python is available.

score 0 · Accepted Answer

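This is exactly what I was looking for. Darrell's additions to PdfFileReader should be part of PyPDF2.

I wrote a small recipe that uses PyPDF2 and sejda-console to split a PDF by its bookmarks. In my case, I wanted to keep several level-1 sections together. The script lets me do that and gives the resulting files meaningful names.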

import operator
import os
import subprocess
import sys
import time

import PyPDF2 as pyPdf

# need to have sejda-console installed
# change this to point to your installation
sejda = 'C:\\sejda-console-1.0.0.M2\\bin\\sejda-console.bat'

class Darrell(pyPdf.PdfFileReader):
    ...

if __name__ == '__main__':
    t0= time.time()

    # get the name of the file to split as a command line arg
    pdfname = sys.argv[1]

    # open up the pdf
    pdf = Darrell(open(pdfname, 'rb'))

    # build list of (pagenumbers, newFileNames)
    splitlist = [(1,'FrontMatter')] # Customize name of first section

    template = '%-5s  %s'
    print template % ('Page', 'Title')
    print '-'*72
    for t,p in sorted(pdf.getDestinationPageNumbers().iteritems(),
                      key=operator.itemgetter(1)):

        # Customize this to get it to split where you want
        if t.startswith('Chapter') or \
           t.startswith('Preface') or \
           t.startswith('References'):

            print template % (p+1, t)

            # this customizes how files are renamed
            new = t.replace('Chapter ', 'Chapter')\
                   .replace(':  ', '-')\
                   .replace(': ', '-')\
                   .replace(' ', '_')
            splitlist.append((p+1, new))

    # call sejda tools and split document
    call = sejda
    call += ' splitbypages'
    call += ' -f "%s"'%pdfname
    call += ' -o ./'
    call += ' -n '
    call += ' '.join([str(p) for p,t in splitlist[1:]])
    print '\n', call
    subprocess.call(call)
    print '\nsejda-console has completed.\n\n'

    # rename the split files
    for p,t in splitlist:
        old ='./%i_'%p + pdfname
        new = './' + t + '.pdf'
        print 'renaming "%s"\n      to "%s"...'%(old, new),

        try:
            os.remove(new)
        except OSError:
            pass

        try:
            os.rename(old, new)
            print' succeeded.\n'
        except OSError:
            print' failed.\n'

    print '\ndone. Splitting took %.2f seconds'%(time.time() - t0)
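
The script above passes the command to `subprocess.call` as a single string without `shell=True`. That works on Windows, but on POSIX systems a bare string is treated as the program name. A hedged sketch of building the same command as an argument list follows. The `splitbypages`, `-f`, `-o` and `-n` arguments are copied from the script above and have not been checked against other sejda-console versions. `build_split_command`, the executable name and the file name are hypothetical.

```python
def build_split_command(sejda, pdfname, pages, outdir='./'):
    """Return the sejda-console call as a list suitable for subprocess.call."""
    cmd = [sejda, 'splitbypages', '-f', pdfname, '-o', outdir, '-n']
    # Page numbers follow -n as separate arguments, as in the script above.
    cmd.extend(str(p) for p in pages)
    return cmd


# 'sejda-console' and 'book.pdf' are placeholder names for illustration.
cmd = build_split_command('sejda-console', 'book.pdf', [15, 22])
# subprocess.call(cmd) would run it; the file name needs no manual quoting.
```

With a list, file names containing spaces are passed through unchanged, so the explicit `"%s"` quoting is no longer needed.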
