python - Why doesn't my code correctly split each page in a scanned pdf?

Question

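Update: Thanks to stardt, whose script works! That pdf is a single page taken from a larger pdf. I tried the script on the larger file, and it also split every pdf page correctly, but the order of the resulting pages is sometimes right and sometimes wrong. For example, for pages 25-28 of the pdf file, the printed page numbers are 14, 15, 17, 16. I wonder why? The entire pdf can be downloaded from http://download304.mediafire.com/u6ewhjt77lzg/bgf8uzvxatckycn/3.pdf

Original: I have a scanned pdf in which two paper pages sit side by side on one pdf page. I would like to split each pdf page into two, with the original left half becoming the earlier of the two new pdf pages.

Here is a Python script named un2up, inspired by Gilles: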

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight

    p.mediaBox.upperLeft = (0, h/2)
    p.mediaBox.upperRight = (w, h/2)
    p.mediaBox.lowerRight = (w, 0)
    p.mediaBox.lowerLeft = (0, 0)

    q.mediaBox.upperLeft = (0, h)
    q.mediaBox.upperRight = (w, h)
    q.mediaBox.lowerRight = (w, h/2)
    q.mediaBox.lowerLeft = (0, h/2)

    output.addPage(q)
    output.addPage(p)
output.write(sys.stdout)

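I ran the script on my pdf in a terminal with the command un2up < page.pdf > out.pdf, but the output out.pdf is not split correctly.

I also checked the values of the variables w and h, which come from p.mediaBox.upperRight. They are 514 and 1224, which does not match the page's actual proportions.

The file can be downloaded from http://download851.mediafire.com/bdr4sv7v5nzg/raci13ct5w4c86j/page.pdf.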

score 7 · Accepted Answer

Your code assumes that p.mediaBox.lowerLeft is (0,0), but it is actually (0, 497).

This works for the file you provided:

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for i in range(input.getNumPages()):
    p = input.getPage(i)
    q = copy.copy(p)

    bl = p.mediaBox.lowerLeft
    ur = p.mediaBox.upperRight

    print >> sys.stderr, 'splitting page',i
    print >> sys.stderr, '\tlowerLeft:',p.mediaBox.lowerLeft
    print >> sys.stderr, '\tupperRight:',p.mediaBox.upperRight

    p.mediaBox.upperRight = (ur[0], (bl[1]+ur[1])/2)
    p.mediaBox.lowerLeft = bl

    q.mediaBox.upperRight = ur
    q.mediaBox.lowerLeft = (bl[0], (bl[1]+ur[1])/2)
    if i%2==0:
        output.addPage(q)
        output.addPage(p)
    else:
        output.addPage(p)
        output.addPage(q)

output.write(sys.stdout)
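
The midpoint arithmetic in the script above can be checked on plain tuples, without pyPdf. This is only a sketch: `split_box` is a hypothetical helper name, and the numbers are the corner values reported in this thread. It shows that the cut belongs at y = 860.5, whereas assuming an origin of (0,0) puts it at 1224/2 = 612.

```python
def split_box(lower_left, upper_right):
    """Split a box at its vertical midpoint into (lower_half, upper_half).

    Each half is a (lower_left, upper_right) pair of (x, y) tuples.
    """
    x0, y0 = lower_left
    x1, y1 = upper_right
    mid_y = (y0 + y1) / 2.0  # midpoint of the real extent, not y1 / 2
    lower_half = ((x0, y0), (x1, mid_y))
    upper_half = ((x0, mid_y), (x1, y1))
    return lower_half, upper_half


# Corner values reported in this thread: lowerLeft (0, 497), upperRight (514, 1224).
lower, upper = split_box((0, 497), (514, 1224))
print(lower)  # ((0, 497), (514, 860.5))
print(upper)  # ((0, 860.5), (514, 1224))
```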

score 1 · Accepted Answer

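@stardt's code was very useful, but I ran into problems when splitting a batch of pdf files with different orientations. Here is a more general function that works regardless of page orientation: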

import copy
import math
import pyPdf

def split_pages(src, dst):
    src_f = file(src, 'r+b')
    dst_f = file(dst, 'w+b')

    input = pyPdf.PdfFileReader(src_f)
    output = pyPdf.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.mediaBox.lowerLeft
        x3, x4 = p.mediaBox.upperRight

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)
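        # Measure from both corners so a non-zero lower-left origin is handled.
        x5, x6 = math.floor((x1 + x3) / 2), math.floor((x2 + x4) / 2)

        if x3 - x1 > x4 - x2: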
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical
            p.mediaBox.upperRight = (x3, x4)
            p.mediaBox.lowerLeft = (x1, x6)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()
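
The orientation test in the function above can also be tried on plain tuples, without pyPdf. The following is a minimal sketch, not the answer's exact code: `split_by_orientation` is a hypothetical name, the 800 x 600 box is made up for illustration, and width and height are measured from both corners so that a non-zero lower-left corner is taken into account. The halves are returned in the same order the function above adds them (left before right, upper before lower).

```python
def split_by_orientation(lower_left, upper_right):
    """Return (first, second) half boxes, cutting across the longer side."""
    x0, y0 = lower_left
    x1, y1 = upper_right
    width, height = x1 - x0, y1 - y0
    if width > height:
        # Landscape box: cut at the horizontal midpoint, left half first.
        mid_x = (x0 + x1) / 2.0
        return ((x0, y0), (mid_x, y1)), ((mid_x, y0), (x1, y1))
    # Portrait box: cut at the vertical midpoint, upper half first.
    mid_y = (y0 + y1) / 2.0
    return ((x0, mid_y), (x1, y1)), ((x0, y0), (x1, mid_y))


# Hypothetical landscape box, then the portrait box reported in this thread.
landscape = split_by_orientation((0, 0), (800, 600))
portrait = split_by_orientation((0, 497), (514, 1224))
print(landscape)
print(portrait)
```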

score 0 · Accepted Answer

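I would like to add that you have to make sure the mediaBox object is not shared between the copies p and q. This can easily happen if you read p.mediaBox before taking the copy.

In that case, writing to e.g. p.mediaBox.upperRight may modify q.mediaBox, and vice versa.

@moraes's solution takes care of this by explicitly copying the mediaBox.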
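
The sharing described here is ordinary shallow-copy behaviour and can be reproduced without pyPdf. In this sketch a plain dict stands in for the page object and a list stands in for its media box. Both are stand-ins chosen for illustration, not pyPdf's actual classes.

```python
import copy

# A plain dict stands in for a page; the list stands in for its media box.
page = {"/MediaBox": [0, 497, 514, 1224]}

shared = copy.copy(page)          # shallow copy: both dicts hold the same list
shared["/MediaBox"][3] = 860.5    # edit through the copy...
print(page["/MediaBox"][3])       # 860.5, the original changed too

page["/MediaBox"][3] = 1224       # restore the original value
independent = copy.copy(page)
# Give the copy its own box, as the explicit mediaBox copy does.
independent["/MediaBox"] = copy.copy(page["/MediaBox"])
independent["/MediaBox"][3] = 860.5
print(page["/MediaBox"][3])       # 1224, the original is untouched
```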
