python - PDF "advanced" information extraction

Question

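I'm trying to write something that more or less interprets a PDF soft proof.

There is some information I'd like to extract, but I don't know how.

What I need to extract: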

Bleed:                    I got this somewhat working with pyPdf, given
                          that the document uses 72 dpi, which sadly isn't
                          always the case. I need to be able to calculate
                          the bleed in millimeters.

Print resolution (dpi):   If I read the PDF spec[1] correctly this ought to
                          always be 72 dpi, unless a page has UserUnit set,
                          which was only introduced in PDF-1.6, but shouldn't
                          print documents always be at least 300 dpi? I'm
                          afraid that I misunderstood something…

                          I'd also need the print resolution for images, if
                          they can differ from the default page resolution,
                          that is.

Text color:               I don't have the slightest clue on how to extract
                          this, the string 'text colour' only shows up once
                          in the whole spec, without any explanation how it
                          is set.

Image colormodel:         If I understand it correctly I can read this out
                          in pyPdf with page['/Group']['/CS'] which can be:
                           - /DeviceRGB
                           - /DeviceCMY
                           - /DeviceCMYK
                           - /DeviceGray
                           - /DeviceRGBK
                           - /DeviceN

Font 'embeddedness':      I read in another post on stackoverflow that I
                          can just iterate over the font resources and if a
                          resource has a '/FontFile'-key that means that
                          the font is embedded. Is this correct?

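If libraries other than pyPdf (or a combination of them) are better suited to extracting this information, they are more than welcome. So far I've fumbled around with pyPdf, pdfrw and pdfminer. None of them has particularly extensive documentation.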

[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

score 2 · Accepted Answer

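If I read the PDF spec[1] correctly this ought to always be 72 dpi, unless a page has UserUnit set, which was only introduced in PDF-1.6, but shouldn't print documents always be at least 300 dpi? I'm afraid that I misunderstood something…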

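You did indeed misunderstand something. The default user space unit, which defaults to 1/72 inch but can be changed on a per-page basis since PDF-1.6, does not define a print resolution. It merely defines what length a unit in the coordinates given by the user corresponds to by default (i.e. unless a size-changing transformation is active).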

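For printing, all data is converted into a device-dependent space whose resolution has nothing to do with the user space coordinates. The print resolution depends on the printing device and its drivers; it may also be limited by security settings that only allow low-quality printing.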
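Because user space units are lengths rather than pixels, a bleed measured in those units converts to millimeters directly, with no dpi involved. The following is only a minimal sketch of that arithmetic. The function names are made up for illustration, and it assumes you have already read the page's bleed box and trim box as (llx, lly, urx, ury) tuples, plus the page's UserUnit value (1.0 if the key is absent), with whatever library you use.

```python
MM_PER_INCH = 25.4
DEFAULT_UNITS_PER_INCH = 72.0  # the default user space unit is 1/72 inch


def user_units_to_mm(value, user_unit=1.0):
    """Convert a length in user space units to millimeters.

    user_unit is the page's UserUnit value, a multiple of 1/72 inch;
    1.0 applies when the key is absent.
    """
    return value * user_unit / DEFAULT_UNITS_PER_INCH * MM_PER_INCH


def bleed_mm(bleed_box, trim_box, user_unit=1.0):
    """Return the (left, bottom, right, top) bleed in millimeters.

    Both boxes are (llx, lly, urx, ury) tuples in user space units
    (hypothetical inputs, extracted beforehand).
    """
    b_llx, b_lly, b_urx, b_ury = bleed_box
    t_llx, t_lly, t_urx, t_ury = trim_box
    margins = (t_llx - b_llx, t_lly - b_lly, b_urx - t_urx, b_ury - t_ury)
    return tuple(user_units_to_mm(m, user_unit) for m in margins)
```

If a document has no explicit bleed box, you would have to decide which other page boundary to compare against the trim box; the sketch does not make that choice for you.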

I'd also need the print resolution for images, if they can differ from the default page resolution, that is.

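Images (well, bitmap images; there are also vector graphics in PDFs) each have their own resolution and can then be transformed (e.g. enlarged) before being rendered. So for an "image print resolution" you have to inspect each bitmap image and every place in the page content where it is inserted. And if the image is rotated, skewed and asymmetrically stretched, I wonder what number you would use as the resolution... ;)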
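To make the dependency on the placement concrete, here is a rough sketch of one possible definition of "effective dpi". It assumes you already know the image's pixel dimensions and the complete transformation matrix (a, b, c, d, e, f) in effect when the image is painted, accumulated over all nested transformations; obtaining that matrix is the hard part and is not shown. The function name is hypothetical. PDF maps an image onto the unit square, so the image's two edges become the vectors (a, b) and (c, d) in user space.

```python
import math


def effective_image_dpi(pixel_width, pixel_height, ctm, user_unit=1.0):
    """Estimate the (horizontal, vertical) dpi of a placed bitmap image.

    ctm is the full matrix (a, b, c, d, e, f) active when the image is
    painted. The lengths of the transformed unit-square edges give the
    printed size of the image along its own two axes.
    """
    a, b, c, d, _e, _f = ctm
    width_units = math.hypot(a, b)   # length of the image's bottom edge
    height_units = math.hypot(c, d)  # length of the image's left edge
    if width_units == 0 or height_units == 0:
        raise ValueError("degenerate matrix: the image has no area")
    width_inches = width_units * user_unit / 72.0
    height_inches = height_units * user_unit / 72.0
    return pixel_width / width_inches, pixel_height / height_inches
```

Measuring along the image's own edges keeps the numbers stable under rotation, but for a skewed image the two values no longer describe the pixel density in any single direction on the paper, which is exactly the ambiguity mentioned above.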

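Text color: I don't have the slightest clue on how to extract this, the string 'text colour' only shows up once in the whole spec, without any explanation how it is set.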

Have a look at section 9.2.3 of the spec:

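The colour used for painting glyphs shall be the current colour in the graphics state: either the nonstroking colour or the stroking colour (or both), depending on the text rendering mode (see 9.3.6, "Text Rendering Mode"). The default colour shall be black (in DeviceGray), but other colours may be obtained by executing an appropriate colour-setting operator or operators (see 8.6.8, "Colour Operators") before painting the glyphs.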

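There you find a number of pointers to interesting sections. Be aware, though, that text does not merely come colored; it can also be rendered as a clip path applied to whatever background there is.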
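As an illustration of what "current colour in the graphics state" means in practice, the following is a deliberately simplified sketch that walks an already decoded content stream and records the color state each time a text-showing operator is hit. All names are made up for this example. It only understands the device color operators g/G, rg/RG and k/K, the Tr operator and q/Q; it ignores the general color space operators, graphics state parameter dictionaries, form XObjects, inline images and nested parentheses in strings, so treat it as a starting point rather than a parser.

```python
import re

# Literal strings (no nesting), hex strings, array brackets, everything else.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[\[\]]|[^\s\[\]()<>]+")
NONSTROKING = {"g": "DeviceGray", "rg": "DeviceRGB", "k": "DeviceCMYK"}
STROKING = {"G": "DeviceGray", "RG": "DeviceRGB", "K": "DeviceCMYK"}
SHOW_TEXT = {"Tj", "TJ", "'", '"'}


def is_operand(token):
    """Strings, names, array brackets and numbers are operands."""
    if token[0] in "(<[]/":
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False


def text_colours(content):
    """Return one snapshot of fill, stroke and mode per text-showing operator."""
    black = ("DeviceGray", (0.0,))  # default colour per section 9.2.3
    state = {"fill": black, "stroke": black, "mode": 0}
    saved, operands, painted = [], [], []
    for token in TOKEN.findall(content):
        if is_operand(token):
            operands.append(token)
            continue
        if token == "q":
            saved.append(dict(state))  # save the graphics state
        elif token == "Q" and saved:
            state = saved.pop()        # restore the graphics state
        elif token in NONSTROKING:
            state["fill"] = (NONSTROKING[token], tuple(float(x) for x in operands))
        elif token in STROKING:
            state["stroke"] = (STROKING[token], tuple(float(x) for x in operands))
        elif token == "Tr":
            state["mode"] = int(operands[-1])
        elif token in SHOW_TEXT:
            painted.append(dict(state))
        operands = []  # every operator consumes its operands
    return painted
```

Which of the two recorded colors actually applies depends on the text rendering mode: modes 0, 2, 4 and 6 fill, modes 1, 2, 5 and 6 stroke, mode 3 paints nothing, and mode 7 only adds the glyphs to the clipping path.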

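I read in another post on stackoverflow that I can just iterate over the font resources and if a resource has a '/FontFile'-key that means that the font is embedded. Is this correct?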

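I would advise a more precise analysis. There are other relevant keys as well, e.g. '/FontFile2' and '/FontFile3', and the correct one has to be used for the font type in question.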
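A sketch of such a check might look like the following. It works on plain Python dictionaries and assumes that all indirect references have already been resolved by whatever library you use; the function name is hypothetical. Note that the font file keys sit in the font descriptor rather than in the font dictionary itself, and that a composite (Type0) font keeps its descriptor in its descendant font.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def embedded_font_key(font):
    """Return the font file key found for this font dictionary, or None.

    font is assumed to be a plain dict with indirect references
    already resolved (an assumption of this sketch).
    """
    if font.get("/Subtype") == "/Type0":
        # Composite fonts: look into the descendant CIDFont(s).
        for descendant in font.get("/DescendantFonts", []):
            key = embedded_font_key(descendant)
            if key:
                return key
        return None
    descriptor = font.get("/FontDescriptor", {})
    for key in FONT_FILE_KEYS:
        if key in descriptor:
            return key
    return None
```

Returning the key instead of a plain boolean lets you cross-check it against the font's subtype. The sketch reports None for Type 3 fonts, whose glyphs are defined by content streams inside the PDF and which therefore need separate handling.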

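Don't underestimate your task... You should start by defining what the properties you are looking for actually mean in a mixed context of rotated, stretched and skewed glyphs, vector graphics and bitmap images, which is what PDF is.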
