
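I am running the example that ships with PDFBox to get the width and height of each TextPosition. When I pass in a one-page PDF, it gives me accurate results. With a multi-page PDF, however, I get incorrect heights.

Here is the experiment I ran: I took a 5-page PDF and passed it in as the argument (wrong height for every TextPosition). Next, I used MacOSX Preview to split the same PDF into 5 single-page PDFs and passed them in one at a time (I get the correct heights).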

package printtextlocations;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println(" String [x: " + text.getXDirAdj() + ", y: "
            + text.getY() + ", height:" + text.getHeightDir()
            + ", space: " + text.getWidthOfSpace() + ", width: "
            + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
            + text.getCharacter());
    }
}

Output snippet - 5-page PDF

String [x: 58.500004, y: 692.2, height:33.480003, space: 2.64, width: 6.635998, yScale: 12.0]6
String [x: 58.6, y: 741.2, height:33.480003, space: 2.64, width: 6.6360016, yScale: 12.0]1
String [x: 58.6, y: 753.4, height:33.480003, space: 2.64, width: 6.6360016, yScale: 12.0]2

Output snippet - 1-page PDF

String [x: 58.5, y: 692.2, height:5.55, space: 2.64, width: 6.6480026, yScale: 12.0]6
String [x: 58.6, y: 741.2, height:5.55, space: 2.64, width: 6.6480026, yScale: 12.0]1
String [x: 58.6, y: 753.4, height:5.55, space: 2.64, width: 6.6480026, yScale: 12.0]2

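Does anyone know why we get inconsistent results in this case? Is there a setting I am missing?

Thank you for your help.

Here is another test file, "Wrong height PDF - 3 pages", and this is the output I get: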

String [x: 90.0, y: 83.28003, height:33.480003, space: 5.8497605, width: 7.248001, yScale: 12.0]V
String [x: 97.242, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e
String [x: 103.095604, y: 83.28003, height:33.480003, space: 5.8497605, width: 4.9680023, yScale: 12.0]r
String [x: 108.0588, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.0479965, yScale: 12.0]y
String [x: 116.748, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.9520035, yScale: 12.0]S
String [x: 122.7012, y: 83.28003, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i
String [x: 126.034805, y: 83.28003, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m
String [x: 136.01881, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.671997, yScale: 12.0]p
String [x: 142.6932, y: 83.28003, height:33.480003, space: 5.8497605, width: 3.251999, yScale: 12.0]l
String [x: 145.9512, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e
String [x: 154.4472, y: 83.28003, height:33.480003, space: 5.8497605, width: 7.9440002, yScale: 12.0]D
String [x: 162.38641, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.371994, yScale: 12.0]o
String [x: 168.75601, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.2920074, yScale: 12.0]c
String [x: 174.0468, y: 83.28003, height:33.480003, space: 5.8497605, width: …, yScale: 12.0]u
String [x: 180.6732, y: 83.28003, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m
String [x: 190.6572, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e
String [x: 196.5108, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.695999, yScale: 12.0]n
String [x: 203.20801, y: 83.28003, height:33.480003, space: 5.8497605, width: 4.0559998, yScale: 12.0]t
done processing page 0
done add page 0
String [x: 90.0, y: 139.44, height:33.480003, space: 5.8497605, width: 6.816002, yScale: 12.0]P
String [x: 96.8148, y: 139.44, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]a
String [x: 102.6696, y: 139.44, height:33.480003, space: 5.8497605, width: 5.9280014, yScale: 12.0]g
String [x: 108.5964, y: 139.44, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e
String [x: 117.090004, y: 139.44, height:33.480003, space: 5.8497605, width: 6.6480026, yScale: 12.0]2
String [x: 126.375595, y: 139.44, height:33.480003, space: 5.8497605, width: 6.371994, yScale: 12.0]o
String [x: 132.7464, y: 139.44, height:33.480003, space: 5.8497605, width: 3.6360016, yScale: 12.0]f
String [x: 139.0312, y: 139.44, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m
String [x: 149.0152, y: 139.44, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i
String [x: 152.3488, y: 139.44, height:33.480003, space: 5.8497605, width: 6.695999, yScale: 12.0]n
String [x: 159.046, y: 139.44, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i
String [x: 162.37961, y: 139.44, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m
String [x: 172.3636, y: 139.44, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]a
String [x: 178.2232, y: 139.44, height:33.480003, space: 5.8497605, width: 3.251999, yScale: 12.0]l
String [x: 181.4812, y: 139.44, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i
String [x: 184.8148, y: 139.44, height:33.480003, space: 5.8497605, width: 5.1600037, yScale: 12.0]s
String [x: 189.9712, y: 139.44, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m
done processing page 1
done add page 1
String [x: 90.0, y: 266.15997, height:33.480003, space: 5.8497605, width: 6.816002, yScale: 12.0]P
String [x: 96.8148, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]a
String [x: 102.6696, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.9280014, yScale: 12.0]g
String [x: 108.5964, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e
String [x: 117.090004, y: 266.15997, height:33.480003, space: 5.8497605, width: 6.6480026, yScale: 12.0]3
String [x: 126.375595, y: 266.15997, height:33.480003, space: 5.8497605, width: 6.371994, yScale: 12.0]o
String [x: 132.7464, y: 266.15997, height:33.480003, space: 5.8497605, width: 7.548004, yScale: 12.0]K
String [x: 140.3052, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]a
String [x: 146.16, y: 266.15997, height:33.480003, space: 5.8497605, width: 6.048004, yScale: 12.0]y
String [x: 152.2068, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.0639954, yScale: 12.0]?
done processing page 2
done add page 2


2 Answers

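When determining the height of a parsed glyph (using the getFontHeight method of the font object in question), PDFBox first checks whether it has font metrics for the individual glyph at hand. The only metrics it knows here are AFM Type 1 font metrics; since your font is a TrueType font, PDFBox has no such metrics for it.

In that case it goes on to try to retrieve general font metrics from the font descriptor. The font descriptor of the font in your document looks like this: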

21 0 obj <<
    /Type /FontDescriptor
    /FontName /GLDXOZ+Cambria
    /Flags 4
    /FontBBox [-1475 -2463 2867 3117]
    /ItalicAngle 0
    /Ascent 950
    /Descent -222
    /CapHeight 667
    /StemV 0
    /XHeight 467
    /AvgWidth 615
    /MaxWidth 2919
    /FontFile2 24 0 R
>>
endobj

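The first descriptor entry it checks is the font bounding box (the /FontBBox entry); if it is present, PDFBox takes half of its height as the average font height.

In your case the font bounding box is very large compared with the glyphs in the font; vertically it runs from -2463 to 3117!

The cap height, on the other hand (the /CapHeight entry, the vertical coordinate of the top of flat capital letters, measured from the baseline), is only 667, and the ascent (the /Ascent entry, the maximum height above the baseline reached by glyphs in this font, excluding the height of glyphs for accented characters) is only 950. This really makes me wonder why that font has such a font bounding box...

Without a font bounding box, PDFBox would next try the cap height, then the ascent, and finally /XHeight - /Descent. Each of these would yield a sensible value, but because that bounding box is present, PDFBox assumes a value that is far too large.

The code in question carries this comment: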

// the following values are all more or less accurate
// at least all are average values. Maybe we'll find
// another way to get those value for every single glyph
// in the future if needed
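
To make the arithmetic concrete, here is a minimal sketch of the fallback order described above, using plain numbers instead of PDFBox objects. The class and method names are hypothetical and this is not the PDFBox source. The conversion to text space assumes the usual 1/1000 glyph space unit and uses the yScale of 12.0 from the question's output as the font size; both are assumptions.

```java
public class FontHeightSketch {

    /**
     * Average glyph height in glyph space units, following the fallback
     * order described in the answer: half the /FontBBox height, then
     * /CapHeight, then /Ascent, then /XHeight - /Descent.
     * A value of 0 is treated as "entry missing" (an assumption of this sketch).
     */
    public static float averageFontHeight(float[] fontBBox, float capHeight,
                                          float ascent, float xHeight, float descent) {
        float height = 0f;
        if (fontBBox != null) {
            // fontBBox = [llx, lly, urx, ury]; use half of the vertical extent
            height = (fontBBox[3] - fontBBox[1]) / 2f;
        }
        if (height == 0f) {
            height = capHeight;
        }
        if (height == 0f) {
            height = ascent;
        }
        if (height == 0f) {
            height = xHeight;
            if (height > 0f) {
                height -= descent; // descent is negative, so this adds its magnitude
            }
        }
        return height;
    }

    /** Scales glyph space units to text units, assuming 1000 units per em. */
    public static float toTextSpace(float glyphUnits, float fontSize) {
        return glyphUnits / 1000f * fontSize;
    }

    public static void main(String[] args) {
        // Values taken from the font descriptor quoted above
        float[] bbox = {-1475f, -2463f, 2867f, 3117f};
        float withBBox = averageFontHeight(bbox, 667f, 950f, 467f, -222f);
        float withoutBBox = averageFontHeight(null, 667f, 950f, 467f, -222f);

        System.out.println("with /FontBBox:    " + withBBox
                + " units, height " + toTextSpace(withBBox, 12f));
        System.out.println("without /FontBBox: " + withoutBBox
                + " units, height " + toTextSpace(withoutBBox, 12f));
    }
}
```

With the bounding box, the estimate is (3117 + 2463) / 2 = 2790 units, which at a scale of 12 comes out at about 33.48, the height reported for every glyph in the multi-page output. Without it, the cap height of 667 would give roughly 8.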

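While I don't know why PDFBox prefers to guess the average height from the bounding box rather than from, for example, the ascent, it is not the only software that assumes text in your font is tall. If you use Adobe Acrobat's TouchUp Text tool, for instance, you see:

[Image: the TouchUp tool in action]

The vertical line is the cursor! So Acrobat also thinks the font is tall.

Unfortunately, you did not provide the single-page PDFs created from your sample by splitting it with MacOSX Preview, so I cannot tell why you get more realistic values afterwards. Apparently Preview changes the font information in some way, because the cause of the huge height value has nothing to do with whether the document has several pages or only one.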

answered 2013-05-16T10:12:39.417

In PDFBox 2.0.24, TextPosition has two methods, getXScale() and getYScale(). They can be used to obtain the real size as rendered.

answered 2021-12-15T10:48:19.697