java - Accessing the "alternative text" of an image via PDFBox

Question

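Is there a way to extract the "alternative text" for a specific image using PDFBox?

I have a PDF file in which alternative text has been added to an image, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1. With PDFBox, I can find the image itself (a PDXObjectImage) through the object model via PDFDocument.getDocumentCatalog().getAllPages() [iterator] .getResources.getImages(), but I cannot see any way to get from the image itself to its alternative text.

A small sample PDF (with a single image that has some alternative text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (it should say "This is the alternative text for the image.").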

score 2 · Accepted Answer

I do not know how, or whether, this can be done with PDFBox, but I can tell you that this feature relates to the sections of the PDF specification called Logical Structure and Tagged PDF, which not every PDF tool out there fully supports.

Assuming it is supported by the tool you are using, you will have to follow 4 main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation).

Assuming you have access to the internal structure of the PDF file, you will need to:

1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.

Page content:

BT
/P <</MCID 0 >>BDC 
/GS0 gs
/TT0 1 Tf
0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
(This is an image test )Tj
EMC 
ET
/Figure <</MCID 1 >>BDC 
q
106.5 0 0 106.5 90 591.0599976 cm
/Im0 Do
Q
EMC
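
To make step 1 concrete, here is a minimal sketch in plain Java (no PDFBox) that scans an already decoded content stream for the marked-content sequence enclosing a given `Do` operator. The class name `McidFinder` and its method are made up for illustration. It assumes the MCID sits in an inline property dictionary, as in the stream above, and it ignores nested literal strings, nested dictionaries, and named property lists, so treat it as a rough scan rather than a real content-stream parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class McidFinder {
    // Literal strings such as "(This is an image test )" are blanked out first,
    // so operator-like text inside them is never matched. Nested parentheses
    // are not handled (simplification).
    private static final Pattern LITERAL_STRING =
        Pattern.compile("\\((?:\\\\.|[^\\\\()])*\\)");

    private static final Pattern TOKEN = Pattern.compile(
        "<<[^>]*?/MCID\\s+(\\d+)[^>]*?>>\\s*BDC\\b"   // group 1: BDC with an inline MCID
        + "|\\b(BDC|BMC)\\b"                          // group 2: marked content without inline MCID
        + "|\\b(EMC)\\b"                              // group 3: end of marked content
        + "|/([^\\s/<>\\[\\]()]+)\\s+Do\\b");         // group 4: name of the XObject being painted

    /** Returns the MCID of the innermost enclosing sequence that has one, or -1 if none. */
    static int findMcid(String content, String xObjectName) {
        String stripped = LITERAL_STRING.matcher(content).replaceAll("()");
        Deque<Integer> open = new ArrayDeque<>();   // head = innermost open sequence
        Matcher m = TOKEN.matcher(stripped);
        while (m.find()) {
            if (m.group(1) != null) {
                open.push(Integer.parseInt(m.group(1)));
            } else if (m.group(2) != null) {
                open.push(-1);                      // keeps BDC/BMC and EMC balanced
            } else if (m.group(3) != null) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (xObjectName.equals(m.group(4))) {
                for (int mcid : open) {             // iterates from innermost outward
                    if (mcid >= 0) {
                        return mcid;
                    }
                }
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String content = String.join("\n",
            "BT", "/P <</MCID 0 >>BDC", "(This is an image test )Tj", "EMC", "ET",
            "/Figure <</MCID 1 >>BDC", "q", "106.5 0 0 106.5 90 591.0599976 cm",
            "/Im0 Do", "Q", "EMC");
        System.out.println(findMcid(content, "Im0")); // prints 1
    }
}
```

For the sample stream, the image `/Im0` is painted inside the `/Figure` sequence, so the MCID to carry into the next steps is 1.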


2- In the page object, retrieve the key StructParents.

3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.

4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF Spec for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So you search this array for the element at the MCID value of your image, and you will find a PDF object. Inside this object, you will find the alternate text.
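
The lookup in steps 2 to 4 can be sketched with plain Java collections standing in for PDF objects. Everything here is illustrative: `ParentTreeLookup` is a made-up class, the sample data is hand-built rather than read from the PDF, and the StructParents value 0 is an assumption. The sketch also only handles a ParentTree whose root holds the pairs directly in one flat array, and only reads the alternate text from the element found at the MCID index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the ParentTree lookup; no PDF library involved.
final class ParentTreeLookup {

    /** Looks up a key in a flat number-tree array laid out as [key0, value0, key1, value1, ...]. */
    static Object lookupNums(List<?> nums, int key) {
        for (int i = 0; i + 1 < nums.size(); i += 2) {
            if (Integer.valueOf(key).equals(nums.get(i))) {
                return nums.get(i + 1);
            }
        }
        return null;
    }

    /** Returns the Alt entry of the structure element at index mcid, or null if absent. */
    static String altTextFor(List<?> nums, int structParents, int mcid) {
        Object entry = lookupNums(nums, structParents);
        if (!(entry instanceof List)) {
            return null;
        }
        List<?> elements = (List<?>) entry;   // one structure element per MCID on the page
        if (mcid < 0 || mcid >= elements.size()) {
            return null;
        }
        Object element = elements.get(mcid);
        if (!(element instanceof Map)) {
            return null;
        }
        Object alt = ((Map<?, ?>) element).get("Alt");
        return alt instanceof String ? (String) alt : null;
    }

    /** Hand-built stand-in for the sample file's ParentTree (assumed values). */
    static List<Object> sampleNums() {
        Map<String, String> paragraph = new HashMap<>();
        paragraph.put("S", "P");
        Map<String, String> figure = new HashMap<>();
        figure.put("S", "Figure");
        figure.put("Alt", "This is the alternative text for the image.");
        List<Object> pageElements = Arrays.<Object>asList(paragraph, figure);
        return Arrays.<Object>asList(0, pageElements); // key 0 = assumed StructParents
    }

    public static void main(String[] args) {
        // StructParents 0 (assumed), MCID 1 (the /Figure sequence from step 1)
        System.out.println(altTextFor(sampleNums(), 0, 1));
    }
}
```

With a real file, the maps and lists would be replaced by whatever dictionary and array types your PDF library exposes; the shape of the lookup stays the same.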


Looks easy, doesn't it?

Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer

score 0 · Accepted Answer

Eric from the PDFBox mailing list sent me the following, although I have not tested it yet...

Hi,

For your test file, here is a way to access the "/Alt" entry:

    PDDocument document = PDDocument.load("image_test_pass.pdf");
    PDStructureTreeRoot treeRoot =
        document.getDocumentCatalog().getStructureTreeRoot();

    // get page for each StructElement
    for (Object o : treeRoot.getKids()) {
        if (o instanceof PDStructureElement) {
            PDStructureElement structElement = (PDStructureElement)o;
            System.out.println(structElement.getAlternateDescription());
            PDPage page = structElement.getPage();
            if (page != null) {
                page.getResources().getImages();
            }
        }
    }

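See the PDF specification http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf, in particular §14.6, §14.7, §14.9.3 and §14.9.4, for all the rules for finding the "/Alt" entry. There seem to be several ways to define this information.

BR, Eric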
