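To determine the actual bounding box of the text of some marked content (as opposed to the bounding box given in a layout attribute of some structure element), you can use the PDFBox PDFMarkedContentExtractor and combine its results with the data from the PDF structure tree.
The following code does this and creates an output PDF in which the determined bounding boxes are outlined with colored rectangles: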
    PDDocument document = PDDocument.load(SOURCE);

    // Collect the marked content sequences of each page, indexed by MCID
    Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();
    for (PDPage page : document.getPages()) {
        PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
        extractor.processPage(page);

        Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
        markedContents.put(page, theseMarkedContents);
        for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
            addToMap(theseMarkedContents, markedContent);
        }
    }

    // Walk the structure tree and draw a rectangle for each structure element
    PDStructureNode root = document.getDocumentCatalog().getStructureTreeRoot();
    Map<PDPage, PDPageContentStream> visualizations = new HashMap<>();
    showStructure(document, root, markedContents, visualizations);
    for (PDPageContentStream canvas : visualizations.values())
        canvas.close();

    document.save(RESULT);
    document.close();
(from the VisualizeMarkedContent method visualize)
It uses the following helper method to recursively map the PDMarkedContent objects by their MCID:
    void addToMap(Map<Integer, PDMarkedContent> theseMarkedContents, PDMarkedContent markedContent) {
        theseMarkedContents.put(markedContent.getMCID(), markedContent);
        for (Object object : markedContent.getContents()) {
            if (object instanceof PDMarkedContent) {
                addToMap(theseMarkedContents, (PDMarkedContent)object);
            }
        }
    }
(VisualizeMarkedContent helper method)
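To try out the recursive indexing without a PDF at hand, here is a minimal sketch that does not depend on PDFBox. The Node class is a hypothetical stand-in for PDMarkedContent (just an MCID plus nested contents), and the class and method names are made up for illustration; only the flattening logic mirrors addToMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class McidIndexSketch {
    /** Hypothetical stand-in for PDMarkedContent: an MCID plus nested contents. */
    static final class Node {
        final int mcid;
        final List<Object> contents = new ArrayList<>();

        Node(int mcid, Object... contents) {
            this.mcid = mcid;
            for (Object o : contents) {
                this.contents.add(o);
            }
        }
    }

    /** Registers the node under its MCID, then descends into nested nodes. */
    static void addToMap(Map<Integer, Node> index, Node node) {
        index.put(node.mcid, node);
        for (Object object : node.contents) {
            if (object instanceof Node) {
                addToMap(index, (Node) object);
            }
        }
    }

    /** Builds a two-level example: MCID 0 contains text and a nested node with MCID 1. */
    static Map<Integer, Node> demoIndex() {
        Node nested = new Node(1, "inner text");
        Node outer = new Node(0, "outer text", nested);
        Map<Integer, Node> index = new HashMap<>();
        addToMap(index, outer);
        return index;
    }

    public static void main(String[] args) {
        // Both the outer and the nested node can be looked up directly by MCID
        System.out.println(demoIndex().keySet());
    }
}
```

The point of the flat map is that the structure tree refers to content by MCID only, so nested sequences must be reachable without knowing their parent.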
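The method showStructure recursively determines the bounding boxes of the structure elements and draws a rectangle for each of them.
A structure element can actually contain content on more than one page, so its boxes variable has to hold a mapping from pages to bounding boxes...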
    Map<PDPage, Rectangle2D> showStructure(PDDocument document, PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents, Map<PDPage, PDPageContentStream> visualizations) throws IOException {
        Map<PDPage, Rectangle2D> boxes = null;
        PDPage page = null;
        if (node instanceof PDStructureElement) {
            PDStructureElement element = (PDStructureElement) node;
            page = element.getPage();
        }
        Map<Integer, PDMarkedContent> theseMarkedContents = markedContents.get(page);
        for (Object object : node.getKids()) {
            if (object instanceof COSArray) {
                for (COSBase base : (COSArray) object) {
                    if (base instanceof COSDictionary) {
                        boxes = union(boxes, showStructure(document, PDStructureNode.create((COSDictionary) base), markedContents, visualizations));
                    } else if (base instanceof COSNumber) {
                        boxes = union(boxes, page, showContent(((COSNumber)base).intValue(), theseMarkedContents));
                    } else {
                        System.out.printf("?%s\n", base);
                    }
                }
            } else if (object instanceof PDStructureNode) {
                boxes = union(boxes, showStructure(document, (PDStructureNode) object, markedContents, visualizations));
            } else if (object instanceof Integer) {
                boxes = union(boxes, page, showContent((Integer)object, theseMarkedContents));
            } else {
                System.out.printf("?%s\n", object);
            }
        }

        if (boxes != null) {
            Color color = new Color((int)(Math.random() * 256), (int)(Math.random() * 256), (int)(Math.random() * 256));

            for (Map.Entry<PDPage, Rectangle2D> entry : boxes.entrySet()) {
                page = entry.getKey();
                Rectangle2D box = entry.getValue();
                if (box == null)
                    continue;

                PDPageContentStream canvas = visualizations.get(page);
                if (canvas == null) {
                    canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true);
                    visualizations.put(page, canvas);
                }
                canvas.saveGraphicsState();
                canvas.setStrokingColor(color);
                canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
                canvas.stroke();
                canvas.restoreGraphicsState();
            }
        }
        return boxes;
    }
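(VisualizeMarkedContent method)
The method showContent determines the bounding box of the text associated with a given MCID, recursing into nested marked content if necessary.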
    Rectangle2D showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) throws IOException {
        Rectangle2D box = null;
        PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
        List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
        StringBuilder textContent = new StringBuilder();
        for (Object object : contents) {
            if (object instanceof TextPosition) {
                TextPosition textPosition = (TextPosition)object;
                textContent.append(textPosition.getUnicode());

                int[] codes = textPosition.getCharacterCodes();
                if (codes.length != 1) {
                    System.out.printf("<!-- text position with unexpected number of codes: %d -->", codes.length);
                } else {
                    Shape glyphBounds = calculateGlyphBounds(textPosition.getTextMatrix(), textPosition.getFont(), codes[0]);
                    // calculateGlyphBounds returns null if no glyph path could be determined
                    if (glyphBounds != null) {
                        box = union(box, glyphBounds.getBounds2D());
                    }
                }
            } else if (object instanceof PDMarkedContent) {
                PDMarkedContent thisMarkedContent = (PDMarkedContent) object;
                box = union(box, showContent(thisMarkedContent.getMCID(), theseMarkedContents));
            } else {
                textContent.append("?" + object);
            }
        }
        return box;
    }
(VisualizeMarkedContent method)
The two preceding methods, showStructure and showContent, use the following helpers to build (page-wise) unions of bounding boxes:
    Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D>... maps) {
        Map<PDPage, Rectangle2D> result = null;
        for (Map<PDPage, Rectangle2D> map : maps) {
            if (map != null) {
                if (result != null) {
                    for (Map.Entry<PDPage, Rectangle2D> entry : map.entrySet()) {
                        PDPage page = entry.getKey();
                        Rectangle2D rectangle = union(result.get(page), entry.getValue());
                        if (rectangle != null)
                            result.put(page, rectangle);
                    }
                } else {
                    result = map;
                }
            }
        }
        return result;
    }

    Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D> map, PDPage page, Rectangle2D rectangle) {
        if (map == null)
            map = new HashMap<>();
        map.put(page, union(map.get(page), rectangle));
        return map;
    }

    Rectangle2D union(Rectangle2D... rectangles)
    {
        Rectangle2D box = null;
        for (Rectangle2D rectangle : rectangles) {
            if (rectangle != null) {
                if (box != null)
                    box.add(rectangle);
                else
                    box = rectangle;
            }
        }
        return box;
    }
(VisualizeMarkedContent helper methods)
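One detail to keep in mind: Rectangle2D.add modifies the rectangle it is called on, so the rectangle helper above enlarges the first non-null rectangle passed to it. In the code here this does not seem to cause problems, since each element's rectangles are drawn before its parent merges them, but if you reuse the helpers elsewhere, a non-mutating variant may be safer. The following sketch uses only the JDK; the String page key is a stand-in for PDPage, and the class name is made up for illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.Map;

class PageBoxUnionSketch {
    /** Returns a new rectangle covering all non-null inputs, or null if there are none. */
    static Rectangle2D union(Rectangle2D... rectangles) {
        Rectangle2D box = null;
        for (Rectangle2D rectangle : rectangles) {
            if (rectangle == null) {
                continue;
            }
            // createUnion returns a fresh rectangle and leaves both operands untouched
            box = (box == null) ? (Rectangle2D) rectangle.clone() : box.createUnion(rectangle);
        }
        return box;
    }

    /** Merges a rectangle into the per-page map; the String key stands in for PDPage. */
    static Map<String, Rectangle2D> union(Map<String, Rectangle2D> map, String page, Rectangle2D rectangle) {
        if (map == null) {
            map = new HashMap<>();
        }
        Rectangle2D merged = union(map.get(page), rectangle);
        if (merged != null) {
            map.put(page, merged);
        }
        return map;
    }

    public static void main(String[] args) {
        Rectangle2D a = new Rectangle2D.Double(10, 10, 20, 5);
        Rectangle2D b = new Rectangle2D.Double(25, 12, 30, 10);
        Map<String, Rectangle2D> boxes = union(null, "page1", a);
        boxes = union(boxes, "page1", b);
        boxes = union(boxes, "page2", b);
        System.out.println(boxes.get("page1")); // spans x 10..55 and y 10..22
        System.out.println(a);                  // still the original 20 x 5 rectangle
    }
}
```

Unlike the original helper, this variant never stores a null rectangle in the map, so callers would not need the null check before drawing.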
Finally, the method calculateGlyphBounds, borrowed from the PDFBox example DrawPrintTextLocations, calculates the bounding boxes of the individual glyphs:
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }
(VisualizeMarkedContent method)
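The chain of transformations in calculateGlyphBounds can be followed with plain java.awt.geom classes. The sketch below is only an illustration under stated assumptions: the glyph outline, the unitsPerEm value of 2048, the font matrix of 1/1000, and a text rendering matrix consisting of just a font size and a translation are all made-up inputs, where the real code takes them from PDFBox.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

class GlyphBoundsSketch {
    /**
     * Maps the bounds of a glyph outline given in font units to user space.
     * All parameters are illustrative stand-ins for values PDFBox would supply.
     */
    static Rectangle2D glyphBounds(GeneralPath outline, int unitsPerEm, double fontSize, double x, double y) {
        // Assumed text rendering matrix: scale by the font size, then move to (x, y)
        AffineTransform at = new AffineTransform(fontSize, 0, 0, fontSize, x, y);
        // Assumed font matrix: 1/1000, mapping glyph space to text space
        at.concatenate(AffineTransform.getScaleInstance(0.001, 0.001));
        // Normalize font units to the 1000-unit glyph space
        at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        // Transform the outline's bounding rectangle, as calculateGlyphBounds does
        Shape transformed = at.createTransformedShape(outline.getBounds2D());
        return transformed.getBounds2D();
    }

    /** A made-up rectangular "glyph": 1024 font units wide and 1536 font units high. */
    static GeneralPath sampleOutline() {
        GeneralPath path = new GeneralPath();
        path.moveTo(0, 0);
        path.lineTo(1024, 0);
        path.lineTo(1024, 1536);
        path.lineTo(0, 1536);
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        // Roughly a 6 x 9 box with its lower left corner at (100, 700)
        System.out.println(glyphBounds(sampleOutline(), 2048, 12, 100, 700));
    }
}
```

Because concatenate and scale both append on the right, the unitsPerEm scaling is applied to the outline first, then the font matrix, and the text rendering matrix last.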
The result for your sample document:
[Image: Page 1]
[Image: Page 2]
[Image: Page 3]