pdfbox - PDFBox PDFTextStripperByArea coordinate offset

Question

I've run into a coordinate problem. The PDFTextStripperByArea region seems to be shifted too far up.

Consider the following example snippet:

...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();

// define region for extraction -- the coordinates and dimensions are x, y, width, height
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);

// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right 
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height );
contentStream.close();

// extract the text from the defined region
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region"); 
... 
document.save(...); ...

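The cyan rectangle covers the desired area nicely. The stripper, on the other hand, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it, so its region appears to be shifted "up" (in terms of the y coordinate). What is going on?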

score 2 · Accepted Answer

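As Christian said in the comments, the problem is that the fillRect() method and PDFTextStripperByArea use different coordinate systems.

The first expects the origin to be at the bottom-left corner of the page, while the second expects it at the top-left corner.

So, to make it work, change the region passed to PDFTextStripperByArea to: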

Rectangle2D.Float region = new Rectangle2D.Float(x, ph - y - height, width, height);

where ph is the page height:

float ph = page.getMediaBox().getUpperRightY();
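
To make the arithmetic concrete, here is a minimal sketch of the conversion as a standalone helper. It uses only java.awt.geom, so it runs without PDFBox; the class and method names (RegionFlip, toTopLeftOrigin) are made up for illustration and are not part of the PDFBox API. It also assumes the media box starts at (0, 0) and the page is not rotated, which is what using getUpperRightY() as the page height implies.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: converts a rectangle given in
// bottom-left-origin coordinates (as used for fillRect) into the
// top-left-origin coordinates that the answer says PDFTextStripperByArea expects.
class RegionFlip {

    // pageHeight is assumed to be the media box height with its lower-left
    // corner at (0, 0), i.e. the "ph" value from the answer.
    static Rectangle2D.Float toTopLeftOrigin(Rectangle2D.Float r, float pageHeight) {
        // x, width and height are unchanged; only y is mirrored.
        // The top edge of the rectangle sits at r.y + r.height in bottom-left
        // coordinates, so its distance from the top of the page is
        // pageHeight - (r.y + r.height).
        return new Rectangle2D.Float(r.x, pageHeight - r.y - r.height, r.width, r.height);
    }

    public static void main(String[] args) {
        // Example values only: a 200 x 80 box whose bottom edge is 100 units
        // above the bottom of a 792-unit-high page.
        Rectangle2D.Float drawn = new Rectangle2D.Float(50f, 100f, 200f, 80f);
        Rectangle2D.Float forStripper = toTopLeftOrigin(drawn, 792f);
        System.out.println(forStripper.y); // 792 - 100 - 80 = 612.0
    }
}
```

With that, the rectangle passed to fillRect stays as it is, and the flipped one goes to stripper.addRegion(...). If the media box does not start at the origin, the value of ph would probably need adjusting, so treat this as a starting point rather than a general solution.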

PS: I know this is a very old question, but Google brought me here when I ran into the same problem, so I'm adding my answer.

score 1 · Accepted Answer

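Text is usually contained within a positioning rectangle. Sometimes the text is not where you would expect it within that rectangle, and PDFBox uses the rectangle to try to guess where the text is. So if the text starts outside the capture area and flows into it, it may not be extracted.

A rough sketch: the text box starts outside the capture area, but the text flows inside it. It may not be captured.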

____________
|Page      |
|   _______|
|   |Area ||
|   |     ||
| ..|.....||
| ⁞ |Text⁞||
| ⁞ |____⁞||
| ⁞......⁞ |
|__________|
