pdf - Paragraphs or text-position blocks in a PDF

Question

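I want to retrieve the rectangles that make up the paragraphs and/or text blocks on a PDF page.

I have looked at iTextSharp and DataLogics.

The best I have managed is to locate individual words. However, I need to know whether those words belong to the same text block.

I am using C#. Does anyone have any ideas?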

score 1 · Accepted Answer

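Unless it is a structured PDF, that information does not exist. A PDF is a set of drawString-style commands at given positions; there are no paragraph or space markers. You need to work this out from the text positions.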

score 1 · Accepted Answer

Extract the coordinates of every word on the page, then try to group them together.

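The first thing to do is group the words into lines. To do this, go through all the words, comparing each one with the words that follow it, and group together those whose y0 is less than the other's y1 and whose y1 is greater than the other's y0 (that is, whose vertical extents overlap). These are the lines.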
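The line-grouping step can be sketched as follows. This is only a minimal illustration: the `Word` class, its field names, and the choice to compare each word against the topmost word of the current line are assumptions made for the example, not part of any PDF library. It also assumes PDF-style coordinates, where y grows upward, so y0 is the bottom edge of a word and y1 the top edge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineGrouper {
    /** Hypothetical word type: text plus bounding box (y0 = bottom edge, y1 = top edge). */
    static final class Word {
        final String text;
        final double x0, y0, x1, y1;

        Word(String text, double x0, double y0, double x1, double y1) {
            this.text = text;
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }
    }

    /** The overlap test from the answer: the vertical extents of a and b intersect. */
    static boolean overlapsVertically(Word a, Word b) {
        return a.y0 < b.y1 && a.y1 > b.y0;
    }

    static List<List<Word>> groupIntoLines(List<Word> words) {
        List<Word> sorted = new ArrayList<>(words);
        // Visit words from the top of the page down (y grows upward in PDF space).
        sorted.sort(Comparator.comparingDouble((Word w) -> -w.y1));

        List<List<Word>> lines = new ArrayList<>();
        for (Word word : sorted) {
            List<Word> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            // Compare against the first (topmost) word of the current line.
            if (current != null && overlapsVertically(current.get(0), word)) {
                current.add(word);
            } else {
                List<Word> line = new ArrayList<>();
                line.add(word);
                lines.add(line);
            }
        }
        // Put the words of each line in reading order, left to right.
        for (List<Word> line : lines) {
            line.sort(Comparator.comparingDouble((Word w) -> w.x0));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Word> words = new ArrayList<>();
        words.add(new Word("world", 60, 700, 100, 712));
        words.add(new Word("Hello", 10, 701, 50, 713));
        words.add(new Word("Next", 10, 680, 40, 692));
        for (List<Word> line : groupIntoLines(words)) {
            StringBuilder sb = new StringBuilder();
            for (Word w : line) {
                sb.append(w.text).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Comparing against a single reference word per line keeps a line from drifting downward through a chain of partial overlaps. Superscripts or mixed font sizes may still need a looser test.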

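Then you need to group the lines into paragraphs. The x position of the start of a line should be within 1/25 of the page width of the start of the other line, and the distance between the y coordinates of the lines should be less than the line height. These are the paragraphs.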
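A rough sketch of the paragraph-grouping rule, under stated assumptions: the `Line` class is hypothetical, and "distance between the y coordinates" is interpreted here as the vertical gap between the bottom of the upper line and the top of the lower line. Other readings, such as baseline-to-baseline distance, would need a different threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ParagraphGrouper {
    /** Hypothetical line type: x0 = left edge, y0 = bottom edge, y1 = top edge. */
    static final class Line {
        final String text;
        final double x0, y0, y1;

        Line(String text, double x0, double y0, double y1) {
            this.text = text;
            this.x0 = x0; this.y0 = y0; this.y1 = y1;
        }

        double height() {
            return y1 - y0;
        }
    }

    /** True when the lower line continues the paragraph that the upper line belongs to. */
    static boolean sameParagraph(Line upper, Line lower, double pageWidth) {
        // Line starts must be within 1/25 of the page width of each other.
        boolean aligned = Math.abs(upper.x0 - lower.x0) <= pageWidth / 25.0;
        // Assumed reading of the rule: the gap between the lines is below one line height.
        double gap = upper.y0 - lower.y1;
        return aligned && gap < upper.height();
    }

    static List<List<Line>> groupIntoParagraphs(List<Line> lines, double pageWidth) {
        List<Line> sorted = new ArrayList<>(lines);
        // Top of the page first (y grows upward in PDF space).
        sorted.sort(Comparator.comparingDouble((Line l) -> -l.y1));

        List<List<Line>> paragraphs = new ArrayList<>();
        for (Line line : sorted) {
            List<Line> current = paragraphs.isEmpty() ? null : paragraphs.get(paragraphs.size() - 1);
            // Compare with the last line already placed in the current paragraph.
            if (current != null && sameParagraph(current.get(current.size() - 1), line, pageWidth)) {
                current.add(line);
            } else {
                List<Line> paragraph = new ArrayList<>();
                paragraph.add(line);
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("First paragraph, line 1", 72, 700, 712));
        lines.add(new Line("First paragraph, line 2", 72, 686, 698));
        lines.add(new Line("Second paragraph, line 1", 72, 650, 662));
        List<List<Line>> paragraphs = groupIntoParagraphs(lines, 612);
        System.out.println("paragraphs: " + paragraphs.size());
    }
}
```

The bounding rectangle of each paragraph then follows from the minimum and maximum coordinates of its lines. Multi-column layouts would need an extra split on x before this step, since the sketch assumes a single column.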

score 0 · Accepted Answer

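This is in Java, but it gets the content of a PDF page and then pulls a value out of that content by index.

I'm not sure, but you may be able to implement something similar in C#. Get the content and print it out.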

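//PdfReader and RandomAccessFileOrArray are iText classes (com.lowagie.text.pdf in iText 2.x, com.itextpdf.text.pdf in iText 5.x)
//ByteArrayOutputStream is java.io.ByteArrayOutputStream
//create a new reader from the source file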
PdfReader reader = new PdfReader(fileName);
//create the file array
RandomAccessFileOrArray raf = new RandomAccessFileOrArray(fileName);
//get the content of the pdf reader (which is the source file)
byte bContent [] = reader.getPageContent(1,raf);
ByteArrayOutputStream bs = new ByteArrayOutputStream();
bs.write(bContent);
//create a string of the contents of the page in order to get the data needed
String contentOf1099 = bs.toString();
if(debug)
{
    System.err.println("contentOf1099 = "+contentOf1099);
}
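//get the value based off an index: find the "155 664 Td" positioning operator,
//then take the 11 characters after the "(" that opens the string drawn at that position
int tdIndex = contentOf1099.indexOf("155 664 Td");
int openParen = contentOf1099.indexOf("(", tdIndex);
String value = contentOf1099.substring(openParen + 1, openParen + 12);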
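The same idea can be generalized from one hard-coded position to every simple `x y Td (text) Tj` sequence in the page content, which yields text fragments together with their positions. The sketch below is plain Java operating on a content string. The `TdTextFinder` and `Fragment` names and the sample stream are made up for illustration. It deliberately ignores `TJ` arrays, `Tm` and `cm` matrices, escape sequences, and font encodings, all of which real PDFs commonly use, and it treats the `Td` operands as the position, which only holds when each `Td` directly follows a fresh `BT`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TdTextFinder {
    /** Hypothetical result type: a string operand and the Td operands that precede it. */
    static final class Fragment {
        final double x, y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x; this.y = y; this.text = text;
        }
    }

    // Matches "<x> <y> Td (<string>) Tj"; the string may contain backslash escapes,
    // which are kept as-is rather than decoded.
    private static final Pattern TD_TJ = Pattern.compile(
            "(-?\\d+(?:\\.\\d+)?)\\s+(-?\\d+(?:\\.\\d+)?)\\s+Td\\s*"
            + "\\(((?:\\\\.|[^\\\\)])*)\\)\\s*Tj");

    static List<Fragment> find(String pageContent) {
        List<Fragment> fragments = new ArrayList<>();
        Matcher m = TD_TJ.matcher(pageContent);
        while (m.find()) {
            fragments.add(new Fragment(
                    Double.parseDouble(m.group(1)),
                    Double.parseDouble(m.group(2)),
                    m.group(3)));
        }
        return fragments;
    }

    public static void main(String[] args) {
        // Made-up content stream in the style the snippet above searches.
        String content = "BT /F1 12 Tf 155 664 Td (John Smith) Tj ET\n"
                + "BT /F1 12 Tf 72 640 Td (Box 2) Tj ET";
        for (Fragment f : find(content)) {
            System.out.println(f.x + ", " + f.y + ": " + f.text);
        }
    }
}
```

For anything beyond simple generated forms, a library-level text extraction API that tracks the full text matrix is likely a safer source of coordinates than pattern matching on the raw stream.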
