apache-tika - Tika returns incorrect text lines for PDFs with many tables

Question

I am using Tika to extract text from a PDF file that contains a lot of tables.

java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf

It returns some garbled text, and sometimes it drops the whitespace between two words. For example, it returns "qu inakli fmyathematical idea to the real world" instead of "Link math idea to the real world".

Is there any way to minimize this kind of error? Or is there another library I could use? Would it make sense to run OCR on this kind of PDF?

score 2 · Accepted Answer

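When using the PDFBox parser, try controlling the ordering: PDFTextStripper has a flag that controls the order of lines in the document. By default (in PDFBox) it is set to false (order is not preserved) for performance reasons, but Tika has changed its behavior between versions, turning this flag on in some releases and off in others.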

There are more details about this issue in my blog post, Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood).
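
To illustrate why the ordering flag matters, here is a toy sketch in plain Java. It is not PDFBox's actual implementation, which is considerably more involved; the class `SortByPositionSketch`, the `Chunk` type, and the coordinates are all hypothetical and exist only for this example. The idea is that a PDF may store text fragments in an arbitrary order (common with table cells), so joining them in stored order scrambles the line, while sorting by page position restores the reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: this does NOT reproduce PDFBox's real sorting logic.
class SortByPositionSketch {

    /** A hypothetical text fragment with its position (origin at the top-left of the page). */
    static final class Chunk {
        final String text;
        final double x; // distance from the left edge
        final double y; // distance from the top edge

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the fragments in the order given, i.e. the order they were stored in. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorts a copy top-to-bottom, then left-to-right, and joins the result. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                              .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }
}
```

With the fragments "idea to" at (80, 10), "Link math" at (0, 10) and "the real world" at (0, 30) supplied in that order, `inStreamOrder` yields "idea to Link math the real world", whereas `sortedByPosition` yields "Link math idea to the real world". Sorting costs extra work per page, which is consistent with the flag being off by default for performance reasons.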

score 2 · Accepted Answer

To get the text in the PDF to come out in the correct order, I had to set the SortByPosition flag to true... (tika-app-1.19.jar)

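    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    // 'is' is an already-open InputStream for the PDF file
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    PDFParser pdfParser = new PDFParser();

    PDFParserConfig config = pdfParser.getPDFParserConfig();
    config.setSortByPosition(true); // needed for text in correct order
    pdfParser.setPDFParserConfig(config);

    pdfParser.parse(is, handler, metadata, context);
    String text = handler.toString(); // the extracted body text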
