parsing - Excluding superscript when extracting text from PDF

Question

I have extracted text from a PDF line by line using PDFBox so that my algorithm can process it sentence by sentence.

I recognize sentences by looking for a period (.) followed by a word whose first letter is capitalized. The issue is that when a sentence ends with a word that carries a superscript, the extractor treats the superscript as a normal character and places it next to the period (.).

For example, when the expression "2 to the power 22" appears as the last word of a sentence, i.e. followed by a period, it is extracted as 2.22, which makes it difficult to identify the end of the sentence.
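To make the failure concrete, the heuristic described above can be sketched roughly as follows. This is only an illustration, not the actual code from my project: the class name `SentenceSplitSketch`, the method `split`, and the sample strings are all made up for the example, and the boundary rule is assumed to be "period, whitespace, uppercase letter".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "period followed by a capitalized word" heuristic.
class SentenceSplitSketch {
    // A period, then whitespace, then (lookahead) an uppercase letter.
    private static final Pattern BOUNDARY = Pattern.compile("\\.\\s+(?=\\p{Lu})");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        int start = 0;
        while (m.find()) {
            // Keep the period with the sentence it terminates.
            sentences.add(text.substring(start, m.start() + 1));
            start = m.end();
        }
        if (start < text.length()) {
            // Whatever remains after the last boundary is the final sentence.
            sentences.add(text.substring(start).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Normal case: prints [The size is small., Next sentence.]
        System.out.println(split("The size is small. Next sentence."));
        // Superscript glued after the period: the boundary is missed,
        // prints [The size is 2.22 Next sentence.]
        System.out.println(split("The size is 2.22 Next sentence."));
    }
}
```

In the second call the period is followed by the digits of the former superscript instead of whitespace, so the two sentences come back as one.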

Please suggest a way to get rid of the superscript, or a different logic for identifying the end of a sentence.

Thanks.

score 1 · Accepted Answer

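I am answering my own question, since others may find some guidance here.

I solved the issue based on @mkl's suggestion. After observing the results of getYScale() in PDFStreamEngine.java, I concluded that superscript glyphs have a Y scale below 8.9663. So I added a condition in PDFStreamEngine.java just before the TextPosition is created, which is then processed by PDFTextStripper.java. The code is as follows: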

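// Only pass on glyphs at or above the Y scale observed for body text; smaller ones (superscripts) are dropped.
if(textXctm.getYScale()>=8.9663) {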
    processTextPosition(
        new TextPosition(
            pageRotation,
            pageWidth,
            pageHeight,
            textMatrixStart,
            endXPosition,
            endYPosition,
            totalVerticalDisplacementDisp,
            widthText,
            spaceWidthDisp,
            c,
            codePoints,
            font,
            fontSizeText,
            (int)(fontSizeText * textMatrix.getXScale())
    ));
}

Let me know whether my approach has any flaws in eliminating only the superscripts. Thanks.
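For readers who want to try the idea without a PDF at hand, the condition above can be sketched on a simplified glyph model. Everything here is an assumption made for illustration: `Glyph` is a hypothetical stand-in and not a PDFBox class, the Y-scale values 10 and 6 are invented, and 8.9663 is simply the value observed for my document, so another PDF will likely need a different threshold. In PDFBox itself, a similar check could perhaps live in a PDFTextStripper subclass instead of a patched PDFStreamEngine, but verify that against your PDFBox version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, PDFBox-free sketch of filtering glyphs by vertical scale.
class SuperscriptFilterSketch {
    /** Stand-in for a positioned glyph; NOT a PDFBox type. */
    static final class Glyph {
        final String text;
        final float yScale;

        Glyph(String text, float yScale) {
            this.text = text;
            this.yScale = yScale;
        }
    }

    /** Concatenates only the glyphs whose Y scale is at least minYScale. */
    static String keepBodyText(List<Glyph> glyphs, float minYScale) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.yScale >= minYScale) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("2", 10f));    // base of the expression
        glyphs.add(new Glyph(".", 10f));    // sentence-ending period
        glyphs.add(new Glyph("22", 6f));    // superscript, smaller Y scale (made-up value)
        glyphs.add(new Glyph(" Next", 10f));
        // Prints "2. Next": the superscript is gone, so the period is
        // followed by whitespace and a capital letter again.
        System.out.println(keepBodyText(glyphs, 8.9663f));
    }
}
```

A fixed cutoff like this also drops any other small text, such as footnote markers or subscripts, which may or may not be what you want.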
