java - How to detect text blocks and columns in pdf with tess4j

Question

I'm new to Tesseract (tess4j), managed to used main features like reading the text or getting the words positions both from image or pdf, rotating etc..

I can't find, and not sure if it is possible to easily detect blocks of text (paragraphs or columns). Also, if there are some other blocks in pdf like images or something else, is it possible to get it somehow, or at least to get the position of the block (box).

score 2 · Accepted Answer

您可以使用TessBaseAPIGetComponentImages API 方法，如下：

Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, TRUE, null, null);

检查 Tess4J单元测试以获取完整示例。

score 1 · Accepted Answer

我已经接受了答案，但这是该答案的结果：

public Page recognizeTextBlocks(Path path) {
        log.info("TessBaseAPIGetComponentImages");
        File image = new File(path.toString());
        Leptonica leptInstance = Leptonica.INSTANCE;
        Pix pix = leptInstance.pixRead(image.getPath());
        Page blocks = new Page(pix.w,pix.h);        
        api.TessBaseAPIInit3(handle, datapath, language);
        api.TessBaseAPISetImage2(handle, pix);
        api.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
        PointerByReference pixa = null;
        PointerByReference blockids = null;
        Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, FALSE, pixa, blockids);
        int boxCount = leptInstance.boxaGetCount(boxes);
        for (int i = 0; i < boxCount; i++) {
            Box box = leptInstance.boxaGetBox(boxes, i, L_CLONE);
            if (box == null) {
                continue;
            }
            api.TessBaseAPISetRectangle(handle, box.x, box.y, box.w, box.h);
            Pointer utf8Text = api.TessBaseAPIGetUTF8Text(handle);
            String ocrResult = utf8Text.getString(0);
            Block block = null;
            if(ocrResult == null || (ocrResult.replace("\n", "").replace(" ","")).length() == 0){
                block = new ImageBlock(new Rectangle(box.x, box.y, box.w, box.h));
            }else{
                block = new TextBlock(new Rectangle(box.x, box.y, box.w, box.h), ocrResult); 
            }
            blocks.add(block);
            api.TessDeleteText(utf8Text);
            int conf = api.TessBaseAPIMeanTextConf(handle);
            log.debug(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s", i, box.x, box.y, box.w, box.h, conf, ocrResult));
        }

        //release Pix resource
        PointerByReference pRef = new PointerByReference();
        pRef.setValue(pix.getPointer());
        leptInstance.pixDestroy(pRef);

        return blocks;
    }

注意：类 Block、ImageBlock 和 TextBlock 来自我的项目，不是 tess4j 或 tesseract 的一部分

java - How to detect text blocks and columns in pdf with tess4j

2 回答 2

Related

Reference