tesseract - Matching Tesseract-OCR results to .uzn zones

Question

When using a .uzn file to specify zones, is there a way to match the text output to the .uzn zone the text came from?

score 1 · Accepted Answer

I found a way to do it: get the bounding box of each character, then use that information to work out which zone each character belongs to.

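#include <windows.h>  // RECT
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Load the .box file output by Tesseract-OCR and match each bounding box
// to a character from the text. This is not trivial, because whitespace
// is present in the text from the .txt file but missing from the .box
// file. utf8to16() is a UTF-8 to UTF-16 conversion helper that is not
// shown here.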
std::vector<RECT> loadBoxFile(const wchar_t *fileName, const std::wstring &text, int imageHeight)
{
    // Open file.
    std::ifstream st(fileName);
    if (!st.is_open())
        throw std::runtime_error("Could not open .box file.");

    std::string line;
    std::string symbolAscii;
    std::wstring symbol;
    RECT r = { -1, -1, -1, -1 };
    std::vector<RECT> ret;
    ret.resize(text.size(), r);
    size_t textPos = 0;
    while (std::getline(st, line)) {
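        // Parse a line. Each .box line is "symbol left bottom right top page",
        // with the origin at the bottom-left corner of the image.
        std::istringstream iss(line);
        if (!(iss >> symbolAscii >> r.left >> r.bottom >> r.right >> r.top))
            throw std::runtime_error("Could not parse .box file line.");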
        symbol = utf8to16(symbolAscii.c_str());

        // We don't try to get the bounding box for '~', because sometimes
        // there is a '~' in .box file that is not there in .txt file. It's
        // a bug in Tesseract-OCR as far as I know. This is a workaround 
        // for that case.
        if (L"~" == symbol)
            continue;

        // Now match the symbol for that line to a symbol in the .txt file.
        textPos = text.find(symbol, textPos);

        // If we couldn't match it then fail.
        if (text.npos == textPos)
            throw std::runtime_error(std::string() + "Could not match symbol \"" + symbolAscii + "\" from .box file to .txt file.");

        // Flip the y coordinates from the .box file's bottom-left origin to
        // a top-left origin, then write the bounding box into the array at
        // the indices matching the symbol in the .txt file.
        r.bottom = imageHeight - r.bottom;
        r.top = imageHeight - r.top;
        for (size_t ii = 0; ii < symbol.size(); ii++)
            ret[textPos + ii] = r;

        // Now advance textPos so the next search starts after the last
        // character of the symbol we just matched.
        textPos += symbol.size();
    }
    return ret;
}
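
The function above only produces one bounding box per character; the zone lookup itself is not shown. A minimal sketch of that remaining step follows. It assumes each .uzn line has the form "left top width height label" with a top-left origin, and that the character boxes are already in top-left image coordinates with top < bottom. Zone, Box, parseUzn, findZone and zonesForBoxes are hypothetical names made up for illustration (Box stands in for RECT so the sketch compiles without windows.h); none of them is part of Tesseract.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct Zone {
    int left, top, width, height; // assumed .uzn layout, top-left origin
    std::string label;
};

struct Box {
    int left, top, right, bottom; // top-left origin, top < bottom
};

// Parse .uzn content, assuming each line is "left top width height label".
// Lines that do not start with four integers are skipped.
std::vector<Zone> parseUzn(std::istream &in)
{
    std::vector<Zone> zones;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        Zone z;
        if (!(iss >> z.left >> z.top >> z.width >> z.height))
            continue;
        iss >> z.label; // the label is treated as optional here
        zones.push_back(z);
    }
    return zones;
}

// Return the index of the first zone that contains the center of the box,
// or -1 if no zone contains it or the box is the { -1, -1, -1, -1 }
// placeholder left for characters (such as whitespace) with no box.
int findZone(const Box &b, const std::vector<Zone> &zones)
{
    if (b.left < 0 || b.top < 0)
        return -1;
    const int cx = (b.left + b.right) / 2;
    const int cy = (b.top + b.bottom) / 2;
    for (std::size_t i = 0; i < zones.size(); ++i) {
        const Zone &z = zones[i];
        if (cx >= z.left && cx < z.left + z.width &&
            cy >= z.top && cy < z.top + z.height)
            return static_cast<int>(i);
    }
    return -1;
}

// Map every character box to a zone index (or -1), keeping the same
// indexing as the text the boxes were matched against.
std::vector<int> zonesForBoxes(const std::vector<Box> &boxes,
                               const std::vector<Zone> &zones)
{
    std::vector<int> result;
    result.reserve(boxes.size());
    for (const Box &b : boxes)
        result.push_back(findZone(b, zones));
    return result;
}
```

Testing the box center rather than requiring full containment is a design choice in this sketch: it tolerates character boxes that stick out slightly past a zone edge. Characters that map to -1 can be given the zone of their nearest neighbor in the text if every character needs a zone.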
