python - Accessing confidence in python-tesseract

Question

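I am trying to build an OCR extension for python-tesseract that specifically handles data tables with internal structure (for example, rows and columns containing subtotals and totals, which lets the user improve accuracy by enforcing that structure).

I am trying to access the confidence that tesseract assigns to multiple results (for example, all results from an unconstrained run, as well as all results from a run whose characters are restricted to [0-9\.]).

I have seen some information about accessing the x_wconf attribute from the GetHOCRText API method, but I cannot figure out how to access it from the Python API. How do you call/access this value? Thanks!

I am using python-tesseract 0.9.1 on OSX 10.10.3 with Python 2.7.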

score 0 · Accepted Answer

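Edit

Actually, I was wrong: I was thinking of pytesseract, not python-tesseract.

If you look at the API source (baseapi_mini.h), you will see that some functions sound like a very good fit for what you are trying to do. The part you are interested in starts at around line 500.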

  char* GetUTF8Text();

  /**
   * Make a HTML-formatted string with hOCR markup from the internal
   * data structures.
   * page_number is 0-based but will appear in the output as 1-based.
   */
  char* GetHOCRText(int page_number);
  /**
   * The recognized text is returned as a char* which is coded in the same
   * format as a box file used in training. Returned string must be freed with
   * the delete [] operator.
   * Constructs coordinates in the original image - not just the rectangle.
   * page_number is a 0-based page index that will appear in the box file.
   */
  char* GetBoxText(int page_number);
  /**
   * The recognized text is returned as a char* which is coded
   * as UNLV format Latin-1 with specific reject and suspect codes
   * and must be freed with the delete [] operator.
   */
  char* GetUNLVText();
  /** Returns the (average) confidence value between 0 and 100. */
  int MeanTextConf();
  /**
   * Returns all word confidences (between 0 and 100) in an array, terminated
   * by -1.  The calling function must delete [] after use.
   * The number of confidences should correspond to the number of space-
   * delimited words in GetUTF8Text.
   */
  int* AllWordConfidences();

  /**
   * Applies the given word to the adaptive classifier if possible.
   * The word must be SPACE-DELIMITED UTF-8 - l i k e t h i s , so it can
   * tell the boundaries of the graphemes.
   * Assumes that SetImage/SetRectangle have been used to set the image
   * to the given word. The mode arg should be PSM_SINGLE_WORD or
   * PSM_CIRCLE_WORD, as that will be used to control layout analysis.
   * The currently set PageSegMode is preserved.
   * Returns false if adaption was not possible for some reason.
   */

https://bitbucket.org/3togo/python-tesseract/src/9ce0abe168297513d648406be5482b52d38d883b/src/baseapi_mini.h?at=master
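Since the question mentions x_wconf: in the hOCR output I have seen from Tesseract, each word is an ocrx_word span whose title attribute carries the bounding box and an x_wconf value. Here is a minimal sketch (Python 3, standard library only) of pulling those values out of an hOCR string you already have. The names HocrWordParser and word_confidences and the sample markup are made up for illustration; they are not part of python-tesseract.

```python
import re
from html.parser import HTMLParser

# x_wconf appears inside the title attribute, e.g. "bbox 36 92 96 116; x_wconf 87".
X_WCONF = re.compile(r"x_wconf\s+(-?\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, confidence) for every ocrx_word span in an hOCR string."""

    def __init__(self):
        super().__init__()
        self.words = []        # finished (text, confidence) pairs
        self._in_word = False  # True while inside an ocrx_word span
        self._conf = None      # confidence of the span being read
        self._text = []        # text fragments of the span being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "ocrx_word":
            match = X_WCONF.search(attributes.get("title") or "")
            self._conf = int(match.group(1)) if match else None
            self._text = []
            self._in_word = True

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Word spans do not contain nested spans, so the first closing
        # span tag seen while inside a word ends that word.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._conf))
            self._in_word = False


def word_confidences(hocr):
    """Return a list of (word, confidence) pairs; confidence is None if absent."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in the shape of Tesseract's hOCR output (an assumption).
sample = (
    "<span class='ocr_line' id='line_1_1' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 87'>Total</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 109 92 181 116; x_wconf 62'><strong>12.50</strong></span>"
    "</span>"
)
print(word_confidences(sample))  # [('Total', 87), ('12.50', 62)]
```

With pairs like these from an unconstrained run and a digits-only run, you could compare the two confidences per word and keep the higher one.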

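My original answer

To do this, you will have to write your own wrapper.

python-tesseract is nice because it gets you up and running quickly, but it is not what I would call sophisticated. You can read the source and see how it works, but here is the synopsis: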

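Write the input image to a temporary file
Call the tesseract command on that file (from the command line)
Return the result

So if you want to do anything special, this simply will not work.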
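To make the synopsis concrete, here is a rough sketch of that three-step pattern in Python 3. It is not the wrapper's actual source; the function names tesseract_command and ocr_via_cli are hypothetical, and it assumes a tesseract executable on your PATH that writes its plain-text result to `<outputbase>.txt`.

```python
import os
import shutil
import subprocess
import tempfile


def tesseract_command(image_path, output_base, extra_args=()):
    """Build the argument list: tesseract <image> <outputbase> [extra args]."""
    return ["tesseract", image_path, output_base] + list(extra_args)


def ocr_via_cli(image_bytes, suffix=".png", extra_args=()):
    """Run the tesseract executable on encoded image bytes and return its text."""
    workdir = tempfile.mkdtemp()
    try:
        image_path = os.path.join(workdir, "input" + suffix)
        output_base = os.path.join(workdir, "output")

        # Step 1: write the input image to a temporary file.
        with open(image_path, "wb") as handle:
            handle.write(image_bytes)

        # Step 2: call the tesseract command on that file.
        subprocess.check_call(tesseract_command(image_path, output_base, extra_args))

        # Step 3: return the result, which tesseract wrote to <outputbase>.txt.
        with open(output_base + ".txt", "rb") as handle:
            return handle.read().decode("utf-8")
    finally:
        # Remove the temporary directory whether or not OCR succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

Every call pays for a disk write and a fresh process start, and the plain-text result carries no confidence values, which is why this pattern gets in the way here.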

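I had an application where I needed high performance, and the time spent waiting for the file to be written to disk, for tesseract to start up, load the image, process it, and so on was just too much.

If I remember correctly (I no longer have access to the source), I used ctypes to load a tesseract process, set the image data, and then call the GetHOCRText method. Then, when I needed to process another image, I did not have to wait for tesseract to load again; I just set the image data and called GetHOCRText again.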
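I cannot reproduce that code, but a sketch of the idea might look like the following (Python 3). It assumes your libtesseract build exports the C API from capi.h (TessBaseAPICreate, TessBaseAPIInit3, TessBaseAPISetImage, TessBaseAPIGetHOCRText, TessBaseAPIAllWordConfidences, TessDeleteText, TessDeleteIntArray) and that the image is already raw, uncompressed pixel data. The helper names load_tesseract, recognize and read_confidences are made up for illustration, and I have not run this against python-tesseract 0.9.1.

```python
import ctypes
import ctypes.util


def read_confidences(pointer):
    """Copy a C int array terminated by -1 into a Python list."""
    values = []
    index = 0
    while pointer[index] != -1:
        values.append(pointer[index])
        index += 1
    return values


def load_tesseract(library_path=None):
    """Load libtesseract once and declare the C API signatures used below."""
    path = library_path or ctypes.util.find_library("tesseract")
    if path is None:
        raise OSError("libtesseract not found")
    lib = ctypes.CDLL(path)
    int_p = ctypes.POINTER(ctypes.c_int)
    lib.TessBaseAPICreate.restype = ctypes.c_void_p
    lib.TessBaseAPIInit3.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p]
    lib.TessBaseAPIInit3.restype = ctypes.c_int
    lib.TessBaseAPISetImage.argtypes = [ctypes.c_void_p, ctypes.c_char_p] + [ctypes.c_int] * 4
    lib.TessBaseAPISetImage.restype = None
    lib.TessBaseAPIGetHOCRText.argtypes = [ctypes.c_void_p, ctypes.c_int]
    lib.TessBaseAPIGetHOCRText.restype = ctypes.c_void_p  # raw pointer, freed by hand
    lib.TessBaseAPIAllWordConfidences.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIAllWordConfidences.restype = int_p
    lib.TessDeleteText.argtypes = [ctypes.c_void_p]
    lib.TessDeleteText.restype = None
    lib.TessDeleteIntArray.argtypes = [int_p]
    lib.TessDeleteIntArray.restype = None
    return lib


def recognize(lib, api, pixels, width, height, bytes_per_pixel):
    """Set new image data on an initialised API handle; return (hocr, confidences)."""
    lib.TessBaseAPISetImage(api, pixels, width, height,
                            bytes_per_pixel, width * bytes_per_pixel)
    raw = lib.TessBaseAPIGetHOCRText(api, 0)
    if not raw:
        raise RuntimeError("recognition failed")
    try:
        hocr = ctypes.string_at(raw).decode("utf-8")
    finally:
        lib.TessDeleteText(raw)  # the caller owns the returned string
    conf_ptr = lib.TessBaseAPIAllWordConfidences(api)
    if not conf_ptr:
        return hocr, []
    try:
        return hocr, read_confidences(conf_ptr)
    finally:
        lib.TessDeleteIntArray(conf_ptr)  # the caller owns the returned array
```

The handle would be created once with `api = lib.TessBaseAPICreate()` followed by `lib.TessBaseAPIInit3(api, None, b"eng")` (which returns 0 on success), and then `recognize` is called for each image, so the engine and language data are loaded only once.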

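So this is not an exact solution to your problem, and it is definitely not a code snippet you can drop in. But hopefully it helps you make some progress toward your goal.

Here is another question about wrapping external libraries: Wrapping a C library in Python: C, Cython or ctypes?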
