android - Android OCR: detecting digits only using the popular Tesseract fork tess-two

Question

I'm using tess-two, the popular Tesseract OCR fork for Android: https://github.com/rmtheis/tess-two. I integrated everything, it works, and so on...

But I only need to detect digits. My current code is:

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(pathToLngFile, langName);
baseApi.setImage(bitmap);
String recognizedText = baseApi.getUTF8Text();
baseApi.end();
doSomething(recognizedText);

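From here: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits?

I'm using V3, and for that version the FAQ has no code solution, only some command-line solutions, which are not relevant to an Android project (I think...). So I tried to implement the solution for versions < V3 and added the following line: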

baseApi.setVariable("tessedit_char_whitelist", "0123456789");

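My problem is what to do about init(). I don't need any language, but I still need to initialize, and there is no init() method that takes no arguments...

Edit: to be more specific

My end goal is a plain document (not a pure Excel table) that looks like the attached picture (a header and 3 columns separated by spaces).

My requirement is to make sense of the digits: to be able to separate them and determine which digits belong to which row and which column.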
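For illustration, the row and column separation could be sketched as plain text post-processing. This assumes the string returned by getUTF8Text() contains one text line per table row, with columns separated by whitespace; real OCR output may not be that clean. NumberGrid and parse are hypothetical names, not part of tess-two.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of tess-two): splits OCR text into rows and columns. */
final class NumberGrid {

    private NumberGrid() {}

    /**
     * Assumption: each table row is one text line and columns are separated
     * by whitespace. Lines without any digit (blank lines, a text-only
     * header) are skipped.
     */
    static List<List<String>> parse(String recognizedText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : recognizedText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.matches(".*\\d.*")) {
                continue; // nothing numeric on this line
            }
            List<String> columns = new ArrayList<>();
            for (String token : trimmed.split("\\s+")) {
                columns.add(token);
            }
            rows.add(columns);
        }
        return rows;
    }
}
```

With this, NumberGrid.parse(recognizedText).get(r).get(c) is the token in row r and column c, both zero-based. An empty cell would shift the column indices of that row, so position information from the OCR engine would be a more robust basis than whitespace alone.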

Thanks,

score 6 · Accepted Answer

I did it a bit differently. Maybe it will be useful to someone.

So you need to initialize the API first.

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);

Then set the following variables:

baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");

This way, the engine will only look for digits.
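Because the whitelist above also lets "." and "," through, the recognized string can contain tokens such as 1,234.56. A minimal sketch of turning such tokens into numeric values follows. It assumes that "," is a thousands separator and "." the decimal point, which is a locale assumption and not something stated in this answer; NumberTokens and parseAll are hypothetical names.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: converts whitespace-separated OCR tokens to numbers. */
final class NumberTokens {

    private NumberTokens() {}

    /**
     * Assumption: ',' is a thousands separator and '.' is the decimal point.
     * Tokens that are not valid numbers after removing ',' are skipped.
     */
    static List<BigDecimal> parseAll(String recognizedText) {
        List<BigDecimal> values = new ArrayList<>();
        for (String token : recognizedText.trim().split("\\s+")) {
            String normalized = token.replace(",", "");
            if (normalized.isEmpty()) {
                continue; // empty input or a token made only of commas
            }
            try {
                values.add(new BigDecimal(normalized));
            } catch (NumberFormatException e) {
                // e.g. a lone "." or "1.2.3": not a number, ignore it
            }
        }
        return values;
    }
}
```

BigDecimal is used instead of double so that a value like 3.50 keeps the digits exactly as they were recognized.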

score 3 · Accepted Answer

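I wanted to do the same thing, and after some research I decided to capture all text and digits and then keep only the digits. This worked for me:

// Replaces every run of characters other than 0-9 with a single space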
recognizedText = recognizedText.replaceAll("[^0-9]+", " ");

Now you can do whatever you want with the numbers.

For example, I use this code to separate all the numbers into a String array and display them in a TextView:

String[] justnumbers = recognizedText.trim().split(" "); // trims leading/trailing spaces and splits the numbers
YourTextView.setText(Arrays.toString(justnumbers).replaceAll("\\[|\\]", "")); // sets the numbers into the TextView and removes the "[]" from the array's string form
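The same filtering can also be sketched as a self-contained method that collects digit runs directly with a regex matcher. This is an alternative to the replaceAll/split approach above, not a description of it; DigitExtractor and extractNumbers are hypothetical names. One difference: for text without any digits, the split-based version yields an array holding a single empty string, while this version returns an empty list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls every run of digits out of recognized text. */
final class DigitExtractor {

    private static final Pattern DIGIT_RUN = Pattern.compile("[0-9]+");

    private DigitExtractor() {}

    /** Returns the digit runs in order of appearance; empty list if none. */
    static List<String> extractNumbers(String recognizedText) {
        List<String> numbers = new ArrayList<>();
        Matcher matcher = DIGIT_RUN.matcher(recognizedText);
        while (matcher.find()) {
            numbers.add(matcher.group());
        }
        return numbers;
    }
}
```

For display, String.join(", ", DigitExtractor.extractNumbers(recognizedText)) produces the comma-separated string without the bracket-stripping step. Like the original, this treats "." as a separator, so 3.50 comes back as 3 and 50.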

You can see it working here.

Hope this helps.
