pdf - Tess4J - Native library (linux-x86-64/libtesseract.so) not found in resource path

Question

I am using Tess4J (a JNA wrapper around Tesseract) and trying to call tess.doOCR(myFile) to OCR the text from a single-page PDF.

I installed GhostScript (using yum install ghostscript), and gs -h works fine.

My application server uses a 64-bit JVM, and I have gsdll64.dll and the 64-bit Tesseract DLLs, liblept168.dll and libtesseract302.dll, on the classpath.

When tess.doOCR(myFile) is called, the following is logged:

GPL Ghostscript 8.70 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1

But it just stops there. The program does not proceed any further.

Update -

It looks like the real problem comes from this error:

java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path

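After looking around, I have not found a convenient place to get this libtesseract.so file, and I am not sure how to get it onto my Linux application server. I read that I might need to download some C++ runtime, but I did not see a Linux download. Any suggestions would be greatly appreciated.

Or does this have something to do with symbolic links?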

score 3 · Accepted Answer

The fix was simple for me: just run sudo apt-get install tesseract-ocr from the command line. On Linux you do not need to worry about DLL libraries or the JVM version. Installing Tesseract via apt-get is enough.

score 0 · Accepted Answer

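Those DLLs are for Windows. For Linux, you need to install Tesseract or build it from source.

That GS version, 8.70, is quite old. The latest Ghost4J library used by Tess4J is not compatible with it.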
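To see whether a native Tesseract library is already present on a Linux host, a check along these lines may help. It is only a sketch using the plain JDK: the class name is made up for illustration, and the directories searched in main are assumed typical locations that differ between distributions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
class FindTesseractLib {

    // Returns every file in the given directories whose name starts with "libtesseract.so".
    static List<Path> findLibs(List<Path> dirs) throws IOException {
        List<Path> found = new ArrayList<>();
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue; // skip directories that do not exist on this machine
            }
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "libtesseract.so*")) {
                for (Path p : stream) {
                    found.add(p);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumed library locations; adjust for your distribution.
        List<Path> dirs = Arrays.asList(
                Paths.get("/usr/lib"),
                Paths.get("/usr/lib64"),
                Paths.get("/usr/local/lib"),
                Paths.get("/usr/lib/x86_64-linux-gnu"));
        List<Path> libs = findLibs(dirs);
        if (libs.isEmpty()) {
            System.out.println("No libtesseract.so* found in the checked directories");
        } else {
            for (Path lib : libs) {
                System.out.println(lib);
            }
        }
    }
}
```

If the library turns up in a directory that JNA does not search by default, JNA can be pointed at it with the jna.library.path system property.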

score 0 · Accepted Answer

Tess4J should include the required libraries. However, you need to extract them first.

This should do the trick:

File tmpFolder = LoadLibs.extractTessResources("win32-x86-64"); // replace platform
System.setProperty("java.library.path", tmpFolder.getPath());

You should replace the argument of extractTessResources(..) with your platform. You can find possible options by looking into the Tess4J jar file.
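One way to look into the jar without unpacking it by hand is to list its top-level folders. The following is a minimal sketch using only java.util.jar; the class name is invented for illustration and the jar path is whatever your Tess4J jar happens to be called. The output will also contain non-platform folders such as META-INF and Java package roots, so the platform names (like win32-x86-64 above) have to be picked out by eye.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Tess4J.
class ListJarPlatforms {

    // Collects the names of all top-level directories inside a jar file.
    static Set<String> topLevelDirs(String jarPath) throws IOException {
        Set<String> dirs = new TreeSet<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                int slash = name.indexOf('/');
                if (slash > 0) {
                    // Keep only the first path segment, e.g. "win32-x86-64".
                    dirs.add(name.substring(0, slash));
                }
            }
        }
        return dirs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ListJarPlatforms <path-to-tess4j-jar>");
            return;
        }
        for (String dir : topLevelDirs(args[0])) {
            System.out.println(dir);
        }
    }
}
```

Whichever folder name matches your operating system and architecture is the value to pass to extractTessResources(..).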

This way you do not need to install Tesseract on your system.

Recently I wrote a blog post about Tess4J in which I used this technique. Maybe it can help if you need further information or a running example project.

score 0 · Accepted Answer

sudo apt-get update
sudo apt-get install tesseract-ocr

Download the tessdata (trained language data) with git:

https://github.com/tesseract-ocr/tessdata
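After cloning, it can be worth confirming that the folder you intend to hand to Tess4J (via setDatapath) actually contains the language file you need. A small sketch, assuming the usual <lang>.traineddata naming used in that repository; the class name and the default folder name "tessdata" are placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J.
class CheckTessdata {

    // True if the directory contains a regular file named "<lang>.traineddata".
    static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Defaults are assumptions: a "tessdata" folder in the working directory and English.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String lang = args.length > 1 ? args[1] : "eng";
        System.out.println(lang + ".traineddata present in " + dir + ": " + hasLanguage(dir, lang));
    }
}
```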
