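I cannot figure out how parsers get loaded into Tika. According to the documentation, tika-app appears to come prepackaged with the parsers (https://tika.apache.org/1.17/gettingstarted.html). When I run this command to start the server, I get the output below and the health check never passes: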

    ./.java-buildpack/open_jdk_jre/bin/java -jar ./lib/tika-app-1.24.1.jar -s --port ${PORT}

    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR Nov 02, 2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR for optional dependencies.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Nov 02, 2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR WARNING: org.xerial's sqlite-jdbc is not loaded.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Please provide the jar on your classpath to parse sqlite files.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR See tika-parsers/pom.xml for the correct version.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] OUT Successfully started tika-app's server on port: 8080
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR WARNING: The server option in tika-app is deprecated and will be removed
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR by Tika 2.0 if not shortly after Tika 1.14.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR Please migrate to the JAX-RS tika-server package.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR See https://wiki.apache.org/tika/TikaJAXRS for usage.
    2020-11-02T13:31:25.66-0600 [HEALTH/0] ERR Failed to make HTTP request to '/version' on port 8080: timed out after 1.00 seconds
    2020-11-02T13:31:25.66-0600 [CELL/0] ERR Timed out after 1m0s: health check never passed.

I have the latest Tika version, 1.24.1. The documentation mentions downloading tika-server and passing a classpath at runtime that points to tika-parsers.jar (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-ParsersMissing), but I cannot find a parsers.jar file anywhere. I am running it with openjdk-jre-1.8.0.

1 Answer

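The parsers should be bundled by default. Tika App in server mode (-s) is a raw socket-based server, not an HTTP server. You can confirm it is working by sending it a file with netcat and checking whether you get a response: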

    nc localhost 8080 -q2 < test.pdf

To use it from Python, you would need to write custom code that opens a socket, sends the input, sends a SHUT_WR, and then reads back the output.
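A minimal sketch of that socket exchange, using only the Python standard library. The function name `extract_via_socket`, the default host and port, and the timeout are my own assumptions for illustration; the only steps taken from the description above are connect, send, SHUT_WR, and read until the server closes the connection.

```python
import socket


def extract_via_socket(data: bytes, host: str = "localhost",
                       port: int = 8080, timeout: float = 30.0) -> bytes:
    """Send raw document bytes to a socket server and return its full reply.

    Hypothetical helper: it assumes the server reads input until EOF and
    then writes its result before closing the connection.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(data)
        # Half-close the connection: tells the server the input is complete
        # while leaving the read side open for the response.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read means the server closed its side
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

With a server running as in the question, something like `extract_via_socket(open("test.pdf", "rb").read())` should return whatever bytes the server writes back; the exact output format depends on how the server was started.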

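If you are using the tika-python library, it expects the Tika server from the tika-server JAR, not the tika-app JAR. It has some helper settings so you can point it at the JAR, or you can host your own instance (self-run or Docker) and give it the URL.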
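For comparison, here is a rough sketch of what talking to a self-hosted tika-server instance over HTTP could look like without any client library. It assumes the server exposes its REST endpoint `PUT /tika` and listens on `http://localhost:9998`, the usual tika-server default; the function name `extract_text` and the header choices are my own, so adjust them to your deployment.

```python
import urllib.request


def extract_text(data: bytes, server_url: str = "http://localhost:9998",
                 timeout: float = 60.0) -> str:
    """PUT document bytes to a tika-server style /tika endpoint.

    Hypothetical helper; server_url is an assumption and must point at a
    running tika-server (not tika-app -s, which does not speak HTTP).
    """
    request = urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={
            # Ask for plain text and let the server detect the file type.
            "Accept": "text/plain",
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

This also hints at why the health check in the question times out: it makes an HTTP request to `/version`, which an HTTP-based tika-server can answer but a raw socket server cannot.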

Answered 2020-11-10T09:00:10.957