java - ContentExtraction of PDF file in solr using Apache Tika

Question

I am trying to index a PDF file in Solr by following this tutorial: http://wiki.apache.org/solr/ExtractingRequestHandler. But every time I run the command

java -jar post.jar *.pdf

it fails with org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3. Kindly help me index the PDF on the Solr server. Is there any integration other than Tika that could help me?

score 3 · Accepted Answer

Post.jar is just a utility for uploading files to Solr.
Solr uses the Extract handler for this, so you need to supply that handler as the url. For example:

java -Durl=http://localhost:8983/solr/update/extract?literal.id=1 -Dtype=application/pdf -jar post.jar 1.pdf

For encrypted files, check the link.
For password-protected files, check the link.

score 0 · Accepted Answer

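There is clearly some encoding problem here.

I remember doing something similar a few months ago, and if you can write your own Java code it is fairly easy. These are mostly simple to write, and they work like a charm!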
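
For what it's worth, here is a rough sketch of what such Java code could look like. It is not the code I used back then, just a minimal example that does the same thing as the post.jar command in the other answer: it sends the raw PDF bytes to the /update/extract handler with an explicit application/pdf content type, so Solr should hand the file to Tika instead of trying to parse it as text. It assumes Java 11+ (for java.net.http), a Solr instance at http://localhost:8983/solr, and the extract handler enabled as in the tutorial. The class name PdfExtractPoster and the commit=true parameter are my own additions, so adjust them to your setup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper class; the name is illustrative only.
public class PdfExtractPoster {

    // Builds the extract-handler URI. literal.id sets the document's unique key;
    // commit=true is an assumption so the document is searchable right away.
    static URI buildExtractUri(String solrBase, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/update/extract?literal.id=" + id + "&commit=true");
    }

    // Posts the PDF as a binary body with Content-Type application/pdf
    // and returns the HTTP status code Solr answers with.
    static int postPdf(String solrBase, String docId, Path pdf)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(buildExtractUri(solrBase, docId))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java PdfExtractPoster.java <id> <file.pdf>");
            return;
        }
        // Assumed Solr base URL; change it if your server runs elsewhere.
        int status = postPdf("http://localhost:8983/solr", args[0], Path.of(args[1]));
        System.out.println("Solr responded with HTTP " + status);
    }
}
```

You would run it with something like java PdfExtractPoster.java 1 1.pdf. A status of 200 should mean the file was accepted; anything else is worth checking against the Solr log.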
