linux - Configuring Apache Solr 3.6 with Tika 1.2

Question

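I am using Solr 3.6 together with Tika 1.2, but I cannot upload PDF files. First, I installed Solr and uploaded some *.xml files from exampledocs. I can search these files with this URL: http://localhost:8983/solr/select/?q=solr. In the next step, I installed Tika to upload PDF and DOC files, but it does not work. The following is in the file "example/solr/conf/solrconfig.xml":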

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="tika.config">tika-data-config.xml</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

In the file "example/solr/conf/tika-data-config.xml" I have this content:

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" transformer="TemplateTransformer" baseDir="/home/ubuntu-user/Documents" fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>

If I run this line in the console:

curl "http://localhost:8983/solr/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@test.pdf"

I get this output:

<?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">183</int>
    </lst>
  </response>

But I cannot search the content with Solr. If I browse to http://localhost:8983/solr/browse, I see a new entry, but it has no content.

I also started the Solr and Tika servers:

java -jar start.jar
java -jar tika-server-1.2.jar

Can anyone help me?

score 1 · Accepted Answer

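You need to add the jars (or the paths to them) for apache-solr-dataimporthandler-3.6, apache-solr-dataimporthandler-extras-3.6, and apache-solr-cell-3.6 from the dist folder, together with the corresponding files in the contrib folder.

Then you can extract PDFs from within Solr without starting the Tika server.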
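
The jars named above can be loaded with `<lib>` directives in solrconfig.xml. The following is a minimal sketch, not a confirmed fix. It assumes the stock Solr 3.6 example layout, so that `../../dist/` and `../../contrib/` resolve from the core's instance directory, and it assumes the tika-data-config.xml from the question is meant to be run by the DataImportHandler. The `/dataimport` handler name is an illustrative choice that does not appear in the question.

```xml
<!-- Assumed relative paths: adjust them if Solr is not started from the example directory. -->

<!-- DataImportHandler and its extras jar (the extras jar ships TikaEntityProcessor) -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />

<!-- Solr Cell plus the Tika jars it depends on -->
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

<!-- Hypothetical handler name; it registers the DIH config file from the question -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered this way, an import is normally started by requesting the handler with `command=full-import`, for example http://localhost:8983/solr/dataimport?command=full-import under the assumed handler name.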

score 0 · Accepted Answer

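Take a look at ExtractingRequestHandler; it will help you index rich documents.
You do not need to start a separate Tika server, because Solr can extract content from rich documents using the libraries added to it.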

The required jars (Solr Cell and the Tika jars it depends on) can be referenced in the configuration:

<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" /> 
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
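
For comparison, here is a sketch of the handler definition from the question with the `tika.config` entry left out. It rests on the assumption that `tika.config` is meant to name a Tika configuration file rather than a DataImportHandler `<dataConfig>` file such as tika-data-config.xml, so treat it as something to try rather than a confirmed fix.

```xml
<!-- Same handler as in the question, minus the tika.config entry (assumption: that entry
     should only be set when a custom Tika configuration file is actually in use). -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the extracted body text to the "text" field -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <!-- Prefix for extracted fields that are not defined in the schema -->
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
```

Parameters passed on the request URL, such as `uprefix=attr_` and `fmap.content=attr_content` in the curl command from the question, override these defaults.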

score 0 · Accepted Answer

I have now installed a fresh Solr, and I can search the PDFs with this URL:

http://localhost:8983/solr/select/?q=attr_content:st*

Some PDFs are fine, but for others I get this output:

<arr name="attr_content"><str>                         ((stdin))      � ���������

attr_creation_date and attr_meta are fine. The producer is GPL Ghostscript 8.63.
