I am looking to index rich document types (PDF, DOC, RTF, TXT) into Solr. I found Tika as a solution. I searched the web but didn't find any docs or links explaining how to make it work with ExtractingRequestHandler.

Can anyone please provide a step-by-step way to configure Tika with ExtractingRequestHandler?

Thanks in advance :)

1 Answer

Check ExtractingRequestHandler for the integration of Solr with Tika.
Solr ships with a built-in Tika configuration, so you do not need to define one (tika.config) unless you want to override it.
You can go with the default handler configuration as defined in solrconfig.xml:

<!-- Solr Cell Update Request Handler

   http://wiki.apache.org/solr/ExtractingRequestHandler 

-->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
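
To make the defaults above a bit more concrete, here is a rough Python sketch of how I understand the field-name handling to work: lowernames normalizes the Tika field name, then any fmap.* rule is applied, then uprefix is added to names the schema does not know. This is only an illustration, not Solr's actual code; the function map_field_name and the schema_fields set are made up for the example, and dynamic-field matching is left out.

```python
import re

def map_field_name(name, schema_fields, fmap=None, lowernames=True, uprefix=None):
    """Approximate how a Tika field name ends up as a Solr field name.

    Hypothetical helper for illustration only; not part of Solr.
    """
    fmap = fmap or {}
    if lowernames:
        # Lowercase and replace anything that is not a-z or 0-9 with "_"
        name = re.sub(r"[^0-9a-z]", "_", name.lower())
    # Apply an explicit fmap.<source>=<target> rule if one exists
    name = fmap.get(name, name)
    # Prefix fields that the (assumed) schema does not define
    if uprefix and name not in schema_fields:
        name = uprefix + name
    return name

schema = {"id", "title", "links", "content_type"}  # assumed schema fields
print(map_field_name("Content-Type", schema, uprefix="ignored_"))
print(map_field_name("a", schema, fmap={"a": "links"}, uprefix="ignored_"))
print(map_field_name("X-Parsed-By", schema, uprefix="ignored_"))
```

With the stock example schema, fields starting with ignored_ are normally matched by a dynamic field that discards them, which is why uprefix is set to ignored_ here.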

You can use the following command to index files into Solr with additional metadata:

curl "http://localhost:8983/solr/update/extract?literal.id=2&literal.title=Test&commit=true&fmap.content=text" -F "myfile=@1.pdf"

By default, the extracted content of the file is indexed into the content field and copied over to text; you can override these settings (for example with fmap.content, as in the command above).
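
If you need to index many files, it can help to build the request URL in code instead of typing it by hand. Below is a minimal sketch using only the Python standard library; build_extract_url is a hypothetical helper name, and the base URL simply matches the curl example. It only builds the URL; the file itself still has to be sent as a multipart POST (with curl -F or an HTTP client of your choice).

```python
from urllib.parse import urlencode

def build_extract_url(base_url, doc_id, literals=None, fmap=None, commit=True):
    """Build an /update/extract URL with literal.* and fmap.* parameters.

    Hypothetical helper for illustration; parameter names follow the curl example.
    """
    params = [("literal.id", doc_id)]
    # literal.<field>=<value> adds extra metadata to the indexed document
    for field, value in (literals or {}).items():
        params.append((f"literal.{field}", value))
    # fmap.<source>=<target> renames an extracted field
    for source, target in (fmap or {}).items():
        params.append((f"fmap.{source}", target))
    if commit:
        params.append(("commit", "true"))
    return f"{base_url}/update/extract?{urlencode(params)}"

url = build_extract_url(
    "http://localhost:8983/solr",
    "2",
    literals={"title": "Test"},
    fmap={"content": "text"},
)
print(url)
```

urlencode also takes care of escaping values such as titles that contain spaces, which is easy to get wrong when writing the query string manually.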

Answered 2013-07-15T05:09:31.767