python - Text indexer (for Python) with built-in support for doc, docx, and pdf files

Question

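I am currently looking for a text indexer for my Python program. I have shortlisted Solr, which is a Lucene project, and Whoosh, which is native Python. I searched a lot of documentation about support for doc, docx, and pdf files, and the Solr material kept pointing me to the Tika package, a version of which is integrated with Solr.

The results did not say clearly whether either package has built-in support for these three formats. Do Whoosh and Solr support them? Which other open-source indexers read these formats natively?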

score 3 · Accepted Answer

With Solr 1.4 or later you can have Word and PDF files uploaded and indexed on the fly; see: http://wiki.apache.org/solr/ExtractingRequestHandler

Solr's ExtractingRequestHandler uses Tika: you upload a binary file to Solr, and Solr extracts the text from it and then indexes it.
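
From Python, the upload could look roughly like the sketch below. It assumes the handler is mapped at `/update/extract` (the path used in the Solr documentation) and that the file is sent as the raw POST body; the core URL, file name, and document id are placeholders, and the helper names `build_extract_url` and `index_file` are made up for illustration. Older Solr releases may not have a core name in the URL, so adjust the base URL to your setup.

```python
import urllib.parse
import urllib.request


def build_extract_url(solr_core_url, doc_id, commit=True):
    """Build the URL for Solr's extracting handler (hypothetical helper).

    literal.id sets the unique id of the indexed document;
    commit=true makes the document searchable right away.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    query = urllib.parse.urlencode(params)
    return solr_core_url.rstrip("/") + "/update/extract?" + query


def index_file(solr_core_url, path, doc_id,
               content_type="application/octet-stream"):
    """POST a binary file (doc, docx, pdf, ...) so Solr/Tika can extract its text."""
    with open(path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_extract_url(solr_core_url, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Raises urllib.error.HTTPError if Solr rejects the request.
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder values: assumes a local Solr with a core named "mycore".
    status = index_file(
        "http://localhost:8983/solr/mycore",
        "report.pdf",
        doc_id="report-1",
        content_type="application/pdf",
    )
    print("Solr responded with HTTP status", status)
```

Only the standard library is used here, so nothing Solr-specific has to be installed on the Python side; the text extraction itself happens inside Solr.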
