Questions tagged "solr-cell"

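0 votes

1 answer

2768 views

pdf - Indexing PDFs with page numbers using Solr

I am indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page numbers along with the hits in a document, e.g. "term foo found in bar.pdf on pages 2, 3 and 5".

Is it possible to include page numbers in the query results like this?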
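For illustration only, here is a minimal sketch of how such a message could be assembled on the client side. It assumes each PDF page has been indexed as its own Solr document carrying two hypothetical fields, `file` and `page`; nothing in the question confirms that the ExtractingRequestHandler produces these on its own.

```python
from collections import defaultdict


def format_pages(pages):
    """Render page numbers as 'page 2' or 'pages 2, 3 and 5'."""
    parts = [str(p) for p in sorted(set(pages))]
    if len(parts) == 1:
        return "page " + parts[0]
    return "pages " + ", ".join(parts[:-1]) + " and " + parts[-1]


def summarize_hits(term, hits):
    """Group per-page hits by file and build one message per file.

    `hits` is a list of dicts with the hypothetical keys 'file' and 'page',
    e.g. the documents returned by a query against a page-per-document index.
    """
    pages_by_file = defaultdict(list)
    for hit in hits:
        pages_by_file[hit["file"]].append(hit["page"])
    return [
        f"term {term} found in {name} on {format_pages(pages)}"
        for name, pages in sorted(pages_by_file.items())
    ]
```

The grouping happens after the query, so it works with any result format that exposes the file name and page number per hit.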

2010-11-04T06:05:15.957

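0 votes

1 answer

4223 views

solr - How to boost a SOLR document when indexing with /solr/update

To index my website, I have a Ruby script that generates a shell script, which uploads every file in my document root to Solr. The shell script has many lines that look like this:

...and ends with:

This uploads all documents in my document root to Solr. I use Tika and the ExtractingRequestHandler to upload documents in various formats (primarily PDF and HTML) to Solr.

In the script that generates this shell script, I would like to boost certain documents based on whether their id field (a/k/a url) matches certain regular expressions.

Let's say these are the boosting rules (pseudocode):

What is the simplest way to add an index-time boost to my HTTP request?

I tried:

and:

Neither made a difference in the ordering of search results. What I want is for the boosted results to come first in the search results, regardless of what the user searched for (provided, of course, that the document contains their query).

I understand that if I POST in XML format, I can specify the boost value for either the entire document or a specific field. But if I do that, it isn't clear how to specify a file as the document contents. Actually, the Tika page provides a partial example:

But again, it isn't clear where or how to specify my boost. I tried:

and

Neither of which altered the search results.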

Is there a way to update just the boost attribute of a document (not a specific field) without altering the document contents? If so, I could accomplish my goal in two steps: 1) Upload/index document as I have been doing 2) Specify boost for certain documents
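As a rough sketch of the rule-matching step, the generator script could pick a boost per id and append it to the extract URL. The rules, the base URL, and the boosted field name `text` below are all hypothetical placeholders, and the sketch assumes the handler honors a per-field `boost.<fieldname>` request parameter; whether that actually changes ranking depends on the Solr version and schema, so treat it as something to try rather than a confirmed fix.

```python
import re
from urllib.parse import urlencode

# Hypothetical rules: (pattern matched against the id/url, boost value).
# First matching rule wins.
BOOST_RULES = [
    (re.compile(r"^/products/"), 2.0),
    (re.compile(r"\.pdf$"), 0.5),
]


def boost_for(doc_id, rules=BOOST_RULES, default=1.0):
    """Return the boost of the first rule whose pattern matches doc_id."""
    for pattern, boost in rules:
        if pattern.search(doc_id):
            return boost
    return default


def boosted_extract_url(doc_id,
                        base="http://localhost:8983/solr/update/extract",
                        boosted_field="text"):
    """Build an extract URL, adding boost.<field> only for boosted ids.

    `base` and `boosted_field` are assumptions for this example.
    """
    params = {"literal.id": doc_id, "commit": "false"}
    boost = boost_for(doc_id)
    if boost != default_boost():
        params["boost." + boosted_field] = boost
    return base + "?" + urlencode(params)


def default_boost():
    """Neutral boost value; ids with this boost get no boost parameter."""
    return 1.0
```

Keeping the rules in one list means the generated shell script only changes in the URL it passes to each upload command.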

solr apache-tika solr-cell

2011-02-09T02:24:10.320

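0 votes

1 answer

5454 views

java - How to index the content of a PDF with SolrJ?

I'm trying to index some PDF documents using SolrJ, as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. Below is the code:

Unfortunately, when querying *:*, I get the list of indexed documents, but the content field is empty. How can I change the code above to also extract the content of the documents?

Below is the XML fragment that describes this document:

I don't think the problem is related to an incorrect installation of Apache Tika, because previously I got some ServerException errors, but now I have installed the required jars in the correct path. Moreover, I tried indexing a txt file using the same class, but the attr_content field is always empty.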
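One thing that may be worth checking is how the extracted body is mapped onto schema fields. The sketch below only builds the request parameters, using the Solr Cell parameters `literal.<field>`, `uprefix` and `fmap.<source>`; the base URL and the id value are placeholders, and it assumes that `attr_content` (or whatever the body is mapped to) is declared as a stored field in the schema, which the question does not confirm.

```python
from urllib.parse import urlencode


def extract_params(doc_id, content_field="attr_content"):
    """Parameters for an extract request that maps the Tika body to a field.

    Assumption: `content_field` exists in the schema and is stored,
    otherwise it will still come back empty in query results.
    """
    return {
        "literal.id": doc_id,           # set the unique key explicitly
        "uprefix": "attr_",             # prefix for fields unknown to the schema
        "fmap.content": content_field,  # map Tika's 'content' to this field
        "commit": "true",
    }


def build_extract_url(doc_id, base="http://localhost:8983/solr/update/extract"):
    """Return the full request URL; `base` is a placeholder for this example."""
    return base + "?" + urlencode(extract_params(doc_id))
```

The same parameter names can be set on a SolrJ request object, so the mapping can be tried from a URL first and then moved into the Java code.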

java solr solr-cell

2011-04-17T13:06:44.267

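0 votes

1 answer

1943 views

solr - How to configure Tika 0.9 with Solr 3.1

Can you give me the steps to configure Tika 0.9 with Solr 3.1?

This is what I am using in solrconfig.xml for the configuration. Please help me.

Thanks,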

solr apache-tika solr-cell

2011-04-20T06:36:02.317

0 votes

1 answer

3168 views

solr - Tika Solr integration

I am trying to index using a curl-based request.

The request is

On submitting the request, I get this error:

solr full-text-search apache-tika solr-cell

2011-05-31T11:28:52.167

0 votes

1 answer

785 views

solr - Solr Cell / ExtractingRequestHandler cannot parse some *.doc files

I need to index content of doc/docx/pdf files uploaded by users and use Solr (1.4.1) ExtractingRequestHandler component (817165) for that. If that matters, I don't request indexing from it - the component is always called with extractOnly parameter returning text content of the document only and not adding it to the index on its own straight away (the content is then added to the index "outside" as a text field of the document following the standard procedure).

However, some files are not parsed and the component returns 500 Internal Server Error with no other details provided. Of all *.doc files submitted by our users about 30% of them fail to parse.

It is not the problem with Solr load - the files that cannot be parsed are always the same if you parse the same list of them again and again. It is also not about their size - many of them are smaller than other ones parsed successfully. Apparently, it is not about peculiar formatting (or at least that is not obvious) - almost all documents that fail to parse have coloured fonts, tables and images but many of the ones parsed successfully also have the same.

All these files open in Word without any warnings or errors. If you save them as docx Solr starts parsing them correctly but re-saving them in the same doc format with the same content doesn't help. Still, if all the content is removed and replaced by some lorem ipsum text, then saved as doc, they become correct.

As the content replacing helps, it should be something with some elements used in the documents but there is no description on Tika Formats page telling in which cases parsing of the document fails.

I've uploaded a sample file that fails to parse, in case anyone is curious enough to try it (it is archived to prevent Windows Live from converting it into an "online document").

Currently, as a workaround, I use the ancient antiword utility to parse those *.doc files on which Solr fails (and antiword parses them perfectly). Still, it is obviously a crutch, and I wonder if anybody else is facing the same issue. I failed to find anything on Google, so it is probably me doing something wrong.

Or, if that's a known problem, what could be more elegant ways to solve it (I don't like relying on antiword)?
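For reference, the workaround described above can be kept in one place so that the fallback is explicit rather than scattered through the indexing code. This is only a sketch: `primary` stands for whatever function calls the ExtractingRequestHandler with extractOnly (not shown here), it is assumed to raise `urllib.error.HTTPError` on a 500 response, and the antiword command is assumed to be on the PATH.

```python
import subprocess
import urllib.error


def antiword_text(path):
    """Run the antiword CLI on a .doc file and return its plain-text output."""
    result = subprocess.run(
        ["antiword", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def extract_text(path, primary, fallback=antiword_text):
    """Try the primary extractor; fall back only on HTTP 500.

    `primary` is a hypothetical callable taking a file path and returning
    the extracted text (e.g. a wrapper around an extractOnly request).
    Any other HTTP error is re-raised unchanged.
    """
    try:
        return primary(path)
    except urllib.error.HTTPError as err:
        if err.code != 500:
            raise
        return fallback(path)
```

Passing the extractors in as arguments makes it easy to swap antiword for another parser later without touching the callers.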

solr ms-word doc apache-tika solr-cell

2011-06-16T08:45:15.713

0 votes

1 answer

3468 views