php - Indexing PDF file content with Apache Solr

Question

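I am using the Solr PHP extension to interact with Apache Solr. I am indexing data from a database, and I also want to index the content of external files (such as PDF and PPTX).

The indexing logic is as follows. Assume schema.xml defines these fields: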

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="created" type="tlong" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="filepath" type="text_general" indexed="false" stored="true"/>
<field name="filecontent" type="text_general" indexed="false" stored="true"/>

A given database entry may or may not have a file stored with it.

So this is my indexing code:

// $post is a stdClass object holding the database content
$doc = new SolrInputDocument();
$doc->addField('id', $post->id);
$doc->addField('name', $post->name);
// ....
// ....
$res = $client->addDocument($doc);
$client->commit();

Next, I want to add the content of the PDF file to the same Solr document as above.

This is the curl code:

$ch = curl_init('http://localhost:8010/solr/update/extract?');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $post->filepath));
$result = curl_exec($ch);

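However, I think I am missing something. I read the documentation, but I could not find a way to retrieve the file's content and then add it to the existing Solr document in the field filecontent.

Edit #1: If I set literal.id=xyz in the curl request, it creates a new Solr document with id=xyz. I do not want to create a new Solr document. I want the PDF's content to be indexed and stored as a field in the previously created Solr document.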

$doc = new SolrInputDocument(); // the Solr document is created
$doc->addField('id', 98765);    // the document created above is assigned id 98765
// ....
// ....
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $post->filepath));
$result = curl_exec($ch);

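I want the Solr document above (id = 98765) to have a field in which the PDF's content is indexed and stored.

But the cURL request (as above) creates another, new document (with id = 1). I do not want that.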

score 2 · Accepted Answer

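Solr, together with Apache Tika, handles extracting the content of rich documents and adding it back to the Solr document.

From the documentation:

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default mapping rule in the /update/extract handler in solrconfig.xml and can easily be changed or overridden. For example, to store and see all metadata and content, execute the following command:

Default schema.xml: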

<!-- Main body of document extracted by SolrCell.
    NOTE: This field is not indexed by default, since it is also copied to "text"
    using copyField below. This is to save space. Use this field for returning and
    highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

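If you want a different field to hold the file content, override the default in solrconfig.xml itself with fmap.content=filecontent.

The fmap.content=attr_content parameter overrides the default fmap.content=text, causing the content to be added to the attr_content field.

If you want to index it all into a single document, use the literal prefix, e.g. the parameters literal.id=1&literal.name=Name: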
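The field mapping described above can also be passed per request as a query parameter rather than being set as a handler default. As a minimal sketch, the helper below appends an fmap.<source>=<target> parameter to an extract-handler URL. The function name withFieldMapping is hypothetical (it is not part of the Solr PHP extension), and filecontent is the field from the question's schema.

```php
<?php
// Hypothetical helper, for illustration only: append an
// fmap.<source>=<target> parameter to an /update/extract URL.
function withFieldMapping(string $url, string $sourceField, string $targetField): string
{
    // Use '?' if the URL has no query string yet, otherwise '&'.
    $separator = (strpos($url, '?') === false) ? '?' : '&';

    return $url . $separator
        . 'fmap.' . rawurlencode($sourceField)
        . '=' . rawurlencode($targetField);
}
```

For example, withFieldMapping('http://localhost:8010/solr/update/extract?literal.id=1', 'content', 'filecontent') asks the handler to put Tika's "content" output into filecontent. Note that the question's schema declares filecontent with indexed="false", so the text would be stored but not searchable through that field.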

$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $post->filepath));
$result = curl_exec($ch);
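One way to end up with a single document is to send the database fields as literal.* parameters in the same extract request, using the database row's id as literal.id. Solr replaces an existing document when a new one arrives with the same unique key, so any field not re-sent this way would be lost. The sketch below assumes that approach; the helper names extractParams and postFileToSolr are hypothetical. It also uses CURLFile, because the '@' upload prefix was deprecated in PHP 5.5 and removed in PHP 7.0.

```php
<?php
// Hypothetical helper: build the query parameters for /update/extract so the
// extracted text and the database fields land in ONE Solr document.
function extractParams(string $docId, array $dbFields, string $contentField): array
{
    $params = ['literal.id' => $docId];       // same id as the database row
    foreach ($dbFields as $field => $value) {
        $params['literal.' . $field] = (string) $value;
    }
    $params['fmap.content'] = $contentField;  // route Tika's "content" output
    $params['commit'] = 'true';

    return $params;
}

// Hypothetical helper: upload the file to the extract handler and return the
// raw response body. Requires the curl extension.
function postFileToSolr(string $handlerUrl, array $params, string $filePath): string
{
    $url = $handlerUrl . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['myfile' => new CURLFile($filePath)]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('Solr request failed: ' . curl_error($ch));
    }

    return $response;
}
```

With the question's data, the call would look like postFileToSolr('http://localhost:8010/solr/update/extract', extractParams('98765', ['name' => $post->name], 'filecontent'), $post->filepath). Entries without a file would still go through addDocument as before.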
