solr - HTML parser not working in Solr 3.6

Question

Using solr.jar with the example from the Apache Solr 3.6 download, HTML tags are not being stripped.

In schema.xml I added the following:

<!-- A text field that strips HTML markup in the analyzer before tokenizing -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_html" indexed="true" stored="true" multiValued="true"/>

I also posted the following JSON to Solr:

[
{
    "id" : "978-064172344522",

    "title":"my <a href=\"www.foo.bar\">link</a>  power-shot PowerShot USC Utility <br>hello</br> Rejections Under 35 U.S.C. 101 and 35 U.S.C. 112, First Paragraph Petitions to correct inventorship of an issued patent are decided by the <Underline>Supervisory Patent Examiner</Underline>, as set forth"

}

]

After restarting Solr, I searched for power-shot, and the result still shows the HTML tags:

 <result name="response" numFound="1" start="0" maxScore="0.13561106">
 <doc>
 <float name="score">0.13561106</float>
 <str name="id">978-064172344522</str>
 <arr name="title">
 <str>my <a href="www.foo.bar">link</a> power-shot PowerShot USC Utility <br>hello</br>

What is missing here?

score 2 · Accepted Answer

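What you are seeing is the stored value of the field, which is returned exactly as it was originally sent to Solr. The analyzer chain, including the char filter, only affects the terms that go into the index, not the stored value. For example, if you search for "title:href", the document should not be found, because the HTML markup is removed in the analyzer chain.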
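
To make the stored/indexed distinction concrete, here is a minimal Python sketch. It is not Solr code: `TagStripper`, `strip_html`, and `analyze` are hypothetical helpers that only approximate what `HTMLStripCharFilterFactory` followed by a tokenizer does, and the `\w+` split is a rough stand-in for `StandardTokenizerFactory` (the real tokenizer treats some punctuation differently). The point is only that the stored value keeps the markup while the indexed terms do not.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.chunks.append(data)


def strip_html(raw):
    """Rough analogue of a char filter that removes HTML markup."""
    parser = TagStripper()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def analyze(raw):
    """Rough analogue of char filter + tokenizer: returns the indexed terms."""
    return re.findall(r"\w+", strip_html(raw))


# The stored value is the original input, untouched by analysis.
stored = 'my <a href="www.foo.bar">link</a> power-shot PowerShot <br>hello</br>'

# The indexed terms are what a search is matched against.
indexed = analyze(stored)

print("stored :", stored)
print("indexed:", indexed)
print("'href' searchable?", "href" in indexed)
print("'link' searchable?", "link" in indexed)
```

In this sketch, "href" never becomes an indexed term, so a search for it finds nothing, yet the value handed back in results is still the original string with its tags. That mirrors what the response in the question shows.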
