solr - How to index .html files in SOLR

Question

The files I want to index are stored on the server under /path/to/files/ (I don't need to crawl). A sample HTML file looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>

</p>
</p>

I added a request handler to the solrconfig.xml file.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/data-config.xml</str>
  </lst>
</requestHandler>

My data-config.xml looks like this:

<dataConfig>
<dataSource type="FileDataSource" />
<document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/path/to html/files/" fileName=".*html" recursive="true" rootEntity="false" dataSource="null">
        <field column="plainText" name="text"/>
    </entity>
</document>
</dataConfig>

I kept the default schema.xml file and added the following to it:

 <field name="product_id" type="string" indexed="true" stored="true"/>
 <field name="assetid" type="string" indexed="true" stored="true" required="true" />
 <field name="title" type="string" indexed="true" stored="true"/>
 <field name="type" type="string" indexed="true" stored="true"/>
 <field name="category" type="string" indexed="true" stored="true"/>
 <field name="first" type="text_general" indexed="true" stored="true"/>

 <uniqueKey>assetid</uniqueKey>

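When I run a full import with this setup, it reports that all the HTML files were fetched. But when I search in SOLR, no results come back. Does anyone know what the cause might be?

My understanding is that all the files are fetched correctly but are not indexed in SOLR. Does anyone know how to index the meta tags and the content of HTML files in SOLR?

Any reply would be much appreciated.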

score 5 · Accepted Answer

You can use Solr's ExtractingRequestHandler to feed HTML files to Solr and have their content extracted.

Solr uses Apache Tika to extract the content from the uploaded HTML files.
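
As a rough sketch of what such a request could look like, the Java snippet below builds an `/update/extract` URL and posts one HTML file to it with the JDK's `java.net.http` client. The class and method names are hypothetical, the base URL and core name are whatever your installation uses, and the sketch assumes the extraction handler is enabled in solrconfig.xml. `literal.<field>` and `commit` are parameters of the handler; passing `literal.assetid` supplies the required unique key explicitly.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
final class ExtractRequest {

    /** Builds <solrBase>/<core>/update/extract with URL-encoded query parameters. */
    static String buildUrl(String solrBase, String core, Map<String, String> params) {
        StringBuilder url = new StringBuilder(solrBase)
            .append('/').append(core).append("/update/extract");
        char sep = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(sep)
               .append(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
            sep = '&';
        }
        return url.toString();
    }

    /** Sends the raw HTML file as the request body and returns the HTTP status code. */
    static int post(String url, Path htmlFile) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "text/html")
            .POST(HttpRequest.BodyPublishers.ofFile(htmlFile))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

A call might pass a base URL such as `http://localhost:8983/solr`, a core name, and a map containing `literal.assetid` and `commit=true` to `buildUrl`, then hand the result to `post` together with the path of one HTML file. Whether the meta tags land in the matching schema fields depends on how Tika's metadata names line up with your schema, so it is worth inspecting the indexed document afterwards.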

If you want to crawl websites and index them, Nutch with Solr is a more comprehensive solution.
The Nutch with Solr tutorial will help you get started.

score 0 · Accepted Answer

Did you mean to have fileName="*.html" in data-config.xml? You currently have fileName=".*html".
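
One thing to check before changing it: the DataImportHandler documentation describes the fileName attribute of FileListEntityProcessor as a regular expression, not a shell glob. The small Java check below (class and method names are made up for illustration) shows how the two spellings behave as Java regexes. It says nothing about how DIH applies the pattern internally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for comparing the two fileName spellings.
final class FileNamePatternCheck {

    /** Returns true if the string compiles as a Java regular expression. */
    static boolean isValidRegex(String pattern) {
        try {
            Pattern.compile(pattern);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /** Returns true if the regex matches the entire file name. */
    static boolean matches(String pattern, String fileName) {
        return Pattern.compile(pattern).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex(".*html"));            // form used in the question
        System.out.println(isValidRegex("*.html"));            // shell-glob form
        System.out.println(matches(".*html", "giraffe.html")); // sample file name
    }
}
```

The glob form starts with a dangling `*`, which Java rejects as a regex, while `.*html` compiles and matches a name such as `giraffe.html`. If you want to be stricter, `.*\.html` matches only names that end in a literal `.html`.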

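I'm fairly sure Solr won't know how to turn the meta fields in your HTML into index fields. I haven't tried it.

However, I have written programs that read (x)html (using xpath). They produce a formatted XML file to send to /update. At that point, you should be able to use the DataImportHandler to pick up the formatted XML files.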
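
A minimal sketch of that idea, using only the JDK: it pulls the name/content pairs out of the meta tags and builds a Solr XML update message from them. It uses a regular expression instead of xpath, which is only adequate for simple, uniform files like the sample in the question. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the program described above.
final class MetaToSolrXml {

    // Matches <meta name="..." content="..."> with the attributes in that order.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    /** Collects name/content pairs from the meta tags, preserving document order. */
    static Map<String, String> extractMeta(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    /** Escapes the characters that are special in XML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a Solr XML update message containing a single document. */
    static String toSolrXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        String html = "<meta name=\"assetid\" content=\"10001\"/>\n"
                    + "<meta name=\"title\" content=\"title of the article\"/>";
        System.out.println(toSolrXml(extractMeta(html)));
    }
}
```

The resulting string can be posted to the /update handler with Content-Type text/xml. The http-equiv meta tag is skipped because it has no name attribute, so only fields that exist in the schema from the question are emitted.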

score 0 · Accepted Answer

Here is a complete example that converts HTML to text and extracts the relevant metadata:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

import java.io.ByteArrayInputStream;

public class ConversionTest {

    @Test
    public void testHtmlToTextConversion() throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(("<html>\n" +
            "<head>\n" +
            "<title> \n" +
            " A Simple HTML Document\n" +
            "</title>\n" +
            "</head>\n" +
            "<body></div>\n" +
            "<p>This is a very simple HTML document</p>\n" +
            "<p>It only has two paragraphs</p>\n" +
            "</body>\n" +
            "</html>").getBytes());
        BodyContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(bais, contenthandler, metadata, new ParseContext());
        assertEquals("\nThis is a very simple HTML document\n" + 
            "\n" + 
            "It only has two paragraphs\n" + 
            "\n", contenthandler.toString().replace("\r", ""));
        assertEquals("A Simple HTML Document", metadata.get("title"));
        assertEquals("A Simple HTML Document", metadata.get("dc:title"));
        assertNull(metadata.get("title2"));
        assertEquals("org.apache.tika.parser.DefaultParser", metadata.getValues("X-Parsed-By")[0]);
        assertEquals("org.apache.tika.parser.html.HtmlParser", metadata.getValues("X-Parsed-By")[1]);
        assertEquals("ISO-8859-1", metadata.get("Content-Encoding"));
        assertEquals("text/html; charset=ISO-8859-1", metadata.get("Content-Type"));
    }
}

score -1 · Accepted Answer

The simplest way is to use the post tool in the bin directory. It does all the work automatically. Here is an example:

./post -c conf1 /path/to/files/*

