elasticsearch - Indexing data crawled by Apache Nutch with Elasticsearch?

Question

I have Apache Nutch 1.7 and Elasticsearch 1.4.4 on an AWS EC2 Ubuntu instance. I crawled data with Nutch, but how do we index that data with Elasticsearch? There is no official documentation on this.

score 1 · Accepted Answer

Enable the Elasticsearch indexer in the configuration by adding indexer-elastic to the plugin.includes property list. See below:

    <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

score 1 · Accepted Answer

Add the following property to your nutch-site.xml:

    <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

The above makes Elasticsearch the indexer. The following property specifies the Elasticsearch host:

    <property>
            <name>elastic.host</name>
            <value>localhost</value>
    </property>

Other optional properties you can set include elastic.port, elastic.cluster, and so on.

Since you say you have already crawled the data and now want to index it, you can use:

    ./bin/nutch index <crawldb> -dir <segment_dir>

This indexes all the crawled data residing in the segments. You can then check the Elasticsearch index for documents.
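As a rough sketch of those two steps together, the snippet below runs the index command with concrete paths and then asks Elasticsearch how many documents the index holds. Everything specific here is an assumption, not something stated in the answer: the crawl directory `crawl/` with its `crawldb` and `segments` subdirectories, the index name `nutch` (which I believe is the default for `elastic.index`, but check your nutch-default.xml), and HTTP port 9200 for the REST API (the `elastic.port` property refers to the transport port, which is a different port).

```shell
# Hypothetical layout: adjust CRAWL_DIR to wherever your crawl output lives.
CRAWL_DIR=crawl
CRAWLDB="${CRAWL_DIR}/crawldb"
SEGMENTS="${CRAWL_DIR}/segments"

# Assumed Elasticsearch settings: host from elastic.host, default HTTP port,
# and an index called "nutch" (assumption, verify against your config).
ES_HOST=localhost
ES_HTTP_PORT=9200
ES_INDEX=nutch
COUNT_URL="http://${ES_HOST}:${ES_HTTP_PORT}/${ES_INDEX}/_count?pretty"

# Step 1: index every segment under the segments directory.
if [ -x ./bin/nutch ]; then
  ./bin/nutch index "${CRAWLDB}" -dir "${SEGMENTS}"
else
  echo "bin/nutch not found here; run from the Nutch runtime directory:"
  echo "  ./bin/nutch index ${CRAWLDB} -dir ${SEGMENTS}"
fi

# Step 2: ask Elasticsearch for the document count of the index.
curl -s --max-time 5 "${COUNT_URL}" \
  || echo "Could not reach Elasticsearch at ${ES_HOST}:${ES_HTTP_PORT}"
```

A non-zero `count` in the JSON response suggests the documents arrived. If it stays at zero, the Nutch log (logs/hadoop.log in a default local setup) is the first place to look.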
