hadoop - MapReduceIndexerTool - Best way to index HDFS files in Solr?

Question

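I have a requirement to index HDFS files (TXT, PDF, DOCX and other rich documents) into Solr.

Currently I am using the DirectoryIngestMapper of the LucidWorks connector to do this. https://github.com/lucidworks/hadoop-solr

But I can't keep using it because it has certain limitations (the main one being that you cannot specify which file types to consider).

So now I am looking into MapReduceIndexerTool. However, there are not many beginner-level (I mean absolute basics!) examples for it.

Could someone post links with examples for getting started with MapReduceIndexerTool? Is there any other, better or easier way to index files in HDFS?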

score 3 · Accepted Answer

On Cloudera, I think you have these options:

MapReduceIndexerTool
CrunchIndexerTool
A custom Spark or MapReduce job, for example using spark-solr

Regarding MapReduceIndexerTool, here is a quick guide:

Index a csv to SolR using MapReduceIndexerTool

This guide shows you how to index/upload a .csv file to SolR using MapReduceIndexerTool. The procedure reads the csv from HDFS and writes the index directly inside HDFS.

See also https://www.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html.

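Assuming you have:

A valid cloudera installation (referred to as THIS_IS_YOUR_CLOUDERA_HOST; if you use the Docker Quickstart it should be quickstart.cloudera)
A csv file stored in HDFS (referred to as THIS_IS_YOUR_INPUT_CSV_FILE, e.g. /your-hdfs-dir/your-csv.csv)
A valid destination SolR collection with the expected fields already configured (referred to as THIS_IS_YOUR_DESTINATION_COLLECTION)
The output directory will be the configured SolR instanceDir (referred to as THIS_IS_YOUR_CORE_INSTANCEDIR) and should be an HDFS path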

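For this example we will process a TAB-separated file with uid, firstName and lastName columns. The first row contains the headers. The Morphlines configuration file skips the first line, so the actual column names don't matter; the columns only need to appear in this order. On SolR, the fields should be configured with something like: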

<field name="_version_" type="long" indexed="true" stored="true" />
<field name="uid" type="string" indexed="true" stored="true" required="true" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" multiValued="true" />
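
To make the expected input layout concrete, here is a minimal sketch that creates a small TAB-separated file with a header row and the three columns in the order described above. The file name sample.tsv and the two data rows are made up for illustration; the HDFS target path is the example path from the assumptions list, and the upload step only runs if an hdfs client happens to be installed.

```shell
# Hypothetical sample data: header row plus two records, TAB-separated.
printf 'uid\tfirstName\tlastName\n' >  sample.tsv
printf '1\tAda\tLovelace\n'         >> sample.tsv
printf '2\tAlan\tTuring\n'          >> sample.tsv

# Sanity check: every line must have exactly 3 TAB-separated columns.
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad }' sample.tsv && echo "layout ok"

# Upload to HDFS only when the hdfs client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put -f sample.tsv /your-hdfs-dir/your-csv.csv
fi
```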

Then create a Morphlines configuration file (csv-to-solr-morphline.conf) with the following content:

# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : THIS_IS_YOUR_DESTINATION_COLLECTION

  # ZooKeeper ensemble
  zkHost : "THIS_IS_YOUR_CLOUDERA_HOST:2181/solr"
}


# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readCSV {
          separator : "\t"
          # These columns should match the fields configured in SolR and are expected in this order inside the CSV
          columns : [uid,firstName,lastName]
          ignoreFirstLine : true
          quoteChar : ""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

    ]
  }
]

To run the import, execute the following command inside the cluster:

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --output-dir hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
  --morphline-file ./csv-to-solr-morphline.conf \
  --zk-host THIS_IS_YOUR_CLOUDERA_HOST:2181/solr \
  --solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
  --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
  --go-live \
  hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
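
Before loading anything into Solr, it can help to check the morphline with the tool's --dry-run option, which runs the morphline locally and prints the resulting documents instead of loading them. The following is a sketch under assumptions, not a verified recipe: the shell variable names are my own, the values are the placeholders from this guide (with quickstart.cloudera as the host), and --go-live is left out because nothing is merged during a dry run.

```shell
# Placeholder values taken from the guide above; replace them with your own.
CLOUDERA_HOST="quickstart.cloudera"
CORE_INSTANCEDIR="/THIS_IS_YOUR_CORE_INSTANCEDIR"
COLLECTION="THIS_IS_YOUR_DESTINATION_COLLECTION"
INPUT_CSV="/your-hdfs-dir/your-csv.csv"

# Derived values, built once so they stay consistent with each other.
OUTPUT_DIR="hdfs://${CLOUDERA_HOST}${CORE_INSTANCEDIR}/"
ZK_HOST="${CLOUDERA_HOST}:2181/solr"
INPUT_URI="hdfs://${CLOUDERA_HOST}${INPUT_CSV}"

if command -v hadoop >/dev/null 2>&1; then
  # --dry-run executes the morphline in the client process and prints the
  # documents to stdout instead of loading them into Solr.
  hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --output-dir "$OUTPUT_DIR" \
    --morphline-file ./csv-to-solr-morphline.conf \
    --zk-host "$ZK_HOST" \
    --solr-home-dir "$CORE_INSTANCEDIR" \
    --collection "$COLLECTION" \
    --dry-run \
    "$INPUT_URI"
else
  echo "hadoop client not found; skipping dry run for $INPUT_URI"
fi
```

Once the printed documents look right, the same arguments can be reused for the real run by replacing --dry-run with --go-live.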

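Some considerations:

You can run the command above with sudo -u hdfs, because your own user will most likely not have permission to write to the HDFS output directory.
By default, Cloudera QuickStart is configured with very little memory and heap. If you get out-of-memory or heap exceptions, I suggest increasing them via Cloudera Manager->Yarn->Configurations (http://THIS_IS_YOUR_CLOUDERA_HOST:7180/cmf/services/11/config#filterdisplayGroup=Resource+Management). I used 1 GB of memory and 500 MB of heap for both map and reduce jobs. Also consider changing yarn.app.mapreduce.am.command-opts, mapreduce.map.java.opts and mapreduce.map.memory.mb inside /etc/hadoop/conf/mapred-site.xml.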
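
As an alternative to editing the cluster configuration, Hadoop's generic -D options can usually override such properties for a single job. The sketch below assumes that approach works in your setup; the variable names are my own, and the values simply mirror the 1 GB of memory and 500 MB of heap mentioned above for the map side.

```shell
# Values mirror the 1 GB memory / 500 MB heap mentioned above; adjust as needed.
MAP_MEMORY_MB=1024
MAP_JAVA_OPTS="-Xmx500m"

if command -v hadoop >/dev/null 2>&1; then
  # Generic -D options go right after the class name, before the tool's own options.
  hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    -D "mapreduce.map.memory.mb=${MAP_MEMORY_MB}" \
    -D "mapreduce.map.java.opts=${MAP_JAVA_OPTS}" \
    --output-dir hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
    --morphline-file ./csv-to-solr-morphline.conf \
    --zk-host THIS_IS_YOUR_CLOUDERA_HOST:2181/solr \
    --solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
    --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
    --go-live \
    hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
else
  echo "hadoop client not found; would use ${MAP_MEMORY_MB} MB and ${MAP_JAVA_OPTS}"
fi
```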

score 1 · Accepted Answer

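"But I can't use it because it has certain limitations (mainly that you cannot specify which file types to consider)."

With https://github.com/lucidworks/hadoop-solr, the input is a path.

So you can select files by name: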

-i /path/*.pdf

Edit:

You can add the add.subdirectories argument. Note, however, that the *.pdf pattern is not applied recursively, according to the source on GitHub.

-Dadd.subdirectories=true
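
The non-recursive behaviour of a pattern like /path/*.pdf can be illustrated with plain local shell globbing. This is only a local analogy with made-up file names in a temporary directory, not a run of the connector itself: the * wildcard does not cross directory separators, so PDFs in subdirectories are not matched.

```shell
# Local analogy only: temporary directory with made-up file names.
tmp=$(mktemp -d)
mkdir -p "$tmp/docs/sub"
touch "$tmp/docs/a.pdf" "$tmp/docs/b.txt" "$tmp/docs/sub/c.pdf"

# The pattern matches docs/a.pdf but neither docs/b.txt nor docs/sub/c.pdf.
matches=$(ls "$tmp"/docs/*.pdf)
echo "$matches"

# Clean up the temporary files.
rm -rf "$tmp"
```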
