On Cloudera, I think you have the following options:
Regarding MapReduceIndexerTool, here is a quick guide:
Indexing a CSV into Solr with MapReduceIndexerTool
This guide shows how to index/upload a CSV file into Solr using MapReduceIndexerTool. The process reads the CSV from HDFS and writes the index directly inside HDFS.
See also https://www.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html.
Assuming that you have:
- a valid Cloudera installation (see THIS_IS_YOUR_CLOUDERA_HOST; if you are using the Docker quickstart it should be quickstart.cloudera)
- a CSV file stored in HDFS (see THIS_IS_YOUR_INPUT_CSV_FILE, e.g. /your-hdfs-dir/your-csv.csv)
- a valid destination Solr collection with the expected fields already configured (see THIS_IS_YOUR_DESTINATION_COLLECTION)
- an output directory, which will be the configured Solr instanceDir (see THIS_IS_YOUR_CORE_INSTANCEDIR) and should be an HDFS path
For this example, we will process a TAB-separated file with uid, firstName and lastName columns. The first row contains the headers. The Morphlines configuration file will skip the first line, so the actual column names do not matter; the columns are simply expected in that order. On Solr we should configure the fields with something like:
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="uid" type="string" indexed="true" stored="true" required="true" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" multiValued="true" />
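If you want a small sample input to experiment with, here is a minimal Python sketch that writes a TAB-separated file in the column order the morphline's readCSV command expects (uid, lastName, firstName). The file name and rows are hypothetical, not part of the original guide:

```python
from pathlib import Path

# Hypothetical sample data; the header line is later skipped by the
# morphline (ignoreFirstLine), so only the column order matters.
rows = [
    ("uid", "lastName", "firstName"),  # header, ignored by readCSV
    ("1", "Doe", "John"),
    ("2", "Smith", "Jane"),
]

path = Path("your-csv.csv")  # hypothetical local file name
path.write_text("\n".join("\t".join(r) for r in rows) + "\n", encoding="utf-8")
print(path.read_text(encoding="utf-8").splitlines()[0])  # prints the header line
```

You would then copy such a file into HDFS before running the indexing job.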
Then you should create a Morphlines configuration file (csv-to-solr-morphline.conf) with the following code:
# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : THIS_IS_YOUR_DESTINATION_COLLECTION

  # ZooKeeper ensemble
  zkHost : "THIS_IS_YOUR_CLOUDERA_HOST:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readCSV {
          separator : "\t"
          # These columns should map to the ones configured in Solr
          # and are expected in this position inside the CSV
          columns : [uid,lastName,firstName]
          ignoreFirstLine : true
          quoteChar : ""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
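To make the readCSV step above concrete, here is a small Python sketch that roughly mimics what it does with this file format (TAB separator, first line ignored, values trimmed). The sample data is hypothetical and this is not the Kite SDK implementation:

```python
COLUMNS = ["uid", "lastName", "firstName"]  # same order as in readCSV

def read_tsv_records(text):
    """Roughly mimic the readCSV command: split on TAB,
    skip the first (header) line, and trim each value."""
    records = []
    for line in text.splitlines()[1:]:                  # ignoreFirstLine : true
        values = [v.strip() for v in line.split("\t")]  # trim : true
        records.append(dict(zip(COLUMNS, values)))
    return records

sample = "uid\tlastName\tfirstName\n1\tDoe\tJohn\n2\tSmith\tJane\n"
records = read_tsv_records(sample)
print(records[0])  # -> {'uid': '1', 'lastName': 'Doe', 'firstName': 'John'}
```

Each resulting record is then sanitized against the Solr schema and loaded by the subsequent commands.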
To import, run the following command inside the cluster:
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
  --morphline-file ./csv-to-solr-morphline.conf \
  --zk-host quickstart.cloudera:2181/solr \
  --solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
  --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
  --go-live \
  hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
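Before submitting the job, it can help to sanity-check the input file locally, since a malformed row or a missing required uid will make the load fail. A minimal Python sketch (the sample data is hypothetical):

```python
import os
import tempfile

EXPECTED_COLUMNS = 3  # uid, lastName, firstName

def validate_tsv(path):
    """Return the line numbers of rows that do not have exactly three
    TAB-separated fields or whose uid is empty; the header line is
    skipped, just as the morphline does."""
    bad = []
    with open(path, encoding="utf-8") as f:
        next(f, None)  # skip header
        for lineno, line in enumerate(f, start=2):
            fields = [v.strip() for v in line.rstrip("\n").split("\t")]
            if len(fields) != EXPECTED_COLUMNS or not fields[0]:
                bad.append(lineno)
    return bad

# Demo on a hypothetical sample where line 3 has an empty uid
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("uid\tlastName\tfirstName\n1\tDoe\tJohn\n\tSmith\tJane\n")
bad_lines = validate_tsv(tmp.name)
os.unlink(tmp.name)
print(bad_lines)  # -> [3]
```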
Some considerations:

Other resources: