hadoop - MapReduceIndexerTool does not reindex documents correctly

Question

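I am trying to use Cloudera Search batch indexing, on the Cloudera QuickStart VM, to index data that I currently have in a text file. I believe something is wrong with my schema or my morphline: the job completes, and in the Solr dashboard the core shows up, but it contains zero documents. I am fairly sure the command itself and Cloudera Search are fine, because the same setup worked earlier when I used the sample input files, schema, and morphline file from the examples; in that case the documents were indexed and added to the core. The command I am using is: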

hadoop --config /etc/hadoop/conf.cloudera.yarn jar  \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m'  \
--log4j '/usr/share/doc/search-1.0.0+cdh5.4.0+0/examples/solr-nrt/log4j.properties' \
--morphline-file /usr/share/doc/search-1.0.0+cdh5.4.0+0/examples/solr-nrt/test-morphlines/readMultiLine.conf \
--output-dir hdfs://quickstart.cloudera:8020/user/outdir --verbose --go-live \
--zk-host 127.0.0.1:2181/solr --collection collection1 \
hdfs://quickstart.cloudera:8020/user/indir

My schema is:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="sentences" version="1.5">         
 <fields>                   
   <field name="id" type="text_general" indexed="true" stored="true" required="true" multiValued="false" /> 
   <field name="sentence" type="text_general" indexed="true" stored="false"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <dynamicField name="ignored_*" type="ignored"/>       
 </fields>    

 <uniqueKey>id</uniqueKey>

 <types>        
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
      <fieldType name="random" class="solr.RandomSortField" indexed="true" />
      <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
      <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>    
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>    
 </types>   
</schema>

For my morphline file, I am using the one from the examples that simply reads one line at a time:

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [                    
      { 
        readLine {
          ignoreFirstLine : true
          commentPrefix : "#"
          charset : UTF-8
        }
      } 
      { logDebug { format : "output record: {}", args : ["@{}"] } }    
    ]
  }
]

My sample input is (DocID, tab, sentence):

1   For evening wear at the North Pole, girls could dress up in handsome Nordic sweaters and full iridescent taffeta skirts, or top one of the full striped skirts with a terrific short beige trench coat.    
2   But working to change the communist-run system is illegal, and the party relentlessly punishes dissent.    
3   Word of the latest document first came on Sept. 1, 1987, during a meeting between the pope and Jewish leaders in Castel Gandolfo, the pontiff's summer residence in the hills southeast of Rome.    
4   Anita Moen-Guidon of Norway was third, 2:28.6 behind Lazutina, and Russia's Julia Chepalova fourth, 2:53.5 behind.    
5   We have been beaten, we have shed blood, we have purchased the right to meet here today with our blood,'' said John Munuve, an assembly leader.
6   The folklore Nordic knits were handsome, in sweaters, or knee-length pants, and might have been topped by something like a super taffeta full coat.   
7   Several politicians have charged that the high taxes Kenyans already pay go into the pockets of government officials or wasteful projects, and not into providing essential services and repairing crumbling infrastructure.   
8   independence.

score 0 · Accepted Answer

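In your schema.xml, id is a required field. However, readLine only reads each line into the "message" field, so the records it produces have no id.

So you need to add id to your documents. You can use something like setValues, or you can change readLine to readCSV with a tab separator and column names, one of which should be id: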

readCSV {
  separator : "\t"
  columns : [id,message]
}
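
To make the difference concrete, here is a minimal Python sketch of what the two commands hand to the next step. It is not the Kite implementation: read_line, read_csv, missing_required, and the shortened sample text are hypothetical names and data made up for illustration, and the only fact taken from the schema is that id is required.

```python
# Illustrative simulation only; this is not how Kite morphlines are implemented.
REQUIRED_FIELDS = {"id"}  # from the schema: id has required="true"


def read_line(text):
    """Mimic readLine: one record per line, whole line stored in 'message'."""
    return [{"message": line} for line in text.splitlines() if line]


def read_csv(text, columns, separator="\t"):
    """Mimic readCSV: split each line on the separator and name the columns."""
    return [
        dict(zip(columns, line.split(separator)))
        for line in text.splitlines()
        if line
    ]


def missing_required(record):
    """Return the required schema fields that the record does not carry."""
    return sorted(REQUIRED_FIELDS - set(record))


# Shortened, made-up sample in the same "DocID <tab> sentence" layout.
sample = (
    "1\tFor evening wear at the North Pole.\n"
    "2\tBut working to change the system is illegal.\n"
)

line_records = read_line(sample)
csv_records = read_csv(sample, ["id", "message"])

print(missing_required(line_records[0]))  # ['id']
print(missing_required(csv_records[0]))   # []
```

With readLine the DocID stays buried inside the message text, so the required id field is never set. With readCSV the first column becomes id. Note that the schema above declares a field named sentence rather than message; whether the second column should be renamed to match depends on the rest of the morphline, which this answer does not cover.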
