0

我正在尝试使用 Apache Solr 和 TikaEntityProcessor 索引 HTML 文档,其想法是我可以使用 XPath 从 HTML 中选择特定元素。

我遵循了TikaEntityProcessor Solr Wiki 页面底部显示的高级示例。

当我尝试完成数据导入命令时,我收到以下错误消息:

03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
03-Oct-2012 16:39:48 org.apache.solr.core.SolrCore execute
INFO: [htmlTest] webapp=/apache-solr-3.6.1 path=/dataimport params={command=full-import} status=0 QTime=31 
03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
INFO: Read dataimport.properties
03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [htmlTest] REMOVING ALL DOCUMENTS FROM INDEX
03-Oct-2012 16:39:48 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
    commit{dir=C:\Program Files\Apache Tomcat\conf\apache-solr-3.5.0\htmlTest\data\index,segFN=segments_1e,version=1349187077567,generation=50,filenames=[_u.fnm, _u.nrm, _u.tis, _u.prx, _u.frq, _u.fdx, _u.fdt, _u.tii, segments_1e]
03-Oct-2012 16:39:48 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1349187077567
03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.SqlEntityProcessor initQuery
SEVERE: The query failed 'null'
java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
03-Oct-2012 16:39:48 org.apache.solr.common.SolrException log
SEVERE: Exception while processing: tika-test document : SolrInputDocument[{text=text(1.0)={<html>

<meta name="Content-Encoding" content="ISO-8859-1">
<meta name="Content-Type" content="text/html">
<title></title>

<body>
    <h1>This is my first heading</h1>


        This is some content


    <h1>This is my second heading</h1>


        This is some more content


</body></html>}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    ... 11 more

03-Oct-2012 16:39:48 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*} 0 31
03-Oct-2012 16:39:48 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    ... 5 more
Caused by: java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53)
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    ... 11 more

03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback

我的数据导入配置是:

<dataConfig>
    <dataSource type="BinFileDataSource"/>
    <dataSource type="FieldReaderDataSource" name="fld"/> 
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="C:/Program Files/Apache Tomcat/conf/apache-solr-3.5.0/htmlTest/data/html_basic.html" format="html">
                <field column="text"/>
                <entity type="XPathEntityProcessor" forEach="/html" dataField="text">
                    <field xpath="//h1"  column="date" />
                </entity>
        </entity>
    </document>
</dataConfig>

Solr 正在索引的 HTML 文档是:

<html>
<head>
</head>
<body>
    <h1>This is my first heading</h1>
    <div>
        This is some content
    </div>
    <h1>This is my second heading</h1>
    <div>
        This is some more content
    </div>
</body>

4

1 回答 1

0

您似乎缺少对正确数据源的引用。它需要是名为dateSource的实体上的一个属性,它与数据源定义本身的属性名称相匹配。您似乎已经定义了名称fld但没有引用它。

我建议对数据源和相应的实体都明确地这样做以避免混淆。

于 2013-07-21T04:37:35.057 回答