xml - Inputting arbitrary XML into Solr

Question

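I have a question about Apache Solr. If I have an arbitrary XML file and an XSD that it conforms to, how do I input it into Solr? Could I get a code sample? I know you have to parse the XML and put the relevant data into a Solr input document, but I don't understand how to do that.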

score 7 · Accepted Answer

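DataImportHandler (DIH) lets you pass the incoming XML through an XSL stylesheet, and also parse and transform the XML with DIH transformers. You can translate your arbitrary XML into Solr's standard input XML format via XSL, or map/transform the arbitrary XML to the Solr schema fields right in the DIH config file, or combine both. DIH is flexible.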
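To make the XSL route concrete, here is a minimal sketch of a stylesheet that turns arbitrary XML into Solr's standard `<add><doc><field>` update format. The input element names (`records`, `record`, `id`) are hypothetical, and the `_t` suffix assumes the schema has a matching dynamic field; neither comes from the original answer. If the real documents use XML namespaces, the match patterns would need namespace prefixes as well.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="UTF-8" />

    <!-- Assumed (hypothetical) input shape:
         <records><record><id>1</id><title>...</title>...</record></records> -->
    <xsl:template match="/records">
        <add>
            <xsl:apply-templates select="record" />
        </add>
    </xsl:template>

    <!-- Each <record> becomes one Solr <doc> -->
    <xsl:template match="record">
        <doc>
            <field name="id"><xsl:value-of select="id" /></field>
            <!-- Every other non-empty child element becomes a field named <element>_t -->
            <xsl:for-each select="*[not(self::id)][normalize-space()]">
                <field name="{local-name()}_t"><xsl:value-of select="." /></field>
            </xsl:for-each>
        </doc>
    </xsl:template>
</xsl:stylesheet>
```

A stylesheet like this would be referenced from the "xsl" attribute of the XPathEntityProcessor entity, together with useSolrAddSchema="true", since its output is already in Solr's update format.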

Sample dih-config.xml

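Here is a sample dih-config.xml from an actual working site (no pseudo-samples here, my friend). Note that it picks up xml files from a local directory on the LAMP server. If you would rather post the xml files directly via HTTP, you would need to configure a ContentStreamDataSource instead.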

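As it happens, the incoming xml in this sample is already in the standard Solr update xml format, so all the XSL does is remove empty field nodes. The real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume" and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and its output is then passed to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the xml is already in the standard Solr xml format. If that were not the case, the XPathEntityProcessor entity would need a "forEach" attribute and each field an "xpath" attribute to select content from the incoming xml document.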

<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" />
    <document>
        <!--
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 
        -->
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}"
        >

        <!--
            Pickupxmlfile parses standard Solr update XML.
            Incoming values are split into multiple tokens when given a splitBy attribute.
            Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.
        -->
        <entity 
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            datasource="pickupdir"
            stream="true"
            useSolrAddSchema="true"
            url="${pickupdir.fileAbsolutePath}"
            xsl="xslt/dih.xsl"
        >

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />
        </entity>
        </entity>
    </document>
</dataConfig>
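For XML that is not already in Solr's update format and is not run through an XSL, the inner entity could select content with xpath expressions instead of useSolrAddSchema. The fragment below is only a sketch of that variant: the element and attribute names (`records`, `record`, `authors`, `lang`) are hypothetical, and the target columns simply reuse field names from the sample above. Note that XPathEntityProcessor supports only a limited subset of XPath (plain element paths, attributes, and simple attribute predicates).

```xml
<!-- Replaces the inner "xml" entity; still nested inside the "pickupdir" entity -->
<entity
    name="xml"
    processor="XPathEntityProcessor"
    stream="true"
    forEach="/records/record"
    url="${pickupdir.fileAbsolutePath}"
>
    <!-- forEach marks where one Solr document starts; each xpath picks a value for it -->
    <field column="id" xpath="/records/record/id" />
    <field column="title_t" xpath="/records/record/title" />
    <!-- A path that matches several elements yields several values,
         so the target field must be multiValued in the schema -->
    <field column="creator_t" xpath="/records/record/authors/author" />
    <!-- Attribute values are selected with @name -->
    <field column="language_t" xpath="/records/record/@lang" />
</entity>
```

Transformers such as RegexTransformer can still be applied to these columns in the same way as in the sample.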

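Setting up DIH:

Make sure the DIH jars are referenced from solrconfig.xml, as they are not included in the Solr WAR file by default. One easy way is to create a lib folder in the Solr instance directory and put the DIH jars in it, since solrconfig.xml looks in the lib folder for references by default. You will find the DIH jars in the apache-solr-xxx/dist folder of the downloaded Solr package.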

(Image: location of the Solr DIH jars in the dist folder)

Create your dih-config.xml (as above) in the Solr "conf" directory.
Add a DIH request handler to solrconfig.xml if it is not there already.

Request handler:

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">dih-config.xml</str>
    </lst>
</requestHandler>
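As an alternative to copying the jars into a lib folder, solrconfig.xml can also point at them with a `<lib>` directive. The line below is a sketch only: the relative path and the jar file name pattern are assumptions that depend on the Solr version and on where the instance directory sits relative to the dist folder.

```xml
<!-- Inside <config> in solrconfig.xml.
     dir is resolved relative to the Solr instance directory (path assumed here);
     regex is matched against the file names found in that directory. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```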

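To trigger DIH:

There is a lot more information on full-import vs. delta-import, and on whether to commit, optimize, etc., in the wiki description of Data Import Handler Commands. The following request triggers the DIH operation without deleting the existing index first, and commits the changes once all the files have been processed. With the sample given above, it collects all the files found in the pickup directory, transforms them, indexes them and, finally, commits the updates to the index (which makes them searchable as soon as the commit finishes).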

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true
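For the HTTP alternative mentioned earlier, where the xml is posted directly rather than picked up from a directory, the config might look roughly like the sketch below. It assumes the posted xml is still in Solr's update format and reuses the sample's stylesheet path; the data source name "stream" is arbitrary. The xml file would then be sent as the body of an HTTP POST to the same /update/dih URL.

```xml
<dataConfig>
    <!-- ContentStreamDataSource hands the body of the POST request to the entity -->
    <dataSource type="ContentStreamDataSource" name="stream" />
    <document>
        <!-- No FileListEntityProcessor and no url attribute: the content comes from the request -->
        <entity
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            dataSource="stream"
            stream="true"
            useSolrAddSchema="true"
            xsl="xslt/dih.xsl"
        >
            <!-- same <field> mappings as in the sample above -->
        </entity>
    </document>
</dataConfig>
```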

score 1 · Accepted Answer

The easiest way is probably to use the DataImportHandler, which lets you first apply an XSL to transform your xml into Solr input xml.

score 0 · Accepted Answer

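After doing some research and not finding anything fully automated that does what you are asking for... I think I found something.

Lux SOLR might be what we are looking for: http://luxdb.org/SETUP.html

It seems that it somehow takes SOLR and makes it Lux-enabled so that it can index arbitrary XML.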
