I am using Solr 5.3.

I am trying to upload a Wikipedia page-article dump to Solr using "DataImportHandler", but I get only the id and title fields when I query.

Below is my data-config.xml:

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/mnt/TEST/enwiki-20150602-pages-articles1.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I have also added the entries below to schema.xml.

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="title"     type="string"  indexed="true" stored="false"/>
    <field name="revision"  type="int"    indexed="true" stored="true"/>
    <field name="user"      type="string"  indexed="true" stored="true"/>
    <field name="userId"    type="int"     indexed="true" stored="true"/>
    <field name="text"      type="text_en"    indexed="true" stored="false"/>
    <field name="timestamp" type="date"    indexed="true" stored="true"/>
    <field name="titleText" type="text_en"    indexed="true" stored="true"/>

I copied schema.xml from "example/example-DIH/solr/solr/conf/schema.xml" and removed all field entries, with a few exceptions as mentioned in the comments.

After importing the data I try to fetch all fields, but I get only "id" and "title".

I also tried to run the document import in debug mode to get some information about the indexing, but whenever I select debug mode it imports only 2 documents, and I am not sure why. Because of this I am not able to debug the indexing process.

Please guide me further.

EDIT: I am now sure that the other fields are not being indexed, because when I specify df=user or df=text I get the message below.

"msg": "undefined field user",

I am querying like this: http://localhost:8983/solr/wiki/select?q=%3A&fl=id%2Ctitle%2Ctext%2Crevision&wt=json&indent=true&debugQuery=true

3 Answers

1

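The configuration shown works only in classic schema mode, but managed schema mode is enabled by default in solrconfig. That is why I was not getting the text. In managed mode I do not need to define "schema.xml"; instead I should define the fields in data-config.xml as shown below.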

            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title_s"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user_s"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text_s"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text_s"/>
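
If you would rather keep the hand-edited schema.xml from the question, the alternative is to switch the core back to classic mode. A minimal sketch, assuming a stock solrconfig.xml for the wiki core: replace any existing schemaFactory element with the one below and reload the core. A data-driven configset may also define a schemaless update chain that has to be disabled for this to work, so check that against the reference guide for your version.

```xml
<!-- solrconfig.xml: read the schema from conf/schema.xml instead of the managed schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
```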
answered 2015-12-18T06:29:34.827
0

My dear friend, you have simply mistyped one of the fields. Try this link and you will want to laugh and cry at the same time.

http://localhost:8983/solr/wiki/select?q=*%3A*&fl=id+titleText+user+revision&wt=json&indent=true

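The title field you declared in the schema is "titleText", while your field list (fl) asks for "title" and "text" separately. So godspeed, and you can stay in touch with me via Hangouts: porous999@gmail.com

answered 2015-09-21T05:38:35.340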
0

I recently tried the same Wikipedia import with Solr 7. The text is not returned because that field is set to stored="false" in managed_schema:

<field name="text" type="text_en" indexed="true" stored="false"/>

Changing it to stored="true" will return the text.
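
For reference, the edited definition would look like the sketch below, assuming the field name and type from the question's schema. Documents indexed before the change do not gain stored values, so the import has to be run again afterwards.

```xml
<!-- stored="true" keeps the original value so it can be returned via fl -->
<field name="text" type="text_en" indexed="true" stored="true"/>
```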

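The currently accepted answer suggests using a text_s field, which is probably stored in the managed_schema of the Solr version the OP is using. Note that searching for terms contained in any non-stored field will still return the matching documents; only the text itself is not returned. For more information, see the answer here: Solr index vs stored

answered 2018-10-12T09:43:49.857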
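
The reason a name like text_s can work without an explicit field definition is a dynamic field rule. In the stock configsets I have looked at, the rule has the shape below; this is an assumption about the OP's setup, so verify it against your own managed-schema before relying on it.

```xml
<!-- any field whose name ends in _s is treated as a stored, indexed string -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
```

Because such a field is a plain string rather than text_en, it is matched as a whole value and not tokenized for full-text search.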