I have a doubt that I can't resolve by myself, however hard I try.

I think it is a matter of understanding.

So...

  • I'm trying to index a long text field (a product description) that can contain duplicate words. Say the description talks about a flavour and mentions chocolate, goes on, and then mentions chocolate again.

  • When Solr indexes the field (as far as I understand from the Analysis tab in the Solr admin panel), it creates a term for each token. Terms are "pointers": each term is associated with the uniqueKey attribute that identifies the "item".

Is the Solr index going to have two terms pointing to the same item?

This is my text analyzer:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

I thought the RemoveDuplicatesTokenFilterFactory would delete duplicate entries, but when I look at the analysis I see this:

screenshot

As far as I understand Solr, in the end my index is going to contain these three terms pointing to that "item": chocolate, blablabla and chocolate. Is that right?

I hope the question is clear :)

Thanks !

1 Answer

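What you see on the Analysis screen is the token stream before the text is indexed into Solr. When you actually index it, Solr stores each term only once and keeps all occurrences of that term in the form (document_id, position).

Hopefully the following example makes this clearer.

Suppose you want to add the following three documents to Solr: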
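For the case in the question, a minimal Python sketch of the idea might look like this. It is only an illustration, not how Solr or Lucene is implemented: the token list and the `item-1` id are made up, with `item-1` standing in for the document's uniqueKey value.

```python
from collections import defaultdict

# Hypothetical token stream, as the Analysis screen might show it for one document.
tokens = ["chocolate", "blablabla", "chocolate"]
doc_id = "item-1"  # assumed stand-in for the uniqueKey value

# term -> list of (document_id, position); each term is a single dictionary key.
postings = defaultdict(list)
for position, term in enumerate(tokens):
    postings[term].append((doc_id, position))

for term, occurrences in postings.items():
    print(term, occurrences)
```

Three tokens go in, but only two terms come out: `chocolate` is one key that carries two occurrences for the same item.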

T[0] = "dark chocolate is the best chocolate"

T[1] = "i love dark chocolate"

T[2] = "chocolate is delicious"

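Solr will store them in an inverted index like this:

"best" : {(T[0], position)}

"chocolate" : {(T[0], position1), (T[0], position2), (T[1], position), (T[2], position)}

"dark" : {(T[0], position), (T[1], position)}

"delicious" : {(T[2], position)}

"i" : {(T[1], position)}

"is" : {(T[0], position), (T[2], position)}

"love" : {(T[1], position)}

"the" : {(T[0], position)}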

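Notes:

  • position records where the term occurs in the document (its start offset and end offset)
  • The term chocolate is stored once in the index, but has two references to document T[0]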
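The three-document example can be reproduced with a small Python sketch. This is a toy model under stated assumptions, not Lucene's actual on-disk format: whitespace splitting stands in for the analyzer chain (no stop words, no stemming), and `build_index`, `docs_containing` and `term_frequency` are hypothetical helper names.

```python
import re
from collections import defaultdict

docs = [
    "dark chocolate is the best chocolate",  # T[0]
    "i love dark chocolate",                 # T[1]
    "chocolate is delicious",                # T[2]
]

def build_index(documents):
    """Map each term to a list of (doc_id, position, start_offset, end_offset)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # \S+ is an assumed stand-in for a real tokenizer.
        for position, match in enumerate(re.finditer(r"\S+", text)):
            index[match.group().lower()].append(
                (doc_id, position, match.start(), match.end())
            )
    return dict(index)

index = build_index(docs)

def docs_containing(term):
    """Sorted ids of the documents in which the term occurs at least once."""
    return sorted({entry[0] for entry in index.get(term, [])})

def term_frequency(term, doc_id):
    """Number of occurrences of the term inside one document."""
    return sum(1 for entry in index.get(term, []) if entry[0] == doc_id)

print(index["chocolate"])
print(docs_containing("is"))           # [0, 2]
print(term_frequency("chocolate", 0))  # 2
```

The key `chocolate` exists once in `index`; the repeated word in T[0] only adds a second entry to that key's list.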
answered 2013-05-15T18:00:50.093