solr - Duplicates terms on solr index

Question

I have a question that I can't answer myself, even after trying hard.

I think it is a matter of understanding.

So...

I'm trying to index a long text field (a product description), which can contain duplicate words. Let's say we are talking about a flavour and we say chocolate, then continue speaking and then say chocolate again.
When Solr is indexing (as far as I understand the Analysis tab in the Solr control panel), it creates a term for each token we have. Terms are "pointers": each term is associated with a uniqueKey attribute, which identifies the "item".

Is the Solr index going to have two terms pointing to the same item?

This is my text analyzer:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

I thought the last filter (RemoveDuplicatesTokenFilterFactory) deletes duplicate entries, but when I look at the analysis I find this:

screenshot

As far as I understand Solr, in the end my index is going to contain these three terms pointing to that "item": chocolate, blablabla and chocolate. Is that right?

I hope the question is clear :)

Thanks !

score 7 · Accepted Answer

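What you see on the Analysis screen is the token stream before the text is indexed into Solr. When you actually index it, Solr stores each term only once and saves all occurrences of that term in the form (document_id, position).

I hope the example below makes this clearer.

Suppose you want to add the following three documents to Solr: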

T[0] = "dark chocolate is the best chocolate"

T[1] = "i love dark chocolate"

T[2] = "chocolate is delicious"

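Solr will store them in the inverted index like this:

"best" : {(T[0], position)}

"chocolate" : {(T[0], position1), (T[0], position2), (T[1], position), (T[2], position)}

"dark" : {(T[0], position), (T[1], position)}

"delicious" : {(T[2], position)}

"i" : {(T[1], position)}

"is" : {(T[0], position), (T[2], position)}

"love" : {(T[1], position)}

"the" : {(T[0], position)}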
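
The listing above can be reproduced with a small Python sketch. It only illustrates the inverted-index idea and is not how Solr/Lucene is actually implemented: the `build_inverted_index` helper is a hypothetical name, tokenization is a plain whitespace split (no stop words or stemming), and the position is simply the token's ordinal within its document.

```python
from collections import defaultdict

# The three example documents T[0], T[1], T[2].
docs = [
    "dark chocolate is the best chocolate",
    "i love dark chocolate",
    "chocolate is delicious",
]

def build_inverted_index(documents):
    """Map each term to a list of (doc_id, position) postings.

    Hypothetical helper for illustration only; real analysis chains
    (tokenizer plus filters) are replaced by a whitespace split.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(text.lower().split()):
            # The term is a single dictionary key; repeated
            # occurrences only add more postings under that key.
            index[term].append((doc_id, position))
    return dict(index)

index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

In this sketch "chocolate" appears as one key with four postings, two of which refer to T[0].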

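Notes:

position records where the term occurs in the document (its token position and, if the field is configured to store them, its start and end character offsets)
The term chocolate is stored once in the index, but has two references to document T[0]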
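
As for why RemoveDuplicatesTokenFilterFactory left both chocolate tokens in the analysis output: as far as I know, that filter only drops a token when an earlier token with the same text sits at the same position (for example when a stemmer or synonym filter stacks identical tokens). It does not remove repeated words at different positions. A minimal sketch of that rule, assuming tokens are modelled as hypothetical (text, position increment) pairs rather than real Lucene token streams:

```python
from typing import List, Tuple

# Hypothetical token model: (term text, position increment).
# An increment of 0 means "same position as the previous token".
Token = Tuple[str, int]

def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop a token only if the same text was already seen at the same position."""
    kept: List[Token] = []
    seen_at_position = set()
    for text, pos_inc in tokens:
        if pos_inc > 0:
            # We moved to a new position, so forget earlier texts.
            seen_at_position = set()
        if text in seen_at_position:
            continue  # same text, same position: a true duplicate
        seen_at_position.add(text)
        kept.append((text, pos_inc))
    return kept

# Repeated word at different positions: nothing is removed.
stream = [("chocolate", 1), ("blablabla", 1), ("chocolate", 1)]

# Identical tokens stacked on one position: the repeat is removed.
stacked = [("chocolates", 1), ("chocolate", 0), ("chocolate", 0)]

print(remove_duplicates(stream))
print(remove_duplicates(stacked))
```

So the three tokens in the question's screenshot are expected, and the deduplication into a single term with several postings happens in the index itself.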
