I have a question that I have not been able to answer myself, despite trying hard.
I think it comes down to my understanding of how Solr indexes text.
I'm trying to index a long text field (a product description), which can contain duplicate words. Let's say the description talks about a flavour and mentions chocolate, continues with other text, and then mentions chocolate again.
When Solr indexes the field (as far as I understand from the Analysis tab in the Solr admin UI), it creates a term for each token we have. Terms are "pointers": each term is associated with the uniqueKey attribute that identifies the "item".
Is the Solr index going to have two terms pointing to the same item?
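To make the question concrete, here is a toy Python sketch of an inverted index. It is not Solr code: it assumes the usual Lucene-style layout in which each distinct term is stored once per field and its postings record the document and the positions where the term occurs. The `build_index` function and the `item-1` ID are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each distinct term to {doc_id: [positions]} (a toy postings list)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Whitespace split plus lowercasing stands in for a real analyzer chain.
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical document: the word "chocolate" appears twice.
index = build_index({"item-1": "chocolate blablabla chocolate"})
for term, postings in sorted(index.items()):
    print(term, postings)
```

In this model the repeated word produces a single `chocolate` entry whose postings hold two positions, rather than two separate terms pointing to the same item.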
This is my text analyzer:
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" enablePositionIncrements="true" />
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
I thought RemoveDuplicatesTokenFilterFactory deletes duplicate entries, but when I looked at the analysis I found this:
As far as I understand Solr, in the end my index is going to contain these three terms pointing to that "item": chocolate, blablabla and chocolate. Is that right?
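From what I can tell from the filter's documentation, it only drops a token when the same text already occurs at the same position (for example when an earlier filter stacks several tokens on one position), which might explain why both occurrences survive in the analysis output. Here is a toy Python sketch of that behaviour. It is not the real implementation, and the `(text, position_increment)` tuples and the `remove_duplicates` name are my own simplification.

```python
def remove_duplicates(tokens):
    """Drop a token only if the same text was already seen at the current position.

    tokens: list of (text, position_increment) pairs; an increment of 0 means
    the token is stacked on the same position as the previous one.
    """
    seen_at_position = set()
    kept = []
    for text, pos_inc in tokens:
        if pos_inc > 0:
            # Moving to a new position: forget what was seen at the old one.
            seen_at_position.clear()
        if text in seen_at_position:
            continue
        seen_at_position.add(text)
        kept.append((text, pos_inc))
    return kept


# Same word at different positions: nothing is removed.
repeated_in_text = [("chocolate", 1), ("blablabla", 1), ("chocolate", 1)]
# Same word stacked on one position: the second copy is removed.
stacked = [("chocolate", 1), ("chocolate", 0), ("blablabla", 1)]

print(remove_duplicates(repeated_in_text))
print(remove_duplicates(stacked))
```

If this reading is correct, the filter is not meant to collapse a word that is repeated later in the text, only exact duplicates sitting on one position.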
I hope the question is clear :)
Thanks!