Take a look at the source. Analyzers are usually quite readable. You just need to look at the createComponents method to see which Tokenizer and Filters it is using:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  final Tokenizer source = new StandardTokenizer(matchVersion, reader);
  TokenStream result = new StandardFilter(matchVersion, source);
  // prior to this we get the classic behavior, standardfilter does it for us.
  if (matchVersion.onOrAfter(Version.LUCENE_31))
    result = new EnglishPossessiveFilter(matchVersion, result);
  result = new LowerCaseFilter(matchVersion, result);
  result = new StopFilter(matchVersion, result, stopwords);
  if (!stemExclusionSet.isEmpty())
    result = new KeywordMarkerFilter(result, stemExclusionSet);
  result = new PorterStemFilter(result);
  return new TokenStreamComponents(source, result);
}
Whereas StandardAnalyzer is just StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter, EnglishAnalyzer rolls in EnglishPossessiveFilter, KeywordMarkerFilter, and PorterStemFilter as well.
Mainly, EnglishAnalyzer adds some English stemming enhancements, which should work well for purely English text.
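To see that difference in practice, here is a minimal sketch that runs the same sentence through both analyzers and prints the resulting tokens. It assumes a Lucene 4.x-era API; the Version constant, constructor signatures, and package locations may differ in your release:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {
  // Prints the tokens an analyzer produces for the given text.
  static void printTokens(Analyzer analyzer, String text) throws Exception {
    TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.print("[" + term.toString() + "] ");
    }
    ts.end();
    ts.close();
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    String text = "The dogs' owners were running quickly";
    // StandardAnalyzer: tokenization, lowercasing, default stop words; no stemming.
    printTokens(new StandardAnalyzer(Version.LUCENE_47), text);
    // EnglishAnalyzer: additionally strips possessives and applies Porter stemming,
    // so terms like "owners" and "running" come out as "owner" and "run".
    printTokens(new EnglishAnalyzer(Version.LUCENE_47), text);
  }
}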
For StandardAnalyzer, the only assumption I know of that is directly tied to English analysis is the default stop word set, and that, of course, is just a default and can be changed. StandardAnalyzer now implements Unicode Standard Annex #29, which attempts to provide language-agnostic text segmentation.
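For instance, the stop word set can be swapped out at construction time. A minimal sketch, again assuming the Lucene 4.x-era constructor StandardAnalyzer(Version, CharArraySet); older and newer releases use a slightly different signature and package for CharArraySet:

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CustomStopWords {
  public static void main(String[] args) {
    // Replace the default English stop word list with a custom one.
    CharArraySet myStopWords =
        new CharArraySet(Version.LUCENE_47, Arrays.asList("the", "a", "an"), true);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47, myStopWords);

    // Passing CharArraySet.EMPTY_SET disables stop word removal entirely.
    Analyzer noStops = new StandardAnalyzer(Version.LUCENE_47, CharArraySet.EMPTY_SET);
  }
}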