lucene - How to override stop words in Lucene

Question

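I am creating a Lucene index in a folder and indexing the contents of txt files. I want the content to be indexed without stop words. After the text goes through the analyzer, the stop words are indeed gone when I search, but the whole text still shows up in the index. My code is below: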

    IndexWriter writer = new IndexWriter(new SimpleFSDirectory(indexDir),
                        new SpanishAnalyzer(Version.LUCENE_36),
                        create,
                        IndexWriter.MaxFieldLength.UNLIMITED);
    if (!file.isHidden() && file.exists() && file.canRead()) {
        String fileName = file.getName();
        String type = Files.extension(file);
        if (type == null) {
            type = "";
        }
        Document d = new Document();

        d.add(new Field("Name", fileName,
                        Store.YES, Index.ANALYZED, Field.TermVector.YES));
        d.add(new Field("Type", type,
                        Store.YES, Index.ANALYZED));
        // Only the text of .txt and .log files is indexed.
        if ("txt".equalsIgnoreCase(type) || "log".equalsIgnoreCase(type)) {
            String content = Files.readFromFile(file, "ASCII");
            d.add(new Field("Content", content, Store.YES, Index.ANALYZED, Field.TermVector.YES));
        }

        // d is declared inside the if block, so it has to be added here.
        writer.addDocument(d);
    }

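The sample file contains the text "install to directory". If I search for "a", "to", or "of", nothing is found, which means the text has passed through the analyzer successfully. Looking at the index with Luke, I see that the field contains "install to directory", but the Field.TermVector contains only "install" and "directory", which is all I want to appear in the field.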

Thank you.

score 2 · Accepted Answer

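You are using the SpanishAnalyzer constructor that takes only a Version, which applies the default stop words. Use the constructor that takes the stop words as a parameter instead.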

Create the index writer as follows:

    IndexWriter writer = new IndexWriter(new SimpleFSDirectory(indexDir),
                        new SpanishAnalyzer(Version.LUCENE_36, new HashSet<String>()),
                        create,
                        IndexWriter.MaxFieldLength.UNLIMITED);

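Here we pass an empty set of stop words, which overrides the defaults so that no stop words are removed. It is also worth reading more about Lucene stop words.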
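To make the effect of the stop-word set concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: `StopWordDemo`, `analyze`, and the three-word stop list are made up for illustration, and the real SpanishAnalyzer defaults and tokenization differ. It only mimics how the same text yields different terms depending on the set that is passed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer; this is NOT Lucene's implementation.
class StopWordDemo {

    // Lower-cases the text, splits on whitespace, and drops tokens found in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String content = "install to directory";
        // Illustrative stop list only (assumption); not the analyzer's real defaults.
        Set<String> someStopWords = new HashSet<String>(Arrays.asList("a", "to", "of"));

        // With a stop list, "to" is dropped: prints [install, directory]
        System.out.println(analyze(content, someStopWords));
        // With an empty set, nothing is dropped: prints [install, to, directory]
        System.out.println(analyze(content, Collections.<String>emptySet()));
    }
}
```

In this sketch the empty set plays the same role as the `new HashSet<String>()` argument above: with no entries to match, every token is kept as a term.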

score 1 · Accepted Answer

Try using a different constructor for SpanishAnalyzer. Instead of

    new SpanishAnalyzer(Version.LUCENE_36)

use

    new SpanishAnalyzer(Version.LUCENE_36, Collections.emptySet())
