c# - Is an IFilter necessary to index full-text documents with Lucene.NET?

Question

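I am moving forward in my project and have come to a crossroads in dealing with file contents. I have successfully created a working index with some categorical fields, but I now want to apply keyword search to the contents of the files. My problem is that I am unsure whether passing Lucene a reader results in the API indexing the entire contents of the file. I did some searching online and found suggestions that an IFilter is necessary. Is this true? It seems somewhat complicated. In any case, my code for indexing file contents is below and does not work (it fails if a reader is passed). Ideally I would like to be able to process doc and docx files. Any help is much appreciated.

My code for creating the reader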

    public void setFileText()
    {
        var FD = new System.Windows.Forms.OpenFileDialog();
        StreamReader reader;
        if (FD.ShowDialog() == System.Windows.Forms.DialogResult.OK)
        {
            string fileToOpen = FD.FileName;
            reader = new StreamReader(fileToOpen);
        }
        else
        {
            reader = null;
        }
        this.FileText = reader;
    }
}

My code for adding the document to the index

    private static void _addToLuceneIndex(MATS_Doc Data, IndexWriter writer)
    {
        // remove older index entry
        // Query searchQuery = new TermQuery(new Term("Id", Data.Id.ToString()));
        // writer.DeleteDocuments(searchQuery);

        // add new index entry
        Document doc = new Document();

        // add lucene fields mapped to db fields
        doc.Add(new Field("Id", Data.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (!string.IsNullOrEmpty(Data.Title))
            doc.Add(new Field("Title", Data.Title, Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (!string.IsNullOrEmpty(Data.Plant))
            doc.Add(new Field("Plant", Data.Plant, Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (!string.IsNullOrEmpty(Data.Containment))
            doc.Add(new Field("Containment", Data.Containment, Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (!string.IsNullOrEmpty(Data.Part))
            doc.Add(new Field("Part", Data.Part, Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (!string.IsNullOrEmpty(Data.Operation))
            doc.Add(new Field("Operation", Data.Operation, Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (!string.IsNullOrEmpty(Data.Geometry))
            doc.Add(new Field("Geometry", Data.Geometry, Field.Store.YES, Field.Index.NOT_ANALYZED));
        if (Data.FileText != null)
            doc.Add(new Field("Text", Data.FileText));

        // add entry to index
        writer.AddDocument(doc);
    }

score 2 · Accepted Answer

Using IFilters is actually quite simple.

I would suggest using Eclipse.IndexingService (in C#).

Then all you have to do (apart from installing the IFilters, if needed) is:

using (FilterReader filterReader = new FilterReader(path, Path.GetExtension(path)))
{
     filterReader.Init();
     string content = filterReader.ReadToEnd();
}

You can read more about IFilter here:

http://www.codeproject.com/Articles/31944/Implementing-a-TextReader-to-extract-various-files

http://www.codeproject.com/Articles/13391/Using-IFilter-in-C
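
To connect this back to the question's `_addToLuceneIndex`, the extracted string can be added as an analyzed field instead of passing a raw `StreamReader`. What follows is only a minimal sketch. It assumes Lucene.Net 3.x (the `Field.Index` API already used in the question) and assumes that `FilterReader` lives in a namespace called `Eclipse.IndexingService`, which is a guess based on the name given above. The helper name `AddFileText` is hypothetical.

```csharp
using System.IO;
using Lucene.Net.Documents;
using Eclipse.IndexingService; // assumption: namespace that provides FilterReader

public static class FileTextIndexing
{
    // Hypothetical helper: extracts text through an IFilter and adds it to the document.
    public static void AddFileText(Document doc, string path)
    {
        string content;
        using (FilterReader filterReader = new FilterReader(path, Path.GetExtension(path)))
        {
            filterReader.Init();
            content = filterReader.ReadToEnd();
        }

        if (!string.IsNullOrEmpty(content))
        {
            // ANALYZED tokenizes the text so individual keywords can be searched;
            // Store.NO keeps the full body out of the stored fields.
            doc.Add(new Field("Text", content, Field.Store.NO, Field.Index.ANALYZED));
        }
    }
}
```

The categorical fields in the question use `Field.Index.NOT_ANALYZED`, which indexes each value as a single term. That suits exact matching on values such as `Plant`, but body text generally needs `ANALYZED` for keyword search to work.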

score 0 · Accepted Answer

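Another option that may be worth looking into is RavenDB, which uses Lucene.Net internally as its indexing engine. It looks like you are writing a desktop application, so you should consider RavenDB's embedded mode.

You can then use my Indexed Attachments Bundle, which manages most of this for you. You just upload the document as an attachment, and it uses IFilters to extract the text from the document. It automatically builds an index over that text, and you can then run full-text Lucene searches against that index. If needed, you can even highlight the matched search terms.

Documentation for the bundle is currently lacking, but you should be able to glean what you need from the unit tests.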

score 0 · Accepted Answer

Lucene by itself cannot handle .doc and .docx files. Solr may be worth a look here, since Lucene itself is only a library for building search engines.
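
To illustrate why the text has to be extracted first: a .docx file is a ZIP package whose body text normally sits in `word/document.xml`, so a plain `StreamReader` over the file yields compressed bytes rather than words. The sketch below is not the IFilter approach from the other answers. It is a rough alternative for .docx only, using the .NET base class library (`System.IO.Compression`, .NET 4.5 or later), and the class name `DocxText` is hypothetical. It does not cover the binary .doc format, and text inside text boxes may appear twice because nested paragraphs are visited again.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by the w:p and w:t elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the paragraph text of a .docx stream, one paragraph per line.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                return string.Empty;

            using (Stream xml = entry.Open())
            {
                XDocument body = XDocument.Load(xml);
                var paragraphs = body.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join("\n", paragraphs);
            }
        }
    }
}
```

The returned string can then be added to a Lucene document as an analyzed field, in the same way as any other extracted text.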
