c#-4.0 - How do I prevent Solr from adding headers and footers?

Question

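I have a web crawler (NCrawler) that scrapes website content, and I have added code to index the data into Solr. My requirement is to keep each site's header, footer, and navigation pane from being sent to Solr for indexing.

Is there a way to do this? Any help would be appreciated.

Thanks, Anu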

score 0 · Accepted Answer

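You can use the HtmlDocumentProcessor class, whose constructor takes a filterTextRules parameter. Pass it a Dictionary<string,string> in which each key is the start string and each value is the end string of a region of markup to filter out.

For example, suppose your HTML pages have a header and a footer structured like this: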

 <!-- Begin Header -->
 all header markup is here
 <!-- End Header -->

 <!-- Begin Footer -->
 all footer markup is here
 <!-- End Footer -->

In that case, you can initialize HtmlDocumentProcessor in the pipeline as follows:

    // Each entry maps a start marker to its end marker. The marker text is
    // written to match the sample markup above, including the space after "<!--".
    var pipelines = new IPipelineStep[]
    {
        new HtmlDocumentProcessor(
            new Dictionary<string, string>
            {
                { "<!-- Begin Header", "<!-- End Header" },
                { "<!-- Begin Footer", "<!-- End Footer" },
            },
            null),
        new PdfIFilterProcessor(),
        new TextDocumentProcessor(),
    };

    using (var crawler = new NCrawler.Crawler(
        new Uri("http://ncrawler.codeplex.com"), pipelines))
    {
        // Processing here
    }

Hope this helps. For more details on the filterTextRules parameter and how it works, see the HtmlDocumentProcessor source code.
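To make the idea of start/end marker rules concrete, here is a minimal standalone sketch of marker-based stripping. It is not NCrawler's implementation: `MarkerFilter` and `StripRegions` are hypothetical names, and the sketch assumes plain case-insensitive substring matching and that both markers are removed together with the text between them. It uses the full comment markers so that no stray `-->` is left behind.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerFilter
{
    // Hypothetical helper (not part of NCrawler): removes every region that
    // begins with a rule's key and ends with that rule's value, markers included.
    public static string StripRegions(string html, IDictionary<string, string> rules)
    {
        foreach (var rule in rules)
        {
            // Empty markers would match everywhere, so skip them.
            if (rule.Key.Length == 0 || rule.Value.Length == 0)
            {
                continue;
            }

            int start;
            while ((start = html.IndexOf(rule.Key, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                int end = html.IndexOf(
                    rule.Value, start + rule.Key.Length, StringComparison.OrdinalIgnoreCase);
                if (end < 0)
                {
                    break; // start marker without an end marker: leave the text as is
                }

                // Cut from the start marker through the end of the end marker.
                html = html.Remove(start, end + rule.Value.Length - start);
            }
        }

        return html;
    }

    public static void Main()
    {
        var rules = new Dictionary<string, string>
        {
            { "<!-- Begin Header -->", "<!-- End Header -->" },
            { "<!-- Begin Footer -->", "<!-- End Footer -->" },
        };

        string page = "<!-- Begin Header -->menu<!-- End Header -->"
                    + "<p>Body</p>"
                    + "<!-- Begin Footer -->legal<!-- End Footer -->";

        Console.WriteLine(StripRegions(page, rules)); // prints "<p>Body</p>"
    }
}
```

If your pages do not carry marker comments like these, a helper of this kind has nothing to anchor on, so the markers would first need to be added to the site templates.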
