stormcrawler - StormCrawler XPathFilter - internal representation

Question

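When StormCrawler fetches a website, it applies the configured XPathFilter to an HTML representation that is not the original HTML. For example, tags get inserted, or a DIV becomes an H3, and so on. The following configuration, for instance, stores HTML in Elasticsearch that differs from the original: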

    {
      "com.digitalpebble.stormcrawler.parse.ParseFilters": [
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
          "name": "XPathFilter",
          "params": {
            "canonical": "//*[@rel=\"canonical\"]/@href",
            "parse.html": [
              "//HTML"
            ]
          }
        },
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
          "name": "DomainParseFilter",
          "params": {
            "key": "domain",
            "byHost": false
          }
        }
      ]
    }

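This makes it hard to write XPath expressions based on the website's original source code. Is there a way to configure StormCrawler so that the XPathFilter expressions are applied to the original website source?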

score 0 · Accepted Answer

Which version of StormCrawler are you on? Are you using Tika or Jsoup for the parsing? AFAIK Jsoup does not modify the content, but Tika probably does. I would recommend using the JSoup-based ParserBolt for HTML content and Tika for anything else.

You can use the DebugParseFilter to see what the DOM looks like.
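
As a rough sketch, the DebugParseFilter could be added to the same parse filters file as the configuration in the question. This assumes the class lives in the same `com.digitalpebble.stormcrawler.parse.filter` package as the other filters and that it needs no parameters; both are worth checking against the version you are running, as is the location where it writes its DOM dump.

```json
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter",
      "name": "DebugParseFilter",
      "params": {}
    }
  ]
}
```

Comparing the dumped DOM with the page source should show which elements the parser has inserted or renamed, so the XPath expressions can be written against the tree the filter actually sees.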
