nutch - Nutch: anchor text of the current URL

Question

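In an indexing filter, is there a way to find the anchor text from which the current URL/document originated? I tried inlinks, but it appears to be empty.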

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum,
        Inlinks inlinks) throws IndexingException {

    // Need to know the anchor text from which the current document originated at this point

    return doc;
}

If the current URL is http://foo.com/pagex, then a link pointing to pagex must have been found on http://foo.com. I need to know the anchor text of that link.

score 0 · Accepted Answer

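The anchor text can be found in inlinks, but for it to be populated, both db.ignore.internal.links and linkdb.ignore.external.links must be set to false in nutch-default.xml. Alternatively, they can be overridden in nutch-site.xml.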
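
As a rough sketch, the override in nutch-site.xml might look like the fragment below. The two property names are copied from the answer as written; they may differ between Nutch versions, so it is worth checking them against the nutch-default.xml shipped with your release.

```xml
<!-- nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <!-- Property names taken from the answer above; verify them for your Nutch version -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <property>
    <name>linkdb.ignore.external.links</name>
    <value>false</value>
  </property>
</configuration>
```

Putting the overrides in nutch-site.xml rather than editing nutch-default.xml keeps the shipped defaults untouched.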
