web-crawler - How to extend Nutch for article crawling

Question

I was looking for a framework to crawl articles and found Nutch 2.1. Here is my plan, with the question I have at each step:

1

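Add the article list pages to url/seed.txt. Here is the first problem: what I actually want indexed are the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch does nothing, because the list pages are the entry points. So how can I index only the article pages, without the list pages?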

2

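Write a plugin to parse the 'author', 'date', 'article body', 'title', and possibly other information from the HTML. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page), and the 'WebPage' class has some predefined attributes: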

public class WebPage extends PersistentBase {
  // ...
  private Utf8 baseUrl;
  // ...
  private ByteBuffer content; // <== This becomes null in IndexFilter
  // ...
  private Utf8 title;
  private Utf8 text;
  // ...
  private Map<Utf8,Utf8> headers;
  private Map<Utf8,Utf8> outlinks;
  private Map<Utf8,Utf8> inlinks;
  private Map<Utf8,Utf8> markers;
  private Map<Utf8,ByteBuffer> metadata;
  // ...
}

So, as you can see, there are 5 maps I could put my own attributes in. But 'headers', 'outlinks', and 'inlinks' do not seem to be meant for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
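
For illustration, here is a minimal sketch of what storing parsed fields in a metadata-style map could look like. It is plain Java, not Nutch code: the class ArticleMetadata, its put/get helpers, and the "article.*" key names are hypothetical, and String keys stand in for the Utf8 keys of the real WebPage so the sketch compiles without Nutch on the classpath. Whether 'metadata' is the intended place for such fields is still the open question above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: round-trips text values through a
// map shaped like WebPage's Map<Utf8,ByteBuffer> metadata field.
class ArticleMetadata {

    // Encode the value as UTF-8 bytes and store it under the given key.
    static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Decode a stored value back to a String; returns null if the key is absent.
    static String get(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() keeps the stored buffer's position untouched,
        // so the same entry can be read more than once.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();
        put(metadata, "article.author", "Jane Doe");   // assumed key name
        put(metadata, "article.date", "2013-01-15");   // assumed key name
        System.out.println(get(metadata, "article.author") + " / " + get(metadata, "article.date"));
    }
}
```

The same encode/decode pair would be needed on both sides: in the parser plugin when writing the fields, and in whatever later step reads them back for indexing.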

3

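After the articles are indexed into Solr, another application can query them by 'date' and then store the article information in MySQL. My question here is: can Nutch store the articles directly in MySQL? Or can I write a plugin to specify the indexing behavior?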

Is Nutch a good fit for my purpose? If not, can you recommend another good framework or library? Thanks for your help.

score 1 · Accepted Answer

If you only need to extract articles from a few websites, take a look at http://www.crawl-anywhere.com/

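It comes with an admin UI where you can specify that you want to use the boilerpipe article extractor (which is great). You can also specify, via URL pattern matching, which pages you want crawled only and which pages you want crawled and indexed.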
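
To make the "crawl only" versus "crawl and index" distinction concrete, here is a rough sketch of the idea in plain Java. It is not the Crawl Anywhere or Nutch API: the class UrlClassifier, the Action enum, and both URL patterns (for an imaginary example.com) are assumptions made up for illustration. List pages are followed so their links are discovered, but only article pages are marked for indexing.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: decide per URL whether a page is skipped, crawled only
// for its outlinks, or crawled and indexed.
class UrlClassifier {

    enum Action { SKIP, CRAWL_ONLY, CRAWL_AND_INDEX }

    // Assumed URL layout: list pages look like /articles or /articles?page=N.
    private static final Pattern LIST_PAGE =
            Pattern.compile("^https?://example\\.com/articles/?(\\?page=\\d+)?$");

    // Assumed URL layout: article pages look like /articles/<id>/<slug>.
    private static final Pattern ARTICLE_PAGE =
            Pattern.compile("^https?://example\\.com/articles/\\d+/[\\w-]+$");

    static Action classify(String url) {
        if (ARTICLE_PAGE.matcher(url).matches()) {
            return Action.CRAWL_AND_INDEX;   // the content we actually want
        }
        if (LIST_PAGE.matcher(url).matches()) {
            return Action.CRAWL_ONLY;        // entry point: follow links, do not index
        }
        return Action.SKIP;                  // everything else is ignored
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/articles?page=2",
            "http://example.com/articles/123/some-title",
            "http://example.com/about"
        };
        for (String url : urls) {
            System.out.println(classify(url) + "  " + url);
        }
    }
}
```

The same split addresses the first question: the list pages stay in the crawl as entry points, and the indexing decision is made separately per URL.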
