filesystems - How do I make Nutch crawl a file system?

Question

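I don't want to crawl over HTTP (for example http://localhost:81 and so on), but to crawl a directory on the local file system directly. Is there a way to do this?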

score 4 · Accepted Answer

From the Nutch wiki:

How do I index my local file system?

http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6

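1) crawl-urlfilter.txt needs a change so that it allows file: URLs and does not follow http: ones; otherwise it either won't index anything, or it will jump off your disk onto web sites. Change this line: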

  -^(file|ftp|mailto|https):

  to this:

  -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom that reject some URLs. If it has this fragment, it is probably fine:

  # accept anything else
  +.*

3) I changed my nutch.xml to include the following:

<Parameter override="false" name="plugin.includes" value="protocol-file|protocol-http|urlfilter-regex|parse-(msword|pdf|text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"/>
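To see why these three changes work together, here is a rough Python sketch. It is not Nutch code. It assumes the regex URL filter applies rules from top to bottom, that the first matching rule decides, and that a URL matching no rule is rejected. It also assumes plugin.includes is a regular expression that must match a plugin id in full. The helper names `url_filter` and `plugin_enabled` and the path `/data/docs` are made up for illustration.

```python
import re

# crawl-urlfilter.txt after steps 1 and 2: '-' rejects, '+' accepts.
# Assumption: rules are tried in order and the first match decides.
RULES = [
    ("-", r"^(http|ftp|mailto|https):"),  # step 1: skip non-file schemes
    ("+", r".*"),                         # step 2: accept anything else
]

# The plugin.includes value from step 3, copied verbatim.
PLUGIN_INCLUDES = (
    "protocol-file|protocol-http|urlfilter-regex|"
    "parse-(msword|pdf|text|html|js)|index-(basic|anchor)|"
    "query-(basic|site|url)|response-(json|xml)|summary-basic|"
    "scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def url_filter(url, rules=RULES):
    """Return the URL if a '+' rule matches first, otherwise None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None  # no rule matched: treated as rejected in this sketch


def plugin_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """True if the whole plugin id matches the plugin.includes regex."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Hypothetical URLs, only to exercise the rules.
    for url in ("file:///data/docs/report.pdf", "http://localhost:81/"):
        print(url, "->", "accepted" if url_filter(url) else "rejected")
    for plugin in ("protocol-file", "protocol-ftp"):
        print(plugin, "->", "on" if plugin_enabled(plugin) else "off")
```

With the original filter line (the one that starts with `file`), the first rule would reject every file: URL before the catch-all `+.*` is reached, which matches the "won't index anything" symptom described in step 1.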

score 1 · Accepted Answer

Nutch has intranet crawling functionality available. You can read the details here.

Answered 2009-06-12T18:25:53.213
