Is it possible to crawl/fetch only plain HTML pages with Nutch (that is, no images, video, flash, excel, exe, pdf or word files)?
How can I check a page's Content-Type with Nutch and fetch only text/html pages?
Edit conf/regex-urlfilter.txt and add a rule listing the file suffixes to ignore:
-\.(jpg|gif|zip|ico)$
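To see which URLs a rule like this would drop, here is a minimal Python sketch of first-match-wins filtering in the style of regex-urlfilter.txt. It is not Nutch code: the function name accepts, the RULES list, the trailing "+." catch-all, and the example.com URLs are all assumptions made for illustration. It also assumes that a URL matching no rule is rejected.

```python
import re

# Rules written the way they appear in conf/regex-urlfilter.txt:
# a leading '-' rejects, a leading '+' accepts, first match decides.
RULES = [
    r"-\.(jpg|gif|zip|ico)$",  # the suffix rule from the answer
    r"+.",                     # assumed catch-all: accept anything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in (
        "http://example.com/index.html",
        "http://example.com/logo.jpg",
        "http://example.com/report.pdf",
        "http://example.com/photo.JPG",
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions, report.pdf and photo.JPG are still accepted: the rule only covers the four listed suffixes and is case-sensitive. To skip the other file types named in the question, the suffix list would have to include them (for example pdf, doc, xls, exe, swf) in both cases. Note also that this filters on the URL text only; it does not inspect the Content-Type header the server returns.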