solr - How to crawl a page in Nutch 2.1 without fetching video/image content?

Question

I want to crawl a page and only need the HTML itself, skipping all images, videos, and so on. Is that possible? Thanks in advance.

score 1 · Accepted Answer

Check the regex-urlfilter.txt file.

You can list the file extensions that you don't want to index there. For example:

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
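To see which URLs a suffix rule like this would exclude, the pattern can be tried locally before a crawl. The following is only a rough sketch in Python, not part of Nutch: it assumes the filter treats a URL as excluded when the pattern is found anywhere in it, and that Python's `re` behaves like the Java regex engine for this simple expression. The helper name `is_skipped` and the example URLs are made up for illustration.

```python
import re

# Same suffix pattern as the rule above, without the leading "-"
# (in regex-urlfilter.txt the "-" marks the rule as an exclusion).
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
    r"zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|"
    r"exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)


def is_skipped(url: str) -> bool:
    """Hypothetical helper: True if the URL ends in one of the listed suffixes."""
    return SKIP_SUFFIXES.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/logo.png",
        "http://example.com/clip.mov",
        "http://example.com/clip.mp4",
        "http://example.com/logo.png?v=2",
    ):
        print(url, "->", "skipped" if is_skipped(url) else "kept")
```

Under these assumptions, a suffix that is not in the list (such as mp4) is not excluded, and neither is a URL that has a query string after the extension, because the pattern is anchored with `$`. Any extra video or image extensions would have to be added to the rule by hand.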

