nutch - How to speed up crawling in Nutch

Question

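I am trying to develop an application in which I give Nutch a restricted set of URLs in the urls file. I can crawl these URLs and get their contents by reading the data from the segments.

I crawl with depth 1 because I do not care about the outlinks or inlinks of the pages. I only need the contents of the pages listed in the urls file.

But this crawl takes a long time. Please suggest ways to reduce the crawl time and increase the crawl speed. I also do not need indexing, because I am not concerned with the search part.

Does anyone have suggestions on how to speed up the crawl?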

score 7 · Accepted Answer

The main way to gain speed is to configure nutch-site.xml:

<property>
  <name>fetcher.threads.per.queue</name>
  <value>50</value>
  <description></description>
</property>

score 6 · Accepted Answer

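You can scale up the threads in nutch-site.xml. Increasing both fetcher.threads.per.host and fetcher.threads.fetch will increase the speed at which you crawl. I have noticed drastic improvements. Be careful when increasing these, though. If you do not have the hardware or the connection to support the increased traffic, the number of errors in the crawl can rise significantly.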
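
A minimal sketch of how those two settings might look in nutch-site.xml. The values are illustrative assumptions, not numbers taken from this answer. The per-host limit appears in this thread under two names (fetcher.threads.per.host and fetcher.threads.per.queue), so use whichever one your version's nutch-default.xml defines.

<!-- Illustrative values only (assumptions); tune them to your hardware and bandwidth. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>

Raising the values in small steps and watching the error count after each crawl makes it easier to notice when the hardware or connection stops keeping up.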

score 4 · Accepted Answer

For me, this property helped a lot, because one slow domain can slow down the whole fetch phase:

 <property>
  <name>generate.max.count</name>
  <value>50</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
 </property>

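For example, if you respect robots.txt (the default behaviour) and a domain is slow to crawl, the delay between requests to it can be as long as fetcher.max.crawl.delay. Many URLs from that domain in a queue will slow down the whole fetch phase, so it is better to limit generate.max.count.

You can add the following property in the same way to limit the duration of the fetch phase: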
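
For illustration, the crawl-delay cap mentioned above is set like any other property in nutch-site.xml. The value 10 is an assumption chosen for the example. The understanding assumed here is that a robots.txt Crawl-Delay longer than this many seconds makes the fetcher skip the page instead of waiting; the description in your own nutch-default.xml should confirm the exact behaviour for your version.

<!-- Assumed example value: cap on the robots.txt Crawl-Delay, in seconds. -->
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>10</value>
</property>

A lower cap trades completeness for speed, since pages on hosts that ask for a longer delay may not be fetched at all.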

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads fewer
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

But please do not touch the fetcher.threads.per.queue property, or you will end up on a blacklist. It is not a good way to improve crawl speed.

score 2 · Accepted Answer

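Hello, I am also new to crawling, but I tried a few things and got good results. You may get them too by changing these properties, as I did in my nutch-site.xml: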

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if
   fetcher.threads.per.queue is set to 1.
  </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>


<property>
  <name>fetcher.threads.per.host</name>
  <value>25</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

Please suggest more options. Thanks.

score 0 · Accepted Answer

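I had a similar problem and was able to improve the speed with the help of https://wiki.apache.org/nutch/OptimizingCrawls

It has useful information about what can slow down your crawl and what you can do to improve each of those issues.

Unfortunately, in my case the queues are quite unbalanced and I cannot send requests to the larger queues too fast, otherwise I get blocked. So I will probably need a cluster solution or TOR before I can speed up the threads any further.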

score -1 · Accepted Answer

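If you do not need to follow links, I see no reason to use Nutch. You can simply take your list of URLs and fetch them with an HTTP client library or a simple script that uses curl.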
