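I am using Nutch. I plan to crawl a shared disk rather than internet websites.

One thing I am worried about is that crawling will make the disk very slow. How can I crawl the shared disk without bringing it down?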

1 Answer

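You can set the number of threads and the wait time between requests in conf/nutch-site.xml.

Try overriding these properties and setting them to values you are comfortable with: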

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a queue at one time.
   </description>
</property>
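
The two properties above only cover the thread counts. For the wait time between requests, a minimal sketch could look like the following. It assumes the `fetcher.server.delay` property from nutch-default.xml is honoured for the protocol you crawl the shared disk with; the value shown is only illustrative, and the description is a paraphrase rather than the shipped text.

<property>
  <name>fetcher.server.delay</name>
  <!-- Illustrative value (an assumption): raise it if the disk still
       feels sluggish, lower it if the crawl is too slow. -->
  <value>5.0</value>
  <description>The number of seconds the fetcher waits between
  successive requests to the same server (queue).
  </description>
</property>

With fetcher.threads.per.queue left at 1, this delay should space out successive reads from the same queue, so the thread count and the delay together bound how hard the crawl can hit the disk.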
answered 2013-10-16T22:23:12.733