
I am using Nutch 2.1 to crawl a site. The problem is that the crawler keeps showing "fetching urls ... spinwaiting/active", and because the fetch takes so long, the connection to MySQL times out. How can I reduce the number of URLs fetched at a time so that MySQL does not time out? Is there a setting in Nutch where I can say: fetch only 100 or 500 URLs, then parse them and store them to MySQL, and then fetch the next 100 or 500 URLs?

Error message:

Unexpected error for http://www.example.com
java.io.IOException: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
    at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:587)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.output(FetcherReducer.java:663)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:534)
Caused by: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
    at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
    ... 5 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
    at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1116)
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3364)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1983)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
    ... 7 more
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3345)
    ... 13 more

1 Answer


> I am using Nutch 2.1 and crawling a site. The problem is that the crawler keeps showing "fetching urls ... spinwaiting/active" and since the fetch takes a long time, the connection to MySQL times out. How can I reduce the number of URLs fetched at a time so that MySQL does not time out?

To reduce the number of concurrent fetches, you can add the following property to your nutch-site.xml and adjust the value as needed. Please do not edit nutch-default.xml; instead, copy the property into nutch-site.xml and manage the value there:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

Regarding the timeout problem, you can add this property to your nutch-site.xml with whatever timeout value you think you need:

<property>
  <name>http.timeout</name>
  <value>240000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
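As the exception message itself suggests, you can also attack this from the MySQL side: enable Connector/J's autoReconnect and/or raise the server's wait_timeout. A sketch of the relevant gora.properties entries for Nutch 2.1's Gora SQL store follows; the host, database name, user, and password are placeholders, and the exact property keys may differ slightly between Gora versions:

```properties
# gora.properties -- JDBC settings for the Gora SQL store (Nutch 2.1)
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
# autoReconnect=true asks Connector/J to re-open the connection after
# the server drops it on wait_timeout (host/db below are placeholders)
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?autoReconnect=true
gora.sqlstore.jdbc.user=nutch_user
gora.sqlstore.jdbc.password=secret
```

Alternatively (or additionally), a DBA can raise the server-side idle timeout, e.g. `SET GLOBAL wait_timeout = 28800;`. Note that the error shows the connection had been idle for over 10 hours, so autoReconnect is usually the more robust fix for long fetch cycles.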

> Is there a setting in Nutch where I can say: fetch only 100 or 500 URLs, then parse and store to MySQL, and then fetch the next 100 or 500 URLs?

Nutch crawls in a loop: generate/fetch/parse/update, repeated for the number of iterations (the "depth") you specify in the crawl command. If you want finer control over your crawl, you can run each step individually as described in section 3.2 ("Using Individual Commands for Whole-Web Crawling") of the tutorial at http://wiki.apache.org/nutch/NutchTutorial. That will give you a good sense of exactly what is happening. Check the status while fetching each segment so you know how many URLs were fetched in each one.
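As a rough illustration, one such controlled iteration with the Nutch 2.x command-line tools might look like the following, using -topN to cap how many URLs are selected (and therefore fetched and stored) per cycle. The seed directory is a placeholder, and the exact flags can vary between 2.x releases, so check `bin/nutch <command>` usage output for your version:

```shell
# Inject seed URLs once ("urls/" is a placeholder directory of seed lists)
bin/nutch inject urls

# One generate/fetch/parse/update cycle, capped at 500 URLs
bin/nutch generate -topN 500   # select at most 500 URLs into a new batch
bin/nutch fetch -all           # fetch the generated batch
bin/nutch parse -all           # parse what was fetched
bin/nutch updatedb             # write results back to the MySQL store

# Repeat the four commands above for the next batch of 500 URLs
```

Keeping each batch small this way shortens the time the JDBC connection sits idle between flushes, which is exactly what triggers the wait_timeout error above.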

Answered 2013-04-08T19:07:16.733