apache - Nutch 2.0 fetches pages repeatedly when a job fails

Question

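I am using MySQL as the storage backend for Nutch.

The job fails when crawling certain sites. When it reaches this page, Nutch throws the following exception and exits: http://www.appchina.com/users.html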

Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
    at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

So I modified ./src/java/org/apache/nutch/util/NutchJob.java, changing if (getConfiguration().getBoolean("fail.on.job.failure", true)) { to if (getConfiguration().getBoolean("fail.on.job.failure", false)) {

After recompiling, I no longer get any exception; instead, the crawl restarts endlessly.
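To illustrate what that edit changes: the second argument of getBoolean is only the fallback used when the property is absent, so flipping it matters only if fail.on.job.failure is not set anywhere in the configuration. The sketch below uses a hypothetical stand-in built on java.util.Properties rather than Hadoop's real Configuration class, so the class and helper names are assumptions for illustration only.

```java
import java.util.Properties;

public class FailOnJobFailureDefault {

    // Hypothetical stand-in for Configuration.getBoolean(name, defaultValue):
    // returns the stored value if it is "true" or "false", else the default.
    static boolean getBoolean(Properties conf, String name, boolean defaultValue) {
        String value = conf.getProperty(name);
        if (value == null) {
            return defaultValue;
        }
        value = value.trim();
        if ("true".equalsIgnoreCase(value)) {
            return true;
        }
        if ("false".equalsIgnoreCase(value)) {
            return false;
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Property not set: the hard-coded default decides the outcome.
        System.out.println(getBoolean(conf, "fail.on.job.failure", true));
        System.out.println(getBoolean(conf, "fail.on.job.failure", false));

        // Property set explicitly: the default is ignored.
        conf.setProperty("fail.on.job.failure", "false");
        System.out.println(getBoolean(conf, "fail.on.job.failure", true));
    }
}
```

Under that reading, setting the property to false in the site configuration would likely have the same effect as the source change, without recompiling. Either way it only suppresses the exception; the failed parse job itself is not fixed.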

FetcherJob : timelimit set for : -1
FetcherJob: threads: 30
FetcherJob: parsing: false
FetcherJob: resuming: false
Using queue mode : byHost
Fetcher: threads: 30
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.appchina.com/
fetching http://www.appchina.com/users.html
-finishing thread FetcherThread0, activeThreads=29
-finishing thread FetcherThread29, activeThreads=28
...
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0.4 pages/s, 137 137 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:  false
ParserJob: parsing all
Parsing http://www.appchina.com/
Parsing http://www.appchina.com/users.html

Update: error from hadoop.log

2012-09-17 18:48:51,257 WARN  mapred.LocalJobRunner - job_local_0004
java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
        at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
        at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
        at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
        ... 6 more
Caused by: java.sql.SQLException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
        at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
        ... 8 more

Update again

I dropped the table that Gora created and created a similar table with a VARCHAR(128) id and utf8mb4 as the DEFAULT CHARSET. Now it works. Why?

Can anyone help?
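A possible explanation can be sketched by decoding the bytes quoted in the error message. They are valid UTF-8 for two Chinese characters, which a single-byte column charset cannot store. The question does not show which charset the Gora-created table used, so a single-byte default is an assumption here, and ISO-8859-1 below is only a stand-in for it. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class IncorrectStringValue {

    // Bytes copied from the error message: '\xE7\x94\xA8\xE6\x88\xB7...'
    static final byte[] REJECTED = {
        (byte) 0xE7, (byte) 0x94, (byte) 0xA8,
        (byte) 0xE6, (byte) 0x88, (byte) 0xB7
    };

    // Interpret the raw bytes as UTF-8 text.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True if every character of the text can be encoded in the given charset.
    static boolean representableIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = decodeUtf8(REJECTED);

        // Two characters, U+7528 and U+6237, each three bytes long in UTF-8.
        System.out.println(text.length());

        // Assumed single-byte charset (stand-in): cannot hold these characters.
        System.out.println(representableIn(text, StandardCharsets.ISO_8859_1));

        // A UTF-8 based charset can.
        System.out.println(representableIn(text, StandardCharsets.UTF_8));
    }
}
```

If that assumption holds, the parse job failed because the page text contained characters the column could not store, and recreating the table with a UTF-8 based charset such as utf8mb4 removed the failure.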

score 0 · Accepted Answer

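You need to add the Hadoop log for the parse job. The attached stack trace does not show that information. Did the parse succeed after you changed the code?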
