I'm building a Groovy & Grails application with MongoDB as the backend. I use crawler4j for crawling and JSoup for parsing. I need to get the HTTP status of each URL and save it to the database. This is what I'm trying:
@Override
void visit(Page page) {
    try {
        String url = page.getWebURL().getURL()
        // Fetch once and reuse the response for both the status code and the parsed document
        Connection.Response response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute()
        Document doc = response.parse()

        int statusCode = response.statusCode()
        println "statuscode is " + statusCode
        urlExists = (statusCode == 200) // urlExists is a boolean field

        // save to database
        resource = new Resource(mimeType: "text/html", URLExists: urlExists)
        // note: with failOnError: true a failed save throws rather than returning false,
        // so this branch only runs if failOnError is dropped
        if (!resource.save(flush: true, failOnError: true)) {
            resource.errors.each { println it }
        }
        // other code
    } catch (Exception e) {
        log.error "Exception is ${e.message}"
    }
}
@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()
        } else {
            println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}
The problem is that as soon as the crawler hits a URL whose HTTP status is not 200 (OK), it goes straight to the handlePageStatusCode() method (that is crawler4j's built-in behaviour) and prints the non-success message, but visit() is never reached for that page, so nothing gets saved to the database. Is there any way to save to the database when the page status is not 200? Please also tell me if I'm doing something wrong. Thanks.
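To make the question concrete: the only place I can think of to do the save is inside the overridden handlePageStatusCode() itself. Below is a minimal sketch of what I mean, reusing the Resource domain class from above and assuming a plain GORM save works from the crawler thread the same way it does in visit():

@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode
        // visit() is not invoked for non-200 pages, so persist the dead link here.
        // Resource is the same GORM domain class used in visit(); URLExists = false marks it broken.
        def resource = new Resource(mimeType: "text/html", URLExists: false)
        if (!resource.save(flush: true)) {
            resource.errors.each { println it }
        }
    }
}

Would this be an acceptable place to persist the record, or is there a cleaner hook in crawler4j for handling non-200 pages?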