mongodb - Getting the HTTP status using crawler4j & Jsoup

Question

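I'm building a Groovy & Grails application with MongoDB as the backend. I use crawler4j for crawling and JSoup for parsing. I need to get the HTTP status of each URL and save it to the database. This is what I'm trying: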

@Override
void visit(Page page) {
    try {
        Document doc = Jsoup.connect(url).get();
        Connection.Response response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
        int statusCode = response.statusCode();
        println "statuscode is " + statusCode
        if (statusCode == 200)
            urlExists = true    //urlExists is a boolean variable
        else
            urlExists = false
        //save to database
        resource = new Resource(mimeType : "text/html", URLExists: urlExists)
        if (!resource.save(flush: true, failOnError: true)) {
            resource.errors.each { println it }
        }
        //other code
    } catch (Exception e) {
        log.error "Exception is ${e.message}"
    }
}

@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()
        }
        else {
            println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}

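The problem is that as soon as I hit a URL whose HTTP status is not 200 (OK), crawler4j goes straight to the handlePageStatusCode() method (this is built-in crawler4j behavior) and prints the non-success message, but nothing is saved to the database. Is there any way to save to the database when the page status is not 200? If I'm doing something wrong, please let me know. Thanks.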

score 0 · Accepted Answer

When it drops into handlePageStatusCode, why not save to the database there?

    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        if (statusCode != HttpStatus.SC_OK) {
            if (statusCode == HttpStatus.SC_NOT_FOUND) {
                println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()

                //save to database
            }
            else {
                println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
            }
        }
    }

It will then move on to the next link, and you can do the same there.
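As a rough illustration of saving from the status hook, here is a minimal sketch in plain Java (which Groovy can call directly). StatusRecord, ResourceStore, InMemoryStore and StatusRecorder are hypothetical names, not part of crawler4j or Grails; ResourceStore stands in for whatever wraps resource.save(...) in the real application. The idea is to record every URL together with its status code, so that non-200 pages get persisted as well:

```java
import java.util.ArrayList;
import java.util.List;

/** One row to persist: the URL, its HTTP status, and whether it answered with 200. */
final class StatusRecord {
    final String url;
    final int statusCode;
    final boolean urlExists;

    StatusRecord(String url, int statusCode) {
        this.url = url;
        this.statusCode = statusCode;
        // Same rule as in the question: only 200 counts as "exists".
        this.urlExists = (statusCode == 200);
    }
}

/** Hypothetical persistence hook; in the Grails app this would wrap the domain save. */
interface ResourceStore {
    void save(StatusRecord record);
}

/** In-memory stand-in for the database, used here only for demonstration. */
final class InMemoryStore implements ResourceStore {
    final List<StatusRecord> saved = new ArrayList<>();

    @Override
    public void save(StatusRecord record) {
        saved.add(record);
    }
}

/** Builds a record for any status code and hands it to the store. */
final class StatusRecorder {
    private final ResourceStore store;

    StatusRecorder(ResourceStore store) {
        this.store = store;
    }

    StatusRecord record(String url, int statusCode) {
        StatusRecord r = new StatusRecord(url, statusCode);
        store.save(r); // saved whether or not the status is 200
        return r;
    }
}
```

Inside handlePageStatusCode one would then call something like recorder.record(webUrl.getURL(), statusCode) next to the println calls, assuming a recorder instance is reachable from the crawler.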

Or you could save first:

    if (statusCode == 200)
        urlExists = true    //urlExists is a boolean variable
    else {
        //save to database
        urlExists = false
    }

Edit

Add webUrl.getURL() to an ArrayList, then save it to the database at the end.
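A minimal sketch of that buffering idea, again in plain Java. FailedUrlBuffer and the saver callback are hypothetical names; the callback stands in for the real database write. The methods are synchronized on the assumption that the buffer is shared, since crawler4j can run several crawler threads at once:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects URLs with a non-200 status so they can be written in one batch. */
final class FailedUrlBuffer {
    private final List<String> urls = new ArrayList<>();

    /** Called for each failing URL, e.g. with webUrl.getURL(). */
    synchronized void add(String url) {
        urls.add(url);
    }

    /**
     * Hands a copy of the buffered URLs to the saver and clears the buffer.
     * If the saver throws, the buffer is left untouched so nothing is lost.
     * Returns the number of URLs that were flushed.
     */
    synchronized int flush(Consumer<List<String>> saver) {
        if (urls.isEmpty()) {
            return 0;
        }
        List<String> batch = new ArrayList<>(urls);
        saver.accept(batch);
        urls.clear();
        return batch.size();
    }
}
```

crawler4j's WebCrawler has an onBeforeExit() hook that looks like a natural place to call flush, though that wiring is an assumption and is not shown here.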
