I have been working on a crawler that has to make 1000+ requests to one particular server. So far it has worked well, but now the async tasks are not completing. Here is my sample code.
private static CloseableHttpClient httpclient = HttpClients.createDefault();

private String getContent(TaskUrl taskUrl, String hostname, int port,
        Map<String, String> basicHeaders) {
    String uri = taskUrl.getUrl();
    HttpHost proxyHost = new HttpHost(hostname, port);
    RequestConfig.Builder reqconfigconbuilder = RequestConfig.custom();
    // in case proxies are slow to fetch data, the timeout can be increased to 7 sec; any longer than that might have a negative impact
    reqconfigconbuilder.setConnectionRequestTimeout(5000);
    reqconfigconbuilder.setConnectTimeout(5000);
    reqconfigconbuilder.setSocketTimeout(5000);
    reqconfigconbuilder = reqconfigconbuilder.setProxy(proxyHost);
    RequestConfig config = reqconfigconbuilder.build();
    HttpGet httpget = new HttpGet(uri);
    if (basicHeaders != null) {
        for (Map.Entry<String, String> entry : basicHeaders.entrySet()) {
            httpget.addHeader(entry.getKey(), entry.getValue());
        }
    }
    List<String> userAgentList = CommonConstant.getCustomUserAgent();
    int in = StringUtils.getRandomIntegerBetweenRange(0, userAgentList.size() - 1);
    httpget.addHeader("User-Agent", userAgentList.get(in));
    httpget.setConfig(config);
    logger.debug("Now executing ");
    try (CloseableHttpResponse response = httpclient.execute(httpget)) {
        logger.info("Status code for url : {} {} with port : {} with host : {}", uri,
                response.getStatusLine().getStatusCode(), port, hostname);
        if (response.getStatusLine().getStatusCode() == 200) {
            return EntityUtils.toString(response.getEntity());
        } else if (response.getStatusLine().getStatusCode() == 404) {
            return "404";
        }
        return null;
    } catch (IOException e) {
        logger.error("Error for url : {} {}", uri, e.getMessage());
        return null;
    }
}
It works fine for the first 150 to 200 URLs, but after a while I can see the log get stuck after "Now executing" and nothing happens. Sometimes the process only resumes after an hour. Can anyone help me with this? I don't know why it behaves this way; it should not stall before finishing the tasks. Any help would be appreciated.
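For reference, this is roughly how the tasks are submitted; it is a simplified, self-contained sketch, not my real code: the pool size, the stand-in getContent, and the per-task timeout on Future.get are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CrawlerSketch {

    // Stand-in for the real getContent(...); here it just echoes the url.
    private static String getContent(String url) {
        return "content-for-" + url;
    }

    // Submits n fetch tasks to a fixed thread pool and waits for each
    // result with a per-task timeout, so a stuck task fails fast instead
    // of hanging the whole run. Returns how many tasks completed.
    static int runCrawl(int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(20);
        List<Future<String>> futures = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String url = "http://example.com/page/" + i;
            futures.add(pool.submit(() -> getContent(url)));
        }
        int done = 0;
        for (Future<String> f : futures) {
            f.get(10, TimeUnit.SECONDS); // blocks until this task finishes
            done++;
        }
        pool.shutdown();
        return done;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed " + runCrawl(1000));
    }
}
```

In my actual run the futures for the later URLs never complete, which is where I see the hang.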