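I am trying to crawl 300,000 URLs. However, somewhere in the middle, the code hangs while trying to retrieve the response code from a URL. I am not sure what is going wrong, since the connection is established and the problem only appears afterwards. As suggested, I have modified the code to set a read timeout and a request property. Even so, the code still fails to get the response code! Any suggestions or pointers would be greatly appreciated. Also, is there a way to ping a website for a fixed period of time and, if it does not respond, move on to the next one?
Here is my modified code snippet: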
URL url=null;
try
{
Thread.sleep(8000);
}
catch (InterruptedException e1)
{
e1.printStackTrace();
}
try
{
//urlToBeCrawled comes from the database
url=new URL(urlToBeCrawled);
}
catch (MalformedURLException e)
{
e.printStackTrace();
//The code is in a loop, hence the use of continue. I apologize for putting code in the catch block.
continue;
}
HttpURLConnection huc=null;
try
{
huc = (HttpURLConnection)url.openConnection();
}
catch (IOException e)
{
e.printStackTrace();
//huc is still null if openConnection() failed, so skip this URL
continue;
}
try
{
//Added the request property
huc.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
huc.setRequestMethod("HEAD");
}
catch (ProtocolException e)
{
e.printStackTrace();
}
//Set both timeouts before connecting so they are already in place when the connection is opened
huc.setConnectTimeout(1000);
huc.setReadTimeout(15000);
try
{
huc.connect();
}
catch (IOException e)
{
e.printStackTrace();
continue;
}
int responseCode=0;
try
{
//Code hangs here for some URL which is random in each run
responseCode = huc.getResponseCode();
}
catch (IOException e)
{
huc.disconnect();
e.printStackTrace();
continue;
}
if (responseCode!=200)
{
huc.disconnect();
continue;
}
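
On the last point (giving up on a site after a fixed time and moving on), one possible approach is to run each check on a worker thread and wait for it with `Future.get(timeout, unit)`, so the loop continues even if a single request stalls. The following is only a minimal sketch under assumptions: `DeadlineCheck`, `runWithDeadline`, and `headRequest` are hypothetical names that are not part of the code above, the 20-second deadline is an arbitrary value, and it has not been run against the crawl described here.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineCheck {

    // Hypothetical helper: runs the task on the pool and returns its result,
    // or -1 if the task fails or does not finish within deadlineMillis.
    static int runWithDeadline(ExecutorService pool, Callable<Integer> task, long deadlineMillis) {
        Future<Integer> future = pool.submit(task);
        try {
            return future.get(deadlineMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a thread blocked in socket I/O may not react to this.
            future.cancel(true);
            return -1;
        } catch (ExecutionException e) {
            // The task itself threw (for example an IOException from the request).
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    // Hypothetical task: a HEAD request with both timeouts set before the connection is used.
    static Callable<Integer> headRequest(String address) {
        return () -> {
            HttpURLConnection huc = (HttpURLConnection) new URL(address).openConnection();
            huc.setRequestMethod("HEAD");
            huc.setConnectTimeout(1000);
            huc.setReadTimeout(15000);
            try {
                return huc.getResponseCode();
            } finally {
                huc.disconnect();
            }
        };
    }

    public static void main(String[] args) {
        // A cached pool is assumed here so that one stalled worker does not block later URLs.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (String address : args) {
                int code = runWithDeadline(pool, headRequest(address), 20000);
                System.out.println(address + " -> " + code);
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The deadline only stops the caller from waiting. A worker that is stuck inside a blocking read may linger until its own read timeout fires, which is why the sketch still sets both timeouts on the connection itself.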