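I am trying to extract URL strings from the HTTP response of a given URL, which contains links with HREF attributes. I can find where each link starts, but I need to cut the string off right where the HREF value ends. How can this be done?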
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    public class Extracturl {
        public static void main(String[] args) throws IOException {
            String line;
            InputStream is = null;
            BufferedWriter writer = null;
            try {
                String u = "http://en.wikipedia.org/wiki/china";
                String fileName = "e:\\test.txt";
                writer = new BufferedWriter(new FileWriter(fileName, true));
                URL url = new URL(u);
                is = url.openStream(); // throws an IOException
                BufferedReader reader = new BufferedReader(new InputStreamReader(is));
                String w = ""; // links seen so far, used to skip duplicates
                while ((line = reader.readLine()) != null) {
                    if (line.contains("href=\"/wiki") && line.contains("\" />") && !line.contains("File")) {
                        // Runs from the start of the href to the END OF THE LINE,
                        // not to the end of the href value: this is the part I need to fix.
                        String link = line.substring(line.indexOf("href=\"/"));
                        if (!w.contains(link)) {
                            w = w + link;
                            System.out.println(link);
                            writer.write(link);
                            writer.newLine();
                        }
                    }
                }
            } catch (MalformedURLException mue) {
                mue.printStackTrace();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            } finally {
                try {
                    if (is != null) is.close();
                    if (writer != null) writer.close(); // flushes the buffered output
                } catch (IOException ioe) {
                    // nothing to see here
                }
            }
        }
    }
I even tried

    w=w+line.substring(line.indexOf("href=\"/"),line.indexOf("\">"));

but this gives me an error.
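One plausible cause of that error, offered as a guess since the message itself is not shown: `line.indexOf("\">")` searches from the beginning of the line, so it can return -1 or a position that lies before the href, and `substring` then throws a `StringIndexOutOfBoundsException`. A minimal sketch of the alternative is to look for the closing quote only after the opening one, using the two-argument `indexOf`. The class name `HrefSubstringSketch`, the helper `extractHref`, and the sample line are hypothetical and not part of the original program.

```java
class HrefSubstringSketch {
    // Hypothetical helper: returns the value of the first href="/..." attribute
    // on the line, or null when the line has no such attribute.
    static String extractHref(String line) {
        String marker = "href=\"";
        int start = line.indexOf(marker + "/");
        if (start < 0) {
            return null; // no relative href on this line
        }
        int valueStart = start + marker.length();
        // Search for the closing quote starting AFTER the opening quote,
        // so the end index can never fall before the start index.
        int valueEnd = line.indexOf('"', valueStart);
        if (valueEnd < 0) {
            return null; // attribute is not closed on this line
        }
        return line.substring(valueStart, valueEnd);
    }

    public static void main(String[] args) {
        // Assumed sample input, made up for illustration.
        String line = "<li><a href=\"/wiki/Beijing\" title=\"Beijing\">Beijing</a></li>";
        System.out.println(extractHref(line)); // prints /wiki/Beijing
    }
}
```

This only picks up the first matching href on a line; lines that carry several links need a loop or a different approach.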
My goal is to get all the URLs that are linked from the page.
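For collecting every link rather than one per line, a regular expression is one possible approach. The sketch below assumes the href values are always wrapped in double quotes; the class name `HrefRegexSketch`, the method `extractLinks`, and the sample HTML are hypothetical. A `LinkedHashSet` is used here so duplicates are dropped while the order of first appearance is kept.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegexSketch {
    // Captures whatever sits between href=" and the next double quote.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: returns each distinct href value found in the text.
    static Set<String> extractLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the attribute value only
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample input, made up for illustration.
        String html = "<a href=\"/wiki/Asia\">Asia</a> <a href=\"/wiki/Beijing\">Beijing</a>"
                + " <a href=\"/wiki/Asia\">Asia again</a>";
        System.out.println(extractLinks(html)); // prints [/wiki/Asia, /wiki/Beijing]
    }
}
```

Regular expressions are fragile on real-world HTML (single-quoted or unquoted attributes, attributes split across lines), so a proper HTML parser would likely be more robust if the pages vary.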