I have a method that takes in a URL and finds all the links on that page. However, I am concerned that it is not only picking up links, because when I check whether the links are working, some of them seem strange. For example, if I check the links at www.google.com I get 6 broken links that return no HTTP status code; instead I get a 'no protocol' error for each of them. I just wouldn't imagine Google would have any broken links on its homepage.

An example of one of the broken links is: /preferences?hl=en. I can't see where this link is on the Google homepage. Am I extracting only links, or is it possible I am extracting code that is not supposed to be a link?
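For reference, the 'no protocol' message is what java.net.URL produces for a string with no scheme. A minimal reproduction, assuming the check ultimately constructs a URL straight from the raw href (my actual checker code is not shown here):

import java.net.MalformedURLException;
import java.net.URL;

public class NoProtocolDemo {
    public static void main(String[] args) {
        try {
            // A relative href has no scheme, so URL cannot parse it
            new URL("/preferences?hl=en");
        } catch (MalformedURLException e) {
            // Prints: no protocol: /preferences?hl=en
            System.out.println(e.getMessage());
        }
    }
}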
Here is the method that extracts the links from a given URL:
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static List<String> getLinks(String uriStr) {
    List<String> result = new ArrayList<String>();
    try {
        // Open a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());

        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(rd, doc, 0);

        // Walk every A element in the parsed document
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
        while (it.isValid()) {
            SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();
            String link = (String) s.getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                // Record the raw href value exactly as it appears in the page
                System.out.println(link);
                result.add(link);
            }
            it.next();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
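My suspicion is that hrefs like /preferences?hl=en are relative links that a browser silently resolves against the page's own URL, which would explain why they work on the site but fail my check. A minimal sketch of resolving an extracted href against the page URL before checking it (resolveLink is a hypothetical helper, not part of my method above):

import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: resolve an extracted href against the URL of
// the page it came from, so a relative link becomes absolute.
public static String resolveLink(String pageUrl, String href) throws URISyntaxException {
    return new URI(pageUrl).resolve(href).toString();
}

For example, resolveLink("http://www.google.com", "/preferences?hl=en") would return http://www.google.com/preferences?hl=en, which could then be checked normally. Is this what is happening, or is my extraction picking up something that isn't a link at all?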