
I have a method that takes a URL and finds all the links on that page. However, I am not sure it is extracting only links, because some of the results look strange when I check whether they work. For example, when I check the links at www.google.com I get 6 broken links that return no HTTP status code; instead, the error says there is 'no protocol' for the link. I wouldn't expect Google to have any broken links on its homepage. One example of a broken link is /preferences?hl=en, and I can't see where this link appears on the Google homepage. Am I checking just links, or is it possible that I am extracting something that is not supposed to be a link?

Here is the method that checks the URL for links:

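import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static List<String> getLinks(String uriStr) {

    List<String> result = new ArrayList<String>();
    try {
        System.out.println("in the getlinks try");
        // Create a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());

        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(rd, doc, 0);

        // Find all the A elements in the HTML document
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
        while (it.isValid()) {
            SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();

            // The HREF value is returned exactly as written in the page
            String link = (String) s.getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                // Add the link to the result list
                System.out.println(link);
                result.add(link);
            }
            it.next();
        }
    } catch (Exception e) {
        // Assumed: the original post is cut off after the try block,
        // so this catch and the return below are a guess at the rest
        e.printStackTrace();
    }
    return result;
}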

1 Answer


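There is nothing wrong with the links you are getting back.

Looking at your code, you are extracting the href attribute, which in your example comes from this element: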

<a  class=gbmt href="/preferences?hl=en">Search settings</a>

(You can see this link by clicking "Settings" in the bottom right corner; a list with several links should pop up.)

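As you can see, the href attribute contains only /preferences?hl=en, which makes it a relative link. The full URL is the href resolved against the address of the page you are currently on. In this case: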

http://www.google.com/preferences?hl=en

If a URL is relative, you just need to adjust your code to combine it with the method's argument (uriStr).
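One way to do that adjustment is sketched below. It is only an illustration, not the asker's actual code: the class name LinkResolver and the helper toAbsolute are hypothetical, and it assumes the hrefs have already been collected into a list of strings. It relies on java.net.URI.resolve, which leaves absolute links unchanged and resolves relative ones against the base address.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkResolver {

    // Hypothetical helper: turns each href into an absolute URL by
    // resolving it against the address of the page it was found on.
    static List<String> toAbsolute(String pageUri, List<String> hrefs) {
        URI base = URI.create(pageUri);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            try {
                // Absolute hrefs are returned unchanged; relative ones
                // inherit the scheme and host of the base URI.
                absolute.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // The href is not a syntactically valid URI; skip it here.
            }
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "/preferences?hl=en",        // relative link from the question
                "https://mail.google.com/"); // already absolute
        for (String link : toAbsolute("http://www.google.com/", hrefs)) {
            System.out.println(link);
        }
    }
}
```

With the resolved strings, a later call such as new URL(link) has a scheme to work with, so the 'no protocol' error should no longer appear for relative hrefs. Non-HTTP hrefs such as javascript: or mailto: links would still need to be filtered out separately before checking status codes.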

answered 2013-04-20T14:30:02.820