java - Get all links from a web page

Question

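I want to get all the links on a page after performing a GET. My code works for some websites but not for others: when debugging, it shows that no match was found and it never enters the while loop, even though the site does contain links.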

    Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([\"'>]+)[\"']?[^>]*>(.+?)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher pageMatcher = linkPattern.matcher(Content);

    if (FindKeyword(Content)) {
        LinksWithKey.add(HostName);
    }
    count++;

    while (pageMatcher.find()) {

score 0 · Accepted Answer

As mentioned in the comments, you should consider using JSoup for a task like this.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(Content); // this is your original HTML content
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("href"));
}
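If adding a dependency is not an option, it may help to look at the pattern from the question again: the capture group `([\"'>]+)` only accepts quote and `>` characters, so it cannot hold a URL, and it was possibly meant to be the negated class `([^\"'>]+)`. Below is a minimal sketch using only `java.util.regex`. The class name `LinkExtractor`, the method `extractLinks`, and the simplified pattern are my own assumptions, not code from the question. Regex-based HTML parsing remains fragile (comments, scripts, and unusual attribute layouts can all trip it up), so treat this as an illustration rather than a replacement for a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Hypothetical helper, not from the question.
    // Group 1 uses a negated class, so it captures everything up to the
    // closing quote, whitespace, or '>' and works for quoted and unquoted hrefs.
    private static final Pattern LINK_PATTERN = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']?([^\"'\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in document order (relative URLs are not resolved).
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_PATTERN.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"https://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A> <a href=c.html>C</a></p>";
        System.out.println(extractLinks(html)); // prints [https://example.com/a, /b, c.html]
    }
}
```

The pattern stops at the href value instead of also trying to match the link text, which keeps it from depending on what follows the opening tag.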
