java - Fetch all hyperlinks from a webpage and do so recursively in Java

Question

1. Fetch all content from a webpage.
2. Fetch the hyperlinks from the webpage.
3. Repeat steps 1 and 2 for each fetched hyperlink.
4. Repeat the process until 200 hyperlinks are registered or there are no more hyperlinks to fetch.

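I wrote a sample program, but because of my poor understanding of recursion, my loop became an infinite loop. Please suggest how to fix the code so that it works as expected.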

import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class Content
{
    private static final String HTML_A_HREF_TAG_PATTERN = 
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
    Pattern pattern;
    public Content ()
    {
        pattern = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
    }

    private void fetchContentFromURL(String strLink) {
        String content = null;
        URLConnection connection = null;
        try {
          connection =  new URL(strLink).openConnection();
          Scanner scanner = new Scanner(connection.getInputStream());
          scanner.useDelimiter("\\Z");
          content = scanner.next();
        }catch ( Exception ex ) {
            ex.printStackTrace();
            return;
        }
        fetchURL(content);
    }

    private void fetchURL ( String content )
    {
        Matcher matcher = pattern.matcher( content );
        while(matcher.find()) {
            String group = matcher.group();
            if(group.toLowerCase().contains( "http" ) || group.toLowerCase().contains( "https" )) {
            group = group.substring( group.indexOf( "=" )+1 );
            group = group.replaceAll( "'", "" );
            group = group.replaceAll( "\"", "" );
            System.out.println("lINK "+group);
            fetchContentFromURL(group);
            }
        }
        System.out.println("DONE");
    }

    /**
     * @param args
     */
    public static void main ( String[] args )
    {

        new Content().fetchContentFromURL( "http://www.google.co.in" );
    }

}

I am also open to any other solution, but I want to stick with the core Java API rather than third-party libraries.

score 2 · Accepted Answer

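One possible option here is to remember all visited links to avoid cyclic paths. Here is how to achieve that with an additional Set that stores the visited links: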

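import java.net.URL;
import java.net.URLConnection;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Content {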
private static final String HTML_A_HREF_TAG_PATTERN =
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private Pattern pattern;
private Set<String> visitedUrls = new HashSet<String>();

public Content() {
    pattern = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
}

private void fetchContentFromURL(String strLink) {
    String content = null;
    URLConnection connection = null;
    try {
        connection = new URL(strLink).openConnection();
        Scanner scanner = new Scanner(connection.getInputStream());
        scanner.useDelimiter("\\Z");
        if (scanner.hasNext()) {
            content = scanner.next();
            visitedUrls.add(strLink);
            fetchURL(content);
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private void fetchURL(String content) {
    Matcher matcher = pattern.matcher(content);
    while (matcher.find()) {
        String group = matcher.group();
        if (group.toLowerCase().contains("http") || group.toLowerCase().contains("https")) {
            group = group.substring(group.indexOf("=") + 1);
            group = group.replaceAll("'", "");
            group = group.replaceAll("\"", "");
            System.out.println("lINK " + group);
            if (!visitedUrls.contains(group) && visitedUrls.size() < 200) {
                fetchContentFromURL(group);
            }
        }
    }
    System.out.println("DONE");
}

/**
 * @param args
 */
public static void main(String[] args) {
    new Content().fetchContentFromURL("http://www.google.co.in");
}

}

I also fixed a few other issues in the fetching logic, and it now works as expected.
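As a side note on the link extraction itself: the code above strips the quotes by hand and accepts any match that contains "http" anywhere. A rough sketch of an alternative, using capture groups so that only the attribute value is read, could look like this. The class name HrefExtractSketch, the method absoluteLinks and the example.com-style URLs are made up for illustration, and a regex is only an approximation of real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractSketch {
    // Three alternatives: double-quoted, single-quoted, or unquoted attribute value.
    // Each alternative captures only the value, without the surrounding quotes.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^'\">\\s]+))",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the href values that start with http:// or https://.
    static List<String> absoluteLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            String value = m.group(1) != null ? m.group(1)
                         : m.group(2) != null ? m.group(2)
                         : m.group(3);
            String lower = value.toLowerCase(Locale.ROOT);
            if (lower.startsWith("http://") || lower.startsWith("https://")) {
                links.add(value);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/x\">a</a>"
                    + "<a HREF='https://b.example'>b</a>"
                    + "<a href=/local/http-notes>c</a>";
        // The relative link is skipped even though it contains "http".
        System.out.println(absoluteLinks(html)); // [http://a.example/x, https://b.example]
    }
}
```

Checking the prefix instead of using contains("http") keeps relative paths that merely mention "http" out of the crawl.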

score 1 · Accepted Answer

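In the fetchContentFromURL method, you should record the URL that is currently being fetched, and skip it if it has already been fetched. Otherwise, two pages A and B that link to each other will cause your code to keep fetching them forever.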
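To make the A/B case concrete, here is a minimal sketch that replaces the network with a hypothetical in-memory map of pages (the LINKS map and the class name VisitedGuardDemo are invented for the example, and Map.of needs Java 9 or later). The important detail is that a page is recorded as visited before its links are followed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VisitedGuardDemo {
    // Hypothetical in-memory "web": page name -> pages it links to.
    // A and B link to each other, which is the cycle described above.
    static final Map<String, List<String>> LINKS = Map.of(
            "A", List.of("B"),
            "B", List.of("A", "C"),
            "C", List.of());

    // Visits every reachable page once and returns the pages in visit order.
    static List<String> crawl(String start) {
        List<String> order = new ArrayList<>();
        visit(start, new HashSet<>(), order);
        return order;
    }

    private static void visit(String page, Set<String> visited, List<String> order) {
        // Set.add returns false if the page was already recorded,
        // so a link back to an earlier page ends the recursion here.
        if (!visited.add(page)) {
            return;
        }
        order.add(page);
        for (String next : LINKS.getOrDefault(page, List.of())) {
            visit(next, visited, order);
        }
    }

    public static void main(String[] args) {
        System.out.println(crawl("A")); // [A, B, C]
    }
}
```

Without the visited check, visit("A") would call visit("B"), which would call visit("A") again, and so on until the stack overflows.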

score 1 · Accepted Answer

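In addition to JK1's answer, to achieve goal 4 of the question you may want to keep the hyperlink count as an instance variable. A rough pseudocode could be the following (you can adjust the exact count; you can also use the size of the HashSet to find out how many hyperlinks your program has parsed so far):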

if (!visitedUrls.contains(group) && noOfHyperlinksVisited++ < 200) {
            fetchContentFromURL(group);
}

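However, I am not sure whether you want 200 hyperlinks in total or want to traverse to a depth of 200 links from the starting page. If it is the latter, you may want to explore breadth-first search, which lets you know when you have reached the target depth.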
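For the depth interpretation, a breadth-first version might look roughly like the sketch below. It is only an illustration: the class name BfsCrawlSketch, the parameters maxLinks and maxDepth, and the linksOf callback (which stands in for "fetch the page and extract its links") are all assumptions, and the demo uses a small in-memory map instead of real pages.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

class BfsCrawlSketch {
    // Returns every recorded URL mapped to its depth from the start page.
    // maxLinks caps the number of recorded URLs (the start page counts as one);
    // pages at maxDepth are recorded but their links are not followed.
    static Map<String, Integer> crawl(String start, int maxLinks, int maxDepth,
                                      Function<String, List<String>> linksOf) {
        Map<String, Integer> depthOf = new LinkedHashMap<>(); // doubles as the visited set
        Queue<String> queue = new ArrayDeque<>();
        depthOf.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            if (depth == maxDepth) {
                continue; // target depth reached, do not expand this page
            }
            for (String next : linksOf.apply(url)) {
                if (depthOf.size() >= maxLinks) {
                    return depthOf; // total link limit reached
                }
                if (!depthOf.containsKey(next)) {
                    depthOf.put(next, depth + 1);
                    queue.add(next);
                }
            }
        }
        return depthOf;
    }

    public static void main(String[] args) {
        // Hypothetical link graph used in place of real HTTP requests.
        Map<String, List<String>> web = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("A", "D"),
                "C", List.of("D"),
                "D", List.of("E"),
                "E", List.of());
        // Stop at depth 2, with the 200-link cap from the question.
        System.out.println(crawl("A", 200, 2, u -> web.getOrDefault(u, List.of())));
        // {A=0, B=1, C=1, D=2}
    }
}
```

Because the queue processes pages level by level, the depth stored for each URL is its shortest distance from the start page, and the same method covers the "200 links in total" reading by setting maxDepth high and relying on maxLinks.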
