java - Extract links from a web page in core Java using indexOf/substring vs. pattern matching

Question

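I am trying to get the links in a web page using core Java. I am following the code below, given in "Extract links from a web page", with some modifications.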

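        // url, is, br and line are assumed to be declared earlier
        // (URL, InputStream, BufferedReader and String respectively)
        try {
            url = new URL("http://www.stackoverflow.com");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                if (line.contains("href="))
                    System.out.println(line.trim());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }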

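As for extracting each link, most of the answers in that post suggest pattern matching. As far as I understand, however, pattern matching is an expensive operation. So I want to use indexOf and substring operations to get the link text from each line, as below: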

   private static Set<String> getUrls(String line, int firstIndexOfHref) {
        final int prefixLength = "href=\"".length(); // 6: skip past href="
        int startIndex = firstIndexOfHref;
        Set<String> urls = new HashSet<>();

        while (startIndex != -1) {
            int valueStart = startIndex + prefixLength;
            int endIndex = line.indexOf('"', valueStart);
            if (endIndex == -1) {
                // No closing quote on this line: stop here. Catching the
                // substring exception instead would loop forever.
                break;
            }
            urls.add(line.substring(valueStart, endIndex));
            startIndex = line.indexOf("href=\"http", endIndex);
        }

        return urls;
    }

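I tried this on a few pages and it works fine. But I am not sure whether this approach will always work. I want to know whether this logic could fail in some real-world scenarios.

Please help.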

score 1 · Accepted Answer

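Your code relies on well-formed HTML that fits on a single line. It will not handle the various other ways an <a href can be written, such as single quotes, no quotes, or extra whitespace (including newlines) between "a", "href" and "=". It also misses relative paths and other protocols such as file: or ftp:.

Some examples you need to consider: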

<a href 
   =/questions/63090090/extract-links-from-a-web-page-in-core-java-using-indexof-substring-vs-pattern-m

or

<a href = 'http://host'

or

<a 
href = 'http://host'

This is why the other question has so many answers, including HTML validators and regular expression patterns.
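As a rough illustration of the regular expression route, here is a minimal sketch that accepts the variations listed above: double quotes, single quotes, no quotes, and whitespace or newlines around "href" and "=". The class name HrefExtractor and the method extractHrefs are made up for this example, and the pattern is only one possible way to write it. It assumes the whole page is passed in as one string rather than line by line, since a tag may span several lines.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the question or from any library.
final class HrefExtractor {
    // Compiled once and reused, so the pattern is not recompiled per call.
    // Group 1: double-quoted value, group 2: single-quoted, group 3: unquoted.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values of all <a> tags in document order, without duplicates.
    static Set<String> extractHrefs(String html) {
        Set<String> hrefs = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    hrefs.add(m.group(g));
                    break;
                }
            }
        }
        return hrefs;
    }
}
```

To use it with the reading loop from the question, append each line plus a newline to a StringBuilder and call HrefExtractor.extractHrefs on the result once the stream is exhausted. The returned values are raw attribute text, so relative paths come back as written and still need resolving against the page URL. This remains pattern matching rather than HTML parsing: markup inside comments or scripts is matched as well.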
