java - Why does crawling a website take forever?

Question

public class Parser {

    public static void main(String[] args) {
        Parser p = new Parser();
        p.matchString();
    }

    parserObject courseObject = new parserObject();
    ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
    ArrayList<String> courseNames = new ArrayList<String>();
    String theWebPage = " ";

    {
        try {
            URL theUrl = new URL("http://ocw.mit.edu/courses/");
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(theUrl.openStream()));
            String str = null;

            while((str = reader.readLine()) != null) {
                theWebPage = theWebPage + " " + str;
            }
            reader.close();

        } catch (MalformedURLException e) {
            // do nothing
        } catch (IOException e) {
            // do nothing
        }
    }

    public void matchString() {
        // this is my regex that I am using to compare strings on input page
        String matchRegex = "#\\w+(-\\w+)+";

        Pattern p = Pattern.compile(matchRegex);
        Matcher m = p.matcher(theWebPage);

        int i = 0;
        while (!m.hitEnd()) {
            try {
                System.out.println(m.group());
                courseNames.add(i, m.group());
                i++;
            } catch (IllegalStateException e) {
                // do nothing
            }
        }
    }
}

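What I am trying to do with the code above is get the list of departments from the MIT OpenCourseWare website. I am using a regular expression that matches the pattern of the department names as they appear in the page source, and I am using a Pattern object and a Matcher object to try to find() and print the department names that match the regex. But the code takes forever to run, and I don't think reading a web page with BufferedReader should take that long. So I think that either I am doing something wrong or parsing websites simply takes a very long time. I would appreciate any input on how to improve performance or correct an error in my code. I apologize for the badly written code.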

score 13 · Accepted Answer

The problem is in this code:

while ((str = reader.readLine()) != null)
    theWebPage = theWebPage + " " +str;

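The variable theWebPage is a String, which is immutable. For each line read, this code creates a new String containing a copy of everything read so far, with a space and the just-read line appended. That is a huge amount of unnecessary copying, and it is why the program runs so slowly.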

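I downloaded the web page in question. It has 55,000 lines and is about 3.25 MB in size. Not too big. But because of the copying in the loop, the program ends up performing about 1.5 billion line copies in total (1/2 of 55,000 squared), with the first line alone being copied about 55,000 times. The program spends all its time copying and garbage collecting. I ran this on my laptop (2.66 GHz Core2Duo, 1 GB heap) and it took 15 minutes to run when reading from a local file (no network latency or web-crawling countermeasures).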
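A rough back-of-the-envelope count behind the 1.5 billion figure, assuming n = 55,000 lines of roughly equal length and that iteration k has to copy the k - 1 lines accumulated so far:

```latex
\[
\sum_{k=1}^{n} (k-1) \;=\; \frac{n(n-1)}{2}
  \;=\; \frac{55{,}000 \times 54{,}999}{2}
  \;\approx\; 1.5 \times 10^{9} \ \text{line copies}
\]
```

The cost grows with the square of the number of lines, so doubling the size of the page roughly quadruples the copying work.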

To fix this, change theWebPage to a StringBuilder, and change the line in the loop to

    theWebPage.append(" ").append(str);

If you like, you can convert theWebPage to a String after the loop using toString(). When I ran the modified version, it took a fraction of a second.
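For illustration, the whole read loop with that change applied might look like the sketch below. The class name PageReader and the method readAll are made up for this example, and a StringReader stands in for the network stream so the sketch runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PageReader {

    // Reads every line from the given reader into one String, prefixing
    // each line with a single space, much like the question's loop.
    // (PageReader and readAll are hypothetical names for this sketch.)
    static String readAll(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // append() adds to a growable buffer rather than copying
                // everything read so far into a brand-new String.
                page.append(' ').append(line);
            }
        }
        // Convert once, after the loop.
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the network stream here.
        String page = readAll(new StringReader("first line\nsecond line"));
        System.out.println(page);
    }
}
```

Taking a Reader rather than a URL keeps the loop easy to try against a local file or a string before pointing it at the real site.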

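Incidentally, your code uses a bare code block { } inside the class. This is an instance initializer (as opposed to a static initializer). It runs at object construction time. This is legal, but quite unusual. Notice that it misled other commenters. I would recommend converting this block into a named method.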
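To make the distinction between the two kinds of initializer concrete, here is a small sketch that records when each block runs. The class InitOrderDemo and its LOG field are invented for this illustration and are not part of the question's code.

```java
class InitOrderDemo {

    // Records the order in which the blocks run. Declared before the
    // static block so it is already initialized when that block executes.
    static final StringBuilder LOG = new StringBuilder();

    // Static initializer: runs once, when the class is initialized.
    static {
        LOG.append("static;");
    }

    // Instance initializer: runs on every construction, before the
    // constructor body.
    {
        LOG.append("instance;");
    }

    InitOrderDemo() {
        LOG.append("constructor;");
    }

    public static void main(String[] args) {
        new InitOrderDemo();
        new InitOrderDemo();
        // prints: static;instance;constructor;instance;constructor;
        System.out.println(LOG);
    }
}
```

In the question's class this means the page download happens inside new Parser(), before matchString() is ever called, which is easy to miss when reading the code.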

score 2 · Accepted Answer

Is this your entire program? Where is the declaration of parserObject?

Also, shouldn't all of this code be in your main(), before the call to matchString()?

parserObject courseObject = new parserObject();
ArrayList<parserObject>  courseObjects = new ArrayList<parserObject>();
ArrayList<String> courseNames = new ArrayList<String>();
String theWebPage=" ";
{

    try {
            URL theUrl = new URL("http://ocw.mit.edu/courses/");
            BufferedReader reader = new BufferedReader(new InputStreamReader(theUrl.openStream()));
            String str = null;

            while((str = reader.readLine())!=null)
            {
                theWebPage = theWebPage+" "+str;
            }
            reader.close();

    } catch (MalformedURLException e) {

    } catch (IOException e) {

    }
}

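You are also catching exceptions without displaying any error message. When you hit an exception, you should always display an error message and do something about it. For example, if you cannot download the page, there is no reason to try to parse an empty string.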
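The same point seems to apply to the empty catch in matchString(): Matcher.group() throws IllegalStateException when no match has been found yet, and as far as I can tell the loop there never calls find(), so the swallowed exception hides that nothing is being matched. A minimal sketch of the usual find()/group() loop follows; the class name AnchorMatcher, the method findAnchors, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorMatcher {

    // Same pattern as in the question: '#' followed by hyphenated words.
    private static final Pattern ANCHOR = Pattern.compile("#\\w+(-\\w+)+");

    // Collects every match in the text. find() advances to the next
    // match and returns false when none are left, so group() is only
    // called when a match actually exists.
    static List<String> findAnchors(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = ANCHOR.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the real page.
        String sample = "<a href=\"#civil-engineering\"> <a href=\"#math\">";
        // "#math" has no hyphen, so the pattern does not match it.
        System.out.println(findAnchors(sample));
    }
}
```

Because the loop condition is the find() call itself, no try/catch is needed and the loop ends when the input is exhausted.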

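From your comments, I learned about static blocks in classes (thanks, I didn't know about them). However, from what I have read, you need to put the keyword static before the opening { of the block. Also, it would be better to put the code in your main method, so that you can exit if you get a MalformedURLException or an IOException.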
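One way to read that suggestion is sketched below: the download lives in a method that lets exceptions propagate, and main reports the failure and exits instead of going on to parse nothing. The names PageLoader and load, the usage message, and the exit code are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

class PageLoader {

    // Downloads the page at the given address and returns its text.
    // Exceptions propagate to the caller instead of being swallowed.
    static String load(String address) throws IOException {
        // Throws MalformedURLException (a subclass of IOException)
        // if the address cannot be parsed.
        URL url = new URL(address);
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(' ').append(line);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java PageLoader <url>");
            return;
        }
        try {
            String page = load(args[0]);
            System.out.println("Read " + page.length() + " characters");
        } catch (IOException e) {
            // Report the failure and stop: there is nothing to parse.
            System.err.println("Could not read " + args[0] + ": " + e);
            System.exit(1);
        }
    }
}
```

It could be run as java PageLoader http://ocw.mit.edu/courses/ to fetch the page from the question.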

score 1 · Accepted Answer

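Sure, you can solve this task with the limited JDK 1.0 API and run into the problem that Stuart Marks helped you solve in his excellent answer.

Or, you just use a popular de facto standard library such as Apache Commons IO, and read your website into a String with a simple method like this: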

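// using these imports...
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

// run this...
try (InputStream is = new URL("http://ocw.mit.edu/courses/").openStream()) {
    // Passing the charset explicitly avoids the deprecated one-argument
    // overload; UTF-8 is an assumption about the page's encoding.
    theWebPage = IOUtils.toString(is, StandardCharsets.UTF_8);
}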
