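I wrote a small crawler and found that it runs out of heap space (even though I currently limit the list of URLs to 300).
Using the Java memory analyzer, I found that the consumer is char[] (45 MB out of 64 MB, or more if I increase the allowed size; it just keeps growing).
The analyzer also showed me the contents of the char[] arrays: they hold the HTML pages read by the crawler.
Digging deeper with different -Xmx[...]m settings, I found that Java uses almost all of the available space and then fails with out of heap as soon as I try to download an image of about 3 MB.
When I give Java 16 MB, it uses 14 MB and fails; when I give it 64 MB, it uses 59 MB and fails when trying to download a large image.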
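To make the "uses almost all available space" observation easier to reproduce, heap usage could be logged from inside the crawler, for example once per crawled page. This is only a minimal sketch: the class name HeapUsage and its methods are made up for illustration and are not part of my crawler.

```java
public class HeapUsage {
    /** Returns the number of heap bytes currently in use (allocated minus free). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats used and maximum heap in megabytes, e.g. for a log line per page. */
    static String describeHeap() {
        long mb = 1024L * 1024L;
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit when one is set
        return String.format("heap used: %d MB of max %d MB",
                usedHeapBytes() / mb, rt.maxMemory() / mb);
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

Note that a used value close to the maximum does not by itself prove a leak, since the JVM may simply not have collected garbage yet; a value that stays high right after a full collection is the more telling sign.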
Reading a page is done with this code (edited to add .close()):
private String readPage(Website url) throws CrawlerException {
    StringBuffer sourceCodeBuffer = new StringBuffer();
    try {
        URLConnection con = url.getUrl().openConnection();
        con.setConnectTimeout(2000);
        con.setReadTimeout(2000);
        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String strTemp = "";
        try {
            while (null != (strTemp = br.readLine())) {
                sourceCodeBuffer = sourceCodeBuffer.append(strTemp);
            }
        } finally {
            br.close();
        }
    } catch (IOException e) {
        throw new CrawlerException();
    }
    return sourceCodeBuffer.toString();
}
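Since the failure shows up when a large resource is fetched, one variation I could try is capping how much of a response is buffered. The following is a rough sketch under assumptions, not the crawler's actual code: the class BoundedPageReader, the method readAtMost, and the maxChars limit are all hypothetical, and it relies on try-with-resources (Java 7+) and UncheckedIOException (Java 8+). Unlike the readLine() loop above, it keeps line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BoundedPageReader {
    /**
     * Reads at most maxChars characters from the reader, then closes it.
     * maxChars is a hypothetical cap, not something the original crawler has.
     */
    static String readAtMost(Reader in, int maxChars) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(in)) {
            int n;
            while (sb.length() < maxChars
                    && (n = br.read(chunk, 0,
                            Math.min(chunk.length, maxChars - sb.length()))) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // the real crawler would probably throw its own CrawlerException here
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a StringReader stands in for the InputStreamReader of a real connection
        System.out.println(readAtMost(new StringReader("<html>example page</html>"), 6));
    }
}
```

In readPage, the argument would be the InputStreamReader wrapped around con.getInputStream(). This only bounds the size of a single page; it does not explain why old pages stay reachable.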
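Another function uses the returned string in a while loop, but as far as I know the space should be freed as soon as the string is overwritten by the next page.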
public void run() {
    boolean stop = false;
    while (stop == false) {
        try {
            Website nextPage = getNextPage();
            String source = visitAndReadPage(nextPage);
            List<Website> links = new LinkExtractor(nextPage).extract(source);
            List<Website> images = new ImageExtractor(nextPage).extract(source);
            // do something with links and images, source is not used anymore
        } catch (CrawlerException e) {
            logger.warning("could not crawl a url");
        }
    }
}
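Below is an example of the output the analyzer gives me. When I try to see where these char[] arrays are still needed, the analyzer cannot tell. So I assume they are no longer needed and should be garbage collected. Since usage always stays slightly below the maximum, Java does seem to collect garbage, but only as much as is needed to keep the program running (without allowing for a large input that might arrive).
Also, explicitly calling System.gc() every 5 seconds, or even after setting source = null;, did not help.
The website code seems to be kept around in some way for as long as possible.
Am I using something similar to ObjectOutputStream that forces the strings I read to be kept forever? Or how else can Java keep these website Strings in a char[] array for so long?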
Class Name | Shallow Heap | Retained Heap | Percentage
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
char[60750] @ 0xb02c3ee0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.512 | 121.512 | 1,06%
char[60716] @ 0xb017c9b8 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.448 | 121.448 | 1,06%
char[60686] @ 0xb01f3c88 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.384 | 121.384 | 1,06%
char[60670] @ 0xb015ec48 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.352 | 121.352 | 1,06%
char[60655] @ 0xb01d5d08 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.328 | 121.328 | 1,06%
char[60651] @ 0xb009d9c0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.320 | 121.320 | 1,06%
char[60637] @ 0xb022f418 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.288 | 121.288 | 1,06%
Edit
After testing with more memory, I found the following in the dominator tree:
Class Name | Shallow Heap | Retained Heap | Percentage
crawling.Website @ 0xa8d28cb0 | 16 | 759.776 | 0,15%
|- java.net.URL @ 0xa8d289c0 https://www.google.com/recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kN... | 56 | 759.736 | 0,15%
| |- char[379486] @ 0xa8c6f4f8 <!DOCTYPE html><html lang="en"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9"> <title>Google Accounts</title><style type="text/css"> html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl, dt, dd, ol, ul, li, t... | 758.984 | 758.984 | 0,15%
| |- java.lang.String @ 0xa8d28a40 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...| 24 | 624 | 0,00%
| | '- char[293] @ 0xa8d28a58 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl... | 600 | 600 | 0,00%
| |- java.lang.String @ 0xa8d289f8 c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl6YmMgFC77kWZR7vvZIPkS...| 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a10 www.google.com | 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a28 /recaptcha/api/image | 24 | 24 | 0,00%
What I am really wondering: why is the HTML source code held by java.net.URL? Does this come from the URLConnection I opened?
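One possible explanation I want to rule out, stated as an assumption because I have not confirmed what LinkExtractor does internally: on JVMs older than Java 7u6, String.substring() returned a String that shared the char[] of the original, so a short link cut out of a page would keep the whole page's array reachable. A hedged sketch of a defensive copy follows; the class LinkCopy and the method extractHrefs are hypothetical and are not my real extractor.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkCopy {
    /** Extracts href values; each result is copied so it owns a small array of its own. */
    static List<String> extractHrefs(String html) {
        List<String> out = new ArrayList<String>();
        String marker = "href=\"";
        int from = 0;
        while (true) {
            int start = html.indexOf(marker, from);
            if (start < 0) {
                break;
            }
            start += marker.length();
            int end = html.indexOf('"', start);
            if (end < 0) {
                break;
            }
            // new String(...) forces a copy; before Java 7u6 a bare substring
            // shared the char[] of the whole page
            out.add(new String(html.substring(start, end)));
            from = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extractHrefs("<a href=\"/recaptcha/api/image\">x</a>"));
    }
}
```

On Java 7u6 and later, substring() already copies, so the extra new String(...) is redundant there and this explanation would not apply.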