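I have two files: one contains a dictionary of words 3 to 6 letters long, the other a dictionary of 7-letter words. The words are stored in text files, separated by newlines. This method loads a file and inserts its words into an ArrayList that I keep in my application class.

The files are 386 KB and 380 KB in size, and each contains fewer than 200k words.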

private void loadDataIntoDictionary(String filename) throws Exception {
    Log.d(TAG, "loading file: " + filename);
    AssetFileDescriptor descriptor = getAssets().openFd(filename);
    FileReader fileReader = new FileReader(descriptor.getFileDescriptor());
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String word = null;

    int i = 0;

    MyApp appState = ((MyApp)getApplicationContext());

    while ((word = bufferedReader.readLine()) != null) {
        appState.addToDictionary(word);
        word = null;
        i++;
    }
    Log.d(TAG, "added " + i + " words to the dictionary");

    bufferedReader.close();
}

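The program crashes on an emulator running 2.3.3 with a 64MB SD card. These are the errors reported by logcat: the heap grows past 24 MB, and then I see "Clamp target GC heap from 25.XXX to 24.000 MB".

GC_FOR_MALLOC freed 0K, 12% free, external 1657K/2137K, paused 208ms.
GC_CONCURRENT freed XXK, 14% free
Out of memory on a 24-byte allocation, and then FATAL EXCEPTION, memory exhausted.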

How can I load these files without getting such a large heap?

Inside MyApp:

private ArrayList<String> dictionary = new ArrayList<String>();
public void addToDictionary(String word) {
    dictionary.add(word);
}
1 Answer

Irrespective of any other problems/bugs, ArrayList can be wasteful for this kind of storage, because when a growing ArrayList runs out of space it allocates a larger underlying storage array (about 1.5 times the old capacity in the standard implementations, rather than exactly double) and copies the references across. So a sizeable part of that array can sit unused, and during a re-size the old and new arrays exist side by side. If you can pre-size a storage array or ArrayList to the correct size, then you may get a significant saving.

Also (with paranoid data-cleansing hat on) make sure that there's no extra whitespace in your input files - you can use String.trim() on each word if necessary, or clean up the input files first. But I don't think this can be a significant problem given the file sizes you mention.
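
To illustrate the pre-sizing and trimming suggestions, here is a minimal sketch in plain Java. The class name PresizedDictionary and the expectedWords parameter are made up for this example (they are not in the question's code), and the Reader would have to come from the asset in the real app:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;

final class PresizedDictionary {
    // expectedWords is a caller-supplied estimate (hypothetical parameter),
    // e.g. 200000 for the files described in the question.
    static ArrayList<String> load(Reader source, int expectedWords) throws IOException {
        // Allocate the backing array once instead of growing it repeatedly.
        ArrayList<String> words = new ArrayList<String>(expectedWords);
        BufferedReader reader = new BufferedReader(source);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();      // drop stray whitespace
                if (word.length() > 0) {        // skip blank lines
                    words.add(word);
                }
            }
        } finally {
            reader.close();
        }
        words.trimToSize();                     // release any unused capacity
        return words;
    }

    public static void main(String[] args) throws IOException {
        ArrayList<String> words = load(new StringReader("cat\ndog \n\nzebra\n"), 200000);
        System.out.println(words.size() + " words loaded");
    }
}
```

This only removes the slack in the list's backing array; the per-String overhead discussed below is unaffected.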

I'd expect your inputs to take less than 2MB to store the text itself (remember that Java uses UTF-16 internally, so each character typically takes 2 bytes), but there's maybe 1.5MB of overhead for the String object references, plus 1.5MB for the String lengths, and possibly the same again for each of the offset and hashcode fields (take a look at String.java). Whilst 24MB of heap still sounds a little excessive, it's not far off if you are also hit by an unlucky ArrayList re-size, which briefly keeps both the old and the new backing array alive.

In fact, rather than speculate, how about a test? The following code, run with -Xmx24M gets to about 560,000 6-character Strings before stalling (on a Java SE 7 JVM, 64-bit). It eventually crawls up to around 580,000 (with much GC thrashing, I imagine).

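    import java.util.ArrayList;

    public class StringHeapTest {
        public static void main(String[] args) {
            // Run with -Xmx24M; loops until the heap is exhausted.
            ArrayList<String> list = new ArrayList<String>();
            int x = 0;
            while (true) {
                // new String(...) forces a distinct object for every entry
                list.add(new String("123456"));
                if (++x % 1000 == 0) System.out.println(x);
            }
        }
    }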

So I don't think there's a bug in your code - storing large numbers of small Strings is just not very efficient in Java - for the test above it takes over 7 bytes per character because of all the overheads (which may differ between 32-bit and 64-bit machines, incidentally, and depend on JVM settings too)!
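
For reference, the "over 7 bytes per character" figure is just the heap limit divided by the number of characters stored at the point where the test stalled. A small sketch of that arithmetic (the class and method names are made up for illustration, and the result is an upper bound, since the 24MB heap also holds the ArrayList and everything else in the JVM):

```java
final class BytesPerChar {
    // Heap bytes divided by the total number of characters stored.
    static double bytesPerChar(long heapBytes, long strings, int charsPerString) {
        return (double) heapBytes / (strings * (long) charsPerString);
    }

    public static void main(String[] args) {
        long heap = 24L * 1024 * 1024;   // -Xmx24M
        // About 560,000 six-character Strings fitted before the test stalled.
        System.out.println(bytesPerChar(heap, 560000, 6) + " bytes per character");
    }
}
```

With those numbers the result comes out at roughly 7.5 bytes per character, against 2 bytes for the raw UTF-16 data.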

You might get slightly better results by storing an array of byte arrays rather than an ArrayList of Strings. There are also more efficient data structures for storing strings, such as tries.
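
A rough sketch of the byte-array idea, assuming the words are plain ASCII (any other character would be stored as '?'). The class name ByteDictionary and its methods are invented for this example; sorting the arrays once allows lookups by binary search:

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Comparator;

final class ByteDictionary {
    // Assumption: dictionary words are ASCII, so one byte per character is enough.
    private static final Charset ASCII = Charset.forName("US-ASCII");

    // Orders byte arrays byte by byte, shorter prefix first.
    private static final Comparator<byte[]> BY_BYTES = new Comparator<byte[]>() {
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                if (a[i] != b[i]) {
                    return a[i] - b[i];
                }
            }
            return a.length - b.length;
        }
    };

    private final byte[][] words;

    ByteDictionary(String[] input) {
        words = new byte[input.length][];
        for (int i = 0; i < input.length; i++) {
            words[i] = input[i].getBytes(ASCII);   // 1 byte per char instead of 2
        }
        Arrays.sort(words, BY_BYTES);              // sorted once, searched many times
    }

    boolean contains(String word) {
        return Arrays.binarySearch(words, word.getBytes(ASCII), BY_BYTES) >= 0;
    }

    public static void main(String[] args) {
        ByteDictionary d = new ByteDictionary(new String[] {"zebra", "cat", "dog"});
        System.out.println(d.contains("cat"));
        System.out.println(d.contains("cow"));
    }
}
```

Each byte[] still carries its own object header, so the saving is real but modest; packing all words into one large byte[] with an index of offsets, or using a trie, would cut the per-word overhead further.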

answered 2012-10-28T23:05:07.867