java - Splitting strings runs out of memory

Question

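I have a large amount of tab-separated text data in the format DATE NAME MESSAGE. By large, I mean a 1.76 GB collection split across 1075 actual files. I have to get the NAME data from all of the files. So far I have this: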

        File f = new File(directory);
        File files[] = f.listFiles();
        // HashSet<String> all = new HashSet<String>();
        ArrayList<String> userCount = new ArrayList<String>();
        for (File file : files) {
            if (file.getName().endsWith(".txt")) {
                System.out.println(file.getName());
                BufferedReader in;
                try {
                    in = new BufferedReader(new FileReader(file));
                    String str;
                    while ((str = in.readLine()) != null) {
                        // if (all.add(str)) {
                        userCount.add(str.split("\t")[1]);
                        // }

                        // if (all.size() > 500)
                        // all.clear();
                    }
                    in.close();
                } catch (IOException e) {
                    System.err.println("Something went wrong: "
                            + e.getMessage());
                }

            }
        }

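Even with -Xmx1700, my program always throws an out-of-memory exception, and I can't go any higher than that. Is there any way I can optimize the code so that it can handle an ArrayList<String> of NAMEs?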

score 3 · Accepted Answer

Since you seem open to alternative solutions outside of Java, here is an awk one-liner that should handle it.

cat *.txt | awk -F'\t' '{sum[$2] += 1} END {for (name in sum) print name "," sum[name]}'

Explanation:

-F'\t' - separate on tabs
sum[$2] += 1 - increment the value for the second element (name)

Associative arrays make this very concise. I ran it on a test file that I generated as follows:

import random

def main():
    names = ['Nick', 'Frances', 'Carl']
    for i in range(10000):
        date = '2012-03-24'
        name = random.choice(names)
        message = 'asdf'
        print('%s\t%s\t%s' % (date, name, message))

if __name__ == '__main__':
    main()

I get the result:

Carl,3388
Frances,3277
Nick,3335
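
For readers who would rather stay in Java, the same associative-array idea can be sketched with a HashMap. This is only a minimal sketch under stated assumptions: the class name NameCounter and the method countNames are hypothetical, the files are assumed to be UTF-8, and lines without a second field are skipped. Only one line and one counter per distinct name are held in memory, rather than one string per input line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original question or answers.
public class NameCounter {

    // Counts how often each NAME (second tab-separated field) occurs
    // in the *.txt files directly inside the given directory.
    public static Map<String, Integer> countNames(Path directory) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(directory, "*.txt")) {
            for (Path file : files) {
                // Assumption: the files are UTF-8 encoded.
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // Split into at most three parts: DATE, NAME, MESSAGE.
                        String[] fields = line.split("\t", 3);
                        if (fields.length < 2) {
                            continue; // assumption: malformed lines are ignored
                        }
                        counts.merge(fields[1], 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path directory = Paths.get(args.length > 0 ? args[0] : ".");
        // Same "name,count" output shape as the awk command.
        countNames(directory).forEach((name, count) -> System.out.println(name + "," + count));
    }
}
```

As with awk, the iteration order of the map is unspecified, so the output lines may appear in any order.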

score 1 · Accepted Answer

There are a few things you can do to improve your code's memory footprint and overall performance:

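Close your FileReader objects before moving on to the next file. FileReader is an InputStreamReader, which needs a call to close() in order to release its resources. Your current code only calls close() on the success path, so a stream is left open whenever an exception is thrown while reading. Closing in a finally block avoids that: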

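for (File file : files) {
    BufferedReader in = null;
    try {
        in = new BufferedReader(new FileReader(file));
        // TODO do whatever you want here.
    } finally {
        // Runs even if reading throws, so the stream is always released.
        // close() can itself throw IOException, which the enclosing
        // method has to catch or declare.
        if (in != null) {
            in.close();
        }
    }
}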

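If possible, avoid storing all of the NAME values in the userCount ArrayList. As ARS suggested, you could write this information to another file first and then read that file back whenever you need the data again. If that isn't an attractive option, you could still write the information to an OutputStream and pipe it to an InputStream elsewhere in your application. This still keeps your data in memory, but whatever consumes the list of NAME values can start processing/displaying/whatever right away, while you continue reading through those 1,000+ files for more NAME values.

Use the listFiles(FileFilter) method so that Java filters out the non-text files for you. This should save a few CPU cycles, because your loop no longer has to visit files with the wrong extension only to discard them.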
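
The last two suggestions can be combined in one hedged sketch: a FileFilter selects the .txt files, and each NAME is written straight to an output file instead of being collected in a list. The names NameExtractor, TEXT_FILES and writeNames are hypothetical, and the readers and writers use the platform default charset, as FileReader does in the question. The try-with-resources statements (Java 7 and later) are a shorter form of try/finally with an explicit close().

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical helper class, not part of the original answer.
public class NameExtractor {

    // Accepts only regular files whose name ends in ".txt".
    static final FileFilter TEXT_FILES = new FileFilter() {
        @Override
        public boolean accept(File candidate) {
            return candidate.isFile() && candidate.getName().endsWith(".txt");
        }
    };

    // Writes one NAME per line to the output file and returns how many were written.
    public static long writeNames(File directory, File output) throws IOException {
        File[] files = directory.listFiles(TEXT_FILES);
        if (files == null) {
            // listFiles returns null if the path is not a directory or cannot be read.
            throw new IOException("Not a readable directory: " + directory);
        }
        long written = 0;
        try (BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
            for (File file : files) {
                // Each reader is closed before the next file is opened.
                try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] fields = line.split("\t", 3);
                        if (fields.length >= 2) { // assumption: skip malformed lines
                            out.write(fields[1]);
                            out.newLine();
                            written++;
                        }
                    }
                }
            }
        }
        return written;
    }
}
```

Keeping the output file outside the input directory, or giving it an extension other than .txt, prevents it from being picked up as input on a later run.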

score 1 · Accepted Answer

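On JDKs before Java 7 update 6, String.split returns strings that internally share the same character array as the original string. The unused characters of each line are therefore not garbage collected for as long as the NAME substring is referenced.

Try using new String(str.split("\t")[1]) to force allocation of a new array.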
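
A minimal sketch of that idea, which also avoids building the whole array of fields: the class NameField and the method extractName are hypothetical names chosen for illustration. On newer JDKs substring already copies, so the extra new String is harmless but no longer needed there.

```java
// Hypothetical helper class, not part of the original answer.
public class NameField {

    // Returns a standalone copy of the second tab-separated field,
    // or null if the line contains no tab at all.
    public static String extractName(String line) {
        int first = line.indexOf('\t');
        if (first < 0) {
            return null;
        }
        int second = line.indexOf('\t', first + 1);
        int end = (second < 0) ? line.length() : second;
        // new String(...) copies just these characters, so the result does not
        // keep the whole line alive on JDKs where substring shares the array.
        return new String(line.substring(first + 1, end));
    }
}
```

In the loop from the question this would replace str.split("\t")[1], for example userCount.add(NameField.extractName(str)), with a null check if malformed lines are possible.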
