
I have a huge file of data (~8GB / ~80 million records). Every record has 6-8 attributes, separated by single tabs. I'd like to copy some of the given attributes of each record into another file, so I'm looking for something more elegant than code like the following, which, for example, picks out the second and the last of four tokens in total:

StringTokenizer st = new StringTokenizer(line, "\t");
st.nextToken(); //get rid of the first token
System.out.println(st.nextToken()); //show me the second token
st.nextToken(); //get rid of the third token
System.out.println(st.nextToken()); //show me the fourth token

Let me stress that this is a huge file, so I have to avoid any superfluous if checks.


5 Answers


Your question got me wondering about performance. Lately I've been using Guava's Splitter wherever I can, simply because I like the syntax. I'd never measured its performance, so I put together a quick test of four parsing styles. I threw these together quickly, so forgive mistakes in style and edge-case correctness. They are based on the understanding that we are only interested in the second and fourth items.

What I found interesting is that the "homeGrown" (really crude code) solution is the fastest when parsing a 350MB tab-delimited text file (with four columns), e.g.:

head test.txt 
0   0   0   0
1   2   3   4
2   4   6   8
3   6   9   12

Running over the 350MB of data on my laptop, I got the following results:

  • homeGrown: 2271ms
  • guavaSplit: 3367ms
  • regex: 7302ms
  • tokenize: 3466ms

Given that, I think I'll stick with Guava's Splitter for most work and consider custom code for very large data sets.

  // imports assumed: java.util.List, java.util.StringTokenizer,
  // java.util.regex.Pattern, java.util.regex.Matcher,
  // com.google.common.base.Splitter, com.google.common.collect.Lists
  public static List<String> tokenize(String line){
    List<String> result = Lists.newArrayList();
    StringTokenizer st = new StringTokenizer(line, "\t");
    st.nextToken(); //get rid of the first token
    result.add(st.nextToken()); //show me the second token
    st.nextToken(); //get rid of the third token
    result.add(st.nextToken()); //show me the fourth token
    return result;
  }

  static final Splitter splitter = Splitter.on('\t');
  public static List<String> guavaSplit(String line){
    List<String> result = Lists.newArrayList();
    int i=0;
    for(String str : splitter.split(line)){
      if(i==1 || i==3){
        result.add(str);
      }
      i++;
    }
    return result;
  }

  static final Pattern p = Pattern.compile("^(.*?)\\t(.*?)\\t(.*?)\\t(.*)$");
  public static List<String> regex(String line){
    List<String> result = null;
    Matcher m = p.matcher(line);
    if(m.find()){
      if(m.groupCount()>=4){
        result= Lists.newArrayList(m.group(2),m.group(4));
      }
    }
    return result;
  }

  public static List<String> homeGrown(String line){
    List<String> result = Lists.newArrayList();
    String subStr = line;
    int cnt = -1;
    int indx = subStr.indexOf('\t');
    while(++cnt < 4 && indx != -1){
      if(cnt==1||cnt==3){
        result.add(subStr.substring(0,indx));
      }
      subStr = subStr.substring(indx+1);
      indx = subStr.indexOf('\t');
    }
    if(cnt==1||cnt==3){
      result.add(subStr);
    }
    return result;
  }

Note that all of these would probably get slower with proper bounds checking and a more elegant implementation.
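For readers who want to reproduce numbers like these, here is a minimal sketch of a timing harness over synthetic four-column rows. The row count, the `time` helper, and the split-based parser are illustrative, not the author's actual benchmark code:

```java
import java.util.List;
import java.util.function.Function;

public class Bench {
    // Time a parser over synthetic four-column rows shaped like test.txt.
    static long time(Function<String, List<String>> parser, int rows) {
        long start = System.nanoTime();
        for (int i = 0; i < rows; i++) {
            parser.apply(i + "\t" + 2 * i + "\t" + 3 * i + "\t" + 4 * i);
        }
        return (System.nanoTime() - start) / 1_000_000; // elapsed ms
    }

    public static void main(String[] args) {
        // A split-based parser that keeps only the second and fourth fields.
        Function<String, List<String>> split = line -> {
            String[] f = line.split("\t");
            return List.of(f[1], f[3]);
        };
        System.out.println("split: " + time(split, 100_000) + "ms");
    }
}
```

A real run would read the 350MB file line by line instead of generating rows, but the shape of the measurement is the same.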

answered 2012-10-13T23:17:46.763

You should probably use the unix cut utility, as Paul Tomblin says.

However, in Java you could also try:

String[] fields = line.split("\t");
System.out.println(fields[1]+" "+fields[3]);

Whether this is more 'elegant' is a matter of opinion. Whether it's faster on large files, I don't know - you would need to benchmark it on your system.

Relative performance will also depend on how many fields there are per line, and which fields you want; split() will process the whole line at once, but StringTokenizer will work through the line incrementally (good if you only want fields 2 and 4 out of 20, for example).
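One middle ground worth noting: `String.split` also takes a second `limit` argument, so when only the second and fourth of many fields are needed, a limit of 5 stops the scan after the fourth tab and dumps the unsplit remainder into the last element. A small sketch (the column values are made up):

```java
public class SplitLimit {
    public static void main(String[] args) {
        String line = "c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8";
        // With limit 5, the tail "c5...c8" lands unscanned in fields[4],
        // while fields[1] and fields[3] are still the second and fourth fields.
        String[] fields = line.split("\t", 5);
        System.out.println(fields[1] + " " + fields[3]); // prints "c2 c4"
    }
}
```

This keeps the incremental advantage StringTokenizer has on wide rows while staying with the plain array-indexing style.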

answered 2012-10-13T21:00:58.713

Although your data file is huge, it sounds like your question is more about how to conveniently access items in a line of text, where the items are separated by tab. I think StringTokenizer is overkill for a format this simple.

I would use some type of "split" to convert the line into an array of tokens. I prefer the StringUtils split in commons-lang over String.split, especially when a regular expression is not needed. Since a tab is "whitespace", you can use the default split method without specifying the delimiter:

String [] items = StringUtils.split(line);
if (items != null && items.length > 6)
{
    System.out.println("Second: " + items[1]  + "; Fourth: " + items[3]);
}
answered 2012-10-13T21:01:51.077

If you are using readLines, you are effectively scanning the file twice: 1) you search the file one character at a time for the end-of-line characters, and 2) you then scan each line for tabs.

You could look at one of the CSV libraries. From memory, flatpack does only a single scan. These libraries may give better performance (though I have never tested it).

A couple of Java libraries:

  • Java Csv library
  • flatpack
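The single-scan idea can also be sketched in plain Java: stream the characters once, emitting fields as tabs and newlines go by, instead of first building lines and then re-splitting them. This is only an illustration of the idea (it assumes a trailing newline and hard-codes the second and fourth columns), not flatpack's actual implementation:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class OneScan {
    // Emit fields 2 and 4 of every row in a single pass over the stream.
    static void extract(Reader in, StringBuilder out) throws IOException {
        StringBuilder field = new StringBuilder();
        int col = 0, ch;
        while ((ch = != -1) {
            if (ch == '\t' || ch == '\n') {
                if (col == 1 || col == 3) {
                    out.append(field).append(ch == '\n' ? '\n' : ' ');
                }
                field.setLength(0);
                col = (ch == '\n') ? 0 : col + 1;
            } else {
                field.append((char) ch);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder out = new StringBuilder();
        extract(new StringReader("1\t2\t3\t4\n5\t6\t7\t8\n"), out);
        System.out.print(out); // prints "2 4" and "6 8" on two lines
    }
}
```

In practice you would wrap the file in a BufferedReader for I/O efficiency; the point is that no intermediate line strings or token arrays are allocated.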

answered 2012-10-13T22:03:21.077

If your file is huge, then besides speed you will also face memory-consumption problems, since you have to load the file into memory to operate on it.

I have an idea, but be aware that it is platform-specific and breaks Java's portability.

You can run unix commands from Java to gain a lot in speed and memory consumption. For example:

    public static void main(final String[] args) throws Exception {
        // Pipes and >> redirection only work inside a shell, so invoke one
        // explicitly; the awk program must also be quoted so the shell
        // passes it through intact.
        Runtime.getRuntime().exec(new String[]{
                "/bin/sh", "-c", "awk '{print $1}' <file> >> myNewFile.txt"});
    }
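Note that `Runtime.exec` does not run the command through a shell, so a pipe or `>>` redirection in a plain command string would be passed to the program as literal arguments. `ProcessBuilder` (Java 7+) can express the redirection directly without a shell; a sketch using `cut`, where `test.txt` and `out.txt` are illustrative names:

```java
import java.io.File;
import java.io.IOException;

public class CutFields {
    public static void main(String[] args) throws IOException, InterruptedException {
        // cut -f2,4 prints tab-separated fields 2 and 4 of each line,
        // appending the output to out.txt without loading the file into the JVM.
        Process p = new ProcessBuilder("cut", "-f2,4", "test.txt")
                .redirectOutput(ProcessBuilder.Redirect.appendTo(new File("out.txt")))
                .start();
        p.waitFor();
    }
}
```

This keeps the streaming done entirely by the external tool, so the Java heap stays flat regardless of file size.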
answered 2012-10-14T00:24:57.030