How can I count occurrences of a string in a tab-separated values (TSV) file?

The TSV file has several hundred million lines, each of the form:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

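How can I count the occurrences of each unique integer across the entire second column of the file, ideally appending the count as a fifth value on each line?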

foobar1  1  xxx   yyy  1
foobar1  2  xxx   yyy  2
foobar2  2  xxx   yyy  2
foobar2  3  xxx   yyy  2
foobar1  3  xxx   zzz  2

I would prefer a solution that uses only UNIX command-line stream-processing programs.

2 Answers

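It isn't entirely clear to me what you want to do. Do you want to add 0/1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, totaled over the whole file?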

In the first case, use something like awk -F'\t' -v valueToCheck=2 '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file, where 2 stands in for the value you want to test.

In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
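
Neither command above appends the count to every line, which is what the question asks for. A minimal sketch of one way to do that with awk is to read the file twice: tally the second column on the first pass, then print each line with its tally on the second. The function name append_col2_counts is hypothetical, and the sketch assumes the input is a regular file (not a pipe) and that one counter per distinct value fits in memory.

```shell
# Hypothetical helper: append the per-value count of column 2 to each line.
# The file is read twice, so it must be a regular file rather than a pipe.
append_col2_counts() {
    awk -F'\t' '
        NR == FNR { count[$2]++; next }   # first pass: tally column 2
        { print $0 "\t" count[$2] }       # second pass: append the tally
    ' "$1" "$1"
}
```

Called as append_col2_counts file, it writes the augmented lines to standard output in their original order. Memory use grows with the number of distinct values in the second column, not with the number of lines.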

answered 2012-05-05T19:00:12.077

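A solution using perl that assumes the values of the second column are sorted; that is, once the value 2 is found, all lines with that same value are consecutive. The script holds lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so the size of the input file should not be a problem as long as no single run of equal values is too large to hold in memory: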

Contents of script.pl:

use warnings;
use strict;

my (@lines, $key);

## Print every buffered line with the size of its group appended as the
## last field, then empty the buffer.
sub flush_group {
    my $count = @lines;
    print "$_\t$count\n" for @lines;
    @lines = ();
}

while ( <> ) {

    ## Remove last '\n'.
    chomp;

    ## Split line on whitespace.
    my @f = split;

    ## Treat the line as malformed if it doesn't have four fields and skip it.
    next unless @f == 4;

    ## The value of the second column has changed, so print the lines saved
    ## for the previous value before starting a new group.
    flush_group() if defined $key and $f[1] ne $key;

    ## Save the current line; only one group is buffered at a time.
    $key = $f[1];
    push @lines, $_;
}

## Print the last group. Doing this after the loop also covers a final line
## whose value differs from the one before it.
flush_group();

Contents of infile:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

Run it like this:

perl script.pl infile

With the following output:

foobar1  1  xxx   yyy   1
foobar1  2  xxx   yyy   2
foobar2  2  xxx   yyy   2
foobar2  3  xxx   yyy   2
foobar1  3  xxx   zzz   2
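
If the input is not already grouped by the second column, one option is to sort it first so that equal values become consecutive. The sketch below assumes a genuinely tab-delimited file and a POSIX sort; the function name sort_by_col2 is hypothetical. Sorting changes the line order, so it only fits if the original order does not need to be preserved.

```shell
# Hypothetical helper: order a TSV file numerically by its second column
# so that lines sharing a value end up next to each other.
sort_by_col2() {
    sort -t "$(printf '\t')" -k2,2n "$1"
}
```

For example, sort_by_col2 infile | perl script.pl feeds the grouped lines to the script through standard input.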
answered 2012-05-08T08:41:40.150