csv - How do I count instances of a string in a tab-separated values file?

Question

How do I count instances of a string in a tab-separated values (TSV) file?

The TSV file has several hundred million lines, each of the form

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

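How can I count the instances of each unique integer across the entire second column of the file, ideally appending the count as a fifth value on each line?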

foobar1  1  xxx   yyy  1
foobar1  2  xxx   yyy  2
foobar2  2  xxx   yyy  2 
foobar2  3  xxx   yyy  2
foobar1  3  xxx   zzz  2

I would prefer a solution that uses only UNIX command-line stream-processing programs.

score 1 · Accepted Answer

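It is not entirely clear to me what you want to do. Do you want to add a 0/1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, totalled over the whole file?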

In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file.
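Note that valueToCheck is an ordinary awk variable, so it has to be given a value from the shell, for example with -v. A minimal sketch, assuming the value to flag is 2 and using a small temporary sample file (the variable names sample_flag and flagged are only for illustration):

```shell
# Build a small tab-separated sample in a temporary file.
sample_flag=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    foobar1 1 xxx yyy \
    foobar1 2 xxx yyy \
    foobar2 3 xxx yyy > "$sample_flag"

# -v assigns the awk variable before any input is read.
# The fifth column is 1 when column 2 equals valueToCheck, else 0.
flagged=$(awk -F'\t' -v valueToCheck=2 \
    '{ print $0 "\t" ($2 == valueToCheck ? 1 : 0) }' "$sample_flag")

printf '%s\n' "$flagged"
rm -f "$sample_flag"
```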

In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
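If the goal is the per-line count shown in the question, the same counting array can be combined with a second pass over the file. This is only a sketch. It assumes the input is a regular file that awk can read twice (not a pipe) and that the set of distinct second-column values fits in memory. The names sample_cnt and counted are only for illustration:

```shell
# Build the question's sample data, tab-separated, in a temporary file.
sample_cnt=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    foobar1 1 xxx yyy \
    foobar1 2 xxx yyy \
    foobar2 2 xxx yyy \
    foobar2 3 xxx yyy \
    foobar1 3 xxx zzz > "$sample_cnt"

# The file is named twice. While NR == FNR awk is still in the first pass
# and only counts; in the second pass it prints each line plus its count.
counted=$(awk -F'\t' \
    'NR == FNR { h[$2]++; next } { print $0 "\t" h[$2] }' \
    "$sample_cnt" "$sample_cnt")

printf '%s\n' "$counted"
rm -f "$sample_cnt"
```

Only the counts are held in memory, not the lines, so the cost is one extra read of the file.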

score 0 · Accepted Answer

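One solution using perl, assuming the values of the second column are sorted; that is, once the value 2 is found, all lines with that same value are consecutive. The script holds lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so memory use depends on the longest run of equal values rather than on the size of the input file:

Contents of script.pl: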

use warnings;
use strict;

my (%lines, $count);

while ( <> ) { 

    ## Remove last '\n'.
    chomp;

    ## Split the line on whitespace.
    my @f = split;

    ## Treat a line that does not have exactly four fields as malformed and skip it.
    next unless @f == 4;

    ## Save lines in a hash until a different value is found in the second
    ## column. The first valid line is special, because the hash is still empty.
    ## The hash only ever holds one key at a time.
    if ( exists $lines{ $f[1] } or ! %lines ) {
        push @{ $lines{ $f[1] } }, $_;
        ++$count;
        next;
    }

    ## At this point the second field has changed, so print the previous lines
    ## saved in the hash, remove them and begin saving lines with the new value.

    ## The value of the second column is the only key of the hash, get it now.
    my ($key) = keys %lines;

    ## Print each saved line, appending the number of repeated lines as the
    ## last field.
    while ( @{ $lines{ $key } } ) {
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }

    ## Clear hash.
    %lines = ();

    ## Add current line to hash, reset the counter and repeat the whole
    ## process until end of file.
    push @{ $lines{ $f[1] } }, $_;
    $count = 1;
}

## Print the last group, which is still held in the hash when input ends
## (this also covers a final line whose value differs from the previous one).
for my $key ( keys %lines ) {
    printf qq[%s\t%d\n], $_, $count for @{ $lines{ $key } };
}

Contents of infile:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

Run it like this:

perl script.pl infile

With the following output:

foobar1  1  xxx   yyy   1
foobar1  2  xxx   yyy   2
foobar2  2  xxx   yyy   2
foobar2  3  xxx   yyy   2
foobar1  3  xxx   zzz   2
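If the file is not already grouped by the second column, sort can group it first, and the same buffer-and-flush idea can be written in awk. This is a rough sketch rather than a drop-in replacement for the script above. It assumes tab-separated input and that sorting a file of this size is acceptable; the names emit, buf, unsorted and grouped are only for illustration:

```shell
# Sample data with the second column deliberately out of order.
unsorted=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    foobar2 3 xxx yyy \
    foobar1 2 xxx yyy \
    foobar1 1 xxx yyy \
    foobar2 2 xxx yyy \
    foobar1 3 xxx zzz > "$unsorted"

tab=$(printf '\t')

# Sort numerically on column 2, then buffer each run of equal values and
# print it with the run length appended once the value changes.
grouped=$(LC_ALL=C sort -t "$tab" -k2,2n "$unsorted" | awk -F'\t' '
    # Print every buffered line followed by the group size, then reset.
    function emit(   i) { for (i = 1; i <= n; i++) print buf[i] "\t" n; n = 0 }
    NR > 1 && $2 != prev { emit() }
    { buf[++n] = $0; prev = $2 }
    END { emit() }
')

printf '%s\n' "$grouped"
rm -f "$unsorted"
```

The lines come out in sorted order rather than in their original order, which is the price of grouping them with sort.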
