sorting - Sorting and summing large data files

Question

I have to process files that sort doesn't seem to be able to handle. The files are approx. 3 GB each.

The input looks like this:

last-j  nmod+j+n    year-n 9492
last-j  nmod+j+n    night-n 8075
first-j nmod+j+n-the    time-n 7749
same-j  nmod+j+n-the    time-n 7530
other-j nmod+j+n-the    hand-n 5319
ast-j   nmod+j+n   year-n 1000
last-j   nmod+j+n   night-n 5000
first-j   nmod+j+n-the   time-n 1000
same-j   nmod+j+n-the   time-n 3000
other-j   nmod+j+n-the   hand-n 200

I need to sum up the counts of the corresponding duplicates.

So the desired output is as follows:

   last-j   nmod+j+n    year-n 10492
    last-j  nmod+j+n    night-n 13075
    first-j nmod+j+n-the    time-n 8749
    same-j  nmod+j+n-the    time-n 10530
    other-j nmod+j+n-the    hand-n 5519

I have been trying this sort command, which should do the trick:

sort input | uniq -c | awk '{print $2 "\t" $3 "\t" $1*$4}'

and it runs out of memory. Any suggestions for something more optimized that can handle larger data files? Thanks

score 2 · Accepted Answer

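sort and the other purely magical UNIX tools are probably as optimized as they can be. If you are counting entries in a file and their unique occurrences do not fit in memory, then loading them into memory is not a good solution. Otherwise it is the fastest approach.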

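Apart from that, sorting the file, which is O(n log n), and then counting the entries, which is O(n), will surely be the best solution, unless you keep a map of k entries in memory and swap data from memory to disk whenever a (k + 1)-th key is about to be added to the map. With that in mind, your solution (the one-liner with sort + uniq + awk) only needs a slight tweak.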

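Try sorting the file externally, using sort's magic ability to do so. After that, the counting needs to keep at most one entry in memory, which pretty much solves your problem. The final two-liner could be something like: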

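sort -T <directory_for_temp_files> <input> > <output>
awk '{
    key = $1 " " $2 " " $3                      # group on the first three columns
    if (key == cur) { freq += $4; }             # same key as the previous line
    else { if (NR > 1) printf "%s %d\n", cur, freq; cur = key; freq = $4; }
}
END { if (NR > 0) printf "%s %d\n", cur, freq; }   # flush the last group
' < <output> > <final_output>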
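The alternative mentioned above, a map capped at k keys that spills to disk, can be approximated with a two-pass pipeline. The following is only a sketch under assumptions: `sum_bounded` is a hypothetical helper name, the cap is passed as the first argument, and the "spill" is simply a stream of partial sums handed to sort (which uses its own temporary files) rather than a hand-managed swap file.

```shell
# sum_bounded K FILE  (hypothetical helper, not from the answer)
# Pass 1 keeps at most K distinct keys in an awk array and writes partial
# sums downstream whenever a new key would exceed that cap.
# Pass 2 sorts the partial sums and adds up adjacent lines with the same key.
sum_bounded() {
    awk -v k="$1" '
        function spill(   key) {
            for (key in sum) print key, sum[key]
            split("", sum)                 # portable way to empty the array
            n = 0
        }
        {
            key = $1 " " $2 " " $3
            if (!(key in sum)) {           # a key not currently held in memory
                if (n >= k) spill()        # cap reached: write partial sums first
                n++
            }
            sum[key] += $4
        }
        END { spill() }
    ' "$2" |
    LC_ALL=C sort |
    awk '
        {
            key = $1 " " $2 " " $3
            if (NR > 1 && key != cur) printf "%s %d\n", cur, total
            if (key != cur) { cur = key; total = 0 }
            total += $4
        }
        END { if (NR > 0) printf "%s %d\n", cur, total }
    '
}
```

It would be called as, for example, `sum_bounded 1000000 input > final_output`. The first pass only shrinks the data before it is sorted, so the benefit depends on how often keys repeat within a window of k keys. `LC_ALL=C` is there so that lines sharing a key end up adjacent under plain byte ordering.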

score 2 · Accepted Answer

Using awk arrays you can do it all in one go, with no need for sort and uniq:

$ awk '{a[$1,$2,$3]+=$4} END{for (i in a) print i, a[i]}' file
first-jnmod+j+n-thetime-n 8749
ast-jnmod+j+nyear-n 1000
same-jnmod+j+n-thetime-n 10530
last-jnmod+j+nnight-n 13075
last-jnmod+j+nyear-n 9492
other-jnmod+j+n-thehand-n 5519

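Because this uses columns 1, 2 and 3 together as the array index, they are printed run together (awk joins the subscripts with SUBSEP, which is a non-printing character by default). Storing the columns in a second array solves this: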

$ awk '{a[$1,$2,$3]+=$4; b[$1,$2,$3]=$1" "$2" "$3} END{for (i in a) print b[i], a[i]}' file
first-j nmod+j+n-the time-n 8749
ast-j nmod+j+n year-n 1000
same-j nmod+j+n-the time-n 10530
last-j nmod+j+n night-n 13075
last-j nmod+j+n year-n 9492
other-j nmod+j+n-the hand-n 5519
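A possible variation avoids the second array by splitting the combined index back apart on SUBSEP when printing, which keeps only one copy of each key in memory. This is a sketch, not part of the original answer, and `sum_by_key` is a hypothetical helper name:

```shell
# sum_by_key FILE  (hypothetical helper)
# Sums column 4 per (column 1, column 2, column 3) using a single array.
sum_by_key() {
    awk '
        { sum[$1, $2, $3] += $4 }          # subscripts are joined with SUBSEP
        END {
            for (k in sum) {
                split(k, part, SUBSEP)     # recover the three columns from the index
                print part[1], part[2], part[3], sum[k]
            }
        }
    ' "$1"
}
```

As with the commands above, `for (k in sum)` visits keys in no particular order, so pipe the result through sort if the order matters. The whole set of distinct keys still has to fit in memory.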

score 1 · Accepted Answer

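If this is running out of memory, it is because of sort, since uniq and awk consume only a constant amount of memory. You can run multiple sorts in parallel with GNU parallel, for example (from the manual):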

cat bigfile | parallel --pipe --files sort | parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort

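Here bigfile is split into blocks of around 1MB, each block ending in '\n' (which is the default for --recend). Each block is passed to sort, and the output from sort is saved into files. These files are passed to the second parallel, which runs sort -m on the files before it removes them. The output is saved to bigfile.sort.

Once the file is sorted, you can stream it through the uniq/awk pipeline you were using, for example: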
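The same split, sort, merge structure can be sketched with plain coreutils when GNU parallel is not available. This version is an illustration under assumptions: `chunked_sort` is a hypothetical helper name, chunks are cut by line count rather than by size, the chunks are sorted one after another (so the parallelism is lost), and the input is assumed to be non-empty.

```shell
# chunked_sort INPUT OUTPUT [LINES_PER_CHUNK]  (hypothetical helper)
chunked_sort() {
    input=$1; output=$2; lines=${3:-1000000}
    tmpdir=$(mktemp -d) || return 1
    # 1. Cut the input into chunks on line boundaries.
    split -l "$lines" "$input" "$tmpdir/chunk."
    # 2. Sort every chunk on its own, in place; each chunk is small.
    for f in "$tmpdir"/chunk.*; do
        LC_ALL=C sort -o "$f" "$f"
    done
    # 3. Merge the already-sorted chunks; -m merges without re-sorting.
    LC_ALL=C sort -m -o "$output" "$tmpdir"/chunk.*
    rm -rf "$tmpdir"
}
```

For example, `chunked_sort bigfile bigfile.sort` would leave the sorted result in bigfile.sort. The sorting and the merge must use the same locale, which is why both calls set `LC_ALL=C`.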

cat bigfile.sort | uniq -c | awk '{print $2 "\t" $3 "\t" $1*$4}'
