linux - How to find unique lines in a large file?

Question

I have a large file (4 billion lines) with one word per line. I want to get the list of unique words together with the count for each.

I tried:

sort largefile |uniq -c >outfile

But it is still running and has produced no output.

Then I tried:

awk '!arr[$1]++' largefile >outfile

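But it does not print the counts. How can I use awk to print the counts? Or is there any other approach that can handle large files?

Edit: there are about 17 million unique words in the file.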

score 3 · Accepted Answer

Your first example is fine. Try:

sort largefile | uniq -c

It will take some time.
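
Most of that time goes into the sort. Here is a minimal sketch of one way to tune it, assuming GNU sort (-S, --parallel and -T are GNU options) and using a tiny generated file in place of largefile. The buffer size and thread count are placeholder values, not recommendations.

```shell
# Build a tiny stand-in for largefile so the sketch can be run as-is.
workdir=$(mktemp -d)
printf '%s\n' apple pear apple fig pear apple > "$workdir/largefile"

# LC_ALL=C   : compare raw bytes instead of locale collation (often much faster)
# -S 64M     : in-memory buffer size; raise it on a machine with more RAM
# --parallel : number of sort threads (GNU sort only)
# -T         : directory for temporary spill files; it needs enough free space
LC_ALL=C sort -S 64M --parallel=2 -T "$workdir" "$workdir/largefile" \
  | LC_ALL=C uniq -c > "$workdir/outfile"

cat "$workdir/outfile"
```

Each output line is a count followed by the word, which is the format the question asks for.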

Answered 2013-08-13T00:37:43.500

score 3 · Accepted Answer

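uniq is the way to go, but 4 billion lines is simply a lot of lines. If you need this information often, I would set up a database and import the lines into it. That would speed up queries by letting them use an index. Still, 4 billion lines is a lot of lines.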

score 2 · Accepted Answer

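Preprocess the file with split -l 1000000, which breaks it into files of 1,000,000 lines each (about 4,000 files for 4 billion lines), and sort each of them with sort -u. Merge them back into one big file and run sort -u on it again.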

## if you use just this, you need gawk 4.1
#
gawk '{key[$0]++;} END { for( word in key ) { print key[word] "\t" word }}' bigfile
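
The question mentions roughly 17 million unique words, and an in-memory count like the one above needs one array entry per distinct word, not per input line. The following is a small sketch of the same idea using only POSIX awk features, with the result ordered by descending count. The sample data and the counts.tsv name are made up for illustration, and whether 17 million entries fit in RAM depends on the awk implementation and the word lengths.

```shell
# Tiny stand-in for bigfile so the sketch can be run as-is.
workdir=$(mktemp -d)
printf '%s\n' apple pear apple fig pear apple > "$workdir/bigfile"

# count[] gets one entry per distinct line, so memory tracks the number of
# unique words rather than the number of input lines.
awk '{ count[$0]++ } END { for (w in count) print count[w] "\t" w }' \
    "$workdir/bigfile" \
  | LC_ALL=C sort -k1,1nr -k2,2 > "$workdir/counts.tsv"   # highest count first

cat "$workdir/counts.tsv"
```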


# cd to the directory that contains bigfile
# -a 3 gives three-letter suffixes, so more than 676 chunks are possible
split -l 1000000 -a 3 bigfile small    ## makes smallaaa, smallaab, etc.

for files in small*
do
  echo "Sorting file $files"
  sort -u "$files" -o "$files.srt"
done

sort -m *.srt -o bigagain
sort -u bigagain > smallish

# now have words but no counts.
gawk '{key[$0]++;}' smallish bigfile   # or better yet
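
Because sort -u throws the counts away, one variation on the split approach is to count inside each chunk and then add the per-chunk counts together. This is a sketch of that variation, not the method given above. It assumes words contain no whitespace, and the chunk. prefix, the .cnt suffix and counts.tsv are hypothetical names. The chunk size is 2 lines only so the sample data yields several chunks.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' apple pear apple fig pear apple > bigfile

# 1. Split into fixed-size chunks (2 lines here; far larger in practice).
#    -a 4 allows up to 26^4 chunk names.
split -l 2 -a 4 bigfile chunk.

# 2. Count within each chunk; every output line is "<count> <word>",
#    ordered by word.
for f in chunk.*; do
  LC_ALL=C sort "$f" | LC_ALL=C uniq -c > "$f.cnt"
done

# 3. Merge the already-sorted chunk counts by word, then add up runs of
#    the same word. Only one running total is held in memory at a time.
LC_ALL=C sort -m -k2,2 chunk.*.cnt | awk '
  $2 != prev { if (NR > 1) print sum "\t" prev; prev = $2; sum = 0 }
  { sum += $1 }
  END { if (NR > 0) print sum "\t" prev }
' > counts.tsv

cat counts.tsv
```

The merge step streams its input, so unlike a single large awk array its memory use does not grow with the number of unique words.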

score 1 · Accepted Answer

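How big is the file? How many unique words do you expect? In most cases your sort | uniq solution is a good start, but obviously it stops being practical if the file is too large. A Perl script that stores each word in a hash may work for you.

This is untested and from memory, so it may have a bunch of bugs...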

use strict;
use warnings;

my %words;
open(my $in, "<", "yourfile") or die "Arrgghh file didn't open: $!";
while (my $line = <$in>) {
    chomp $line;
    $words{$line}++;    # one hash entry per distinct word
}
close($in);
for my $k (keys %words) {
    print "$k $words{$k}\n";
}
