linux - Why does "uniq" treat identical words as different?

Question

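I want to count the frequency of words in a file that has one word per line. The file is very large, so that may be the problem (in this example it has 300k lines).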

I run this command:

cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt

The problem is that the output is slightly wrong: it treats identical words as different.

For example, the first entries are:

306 continua 
278 apertura 
211 eventi 
189 murah 
182 giochi 
167 giochi

As you can see, giochi appears twice.

At the bottom of the file it gets worse, and looks like this:

  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 win 
  1 winchester 
  1 wind 
  1 wind

The same happens for all the words.

What am I doing wrong?

score 13 · Accepted Answer

Try sorting first:

cat .temp_occ | sort| uniq -c | sort -k1,1nr -k2 > distribution.txt
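As a minimal sketch of the corrected pipeline, the snippet below runs it on a small made-up sample (the temporary file and its contents are hypothetical stand-ins for .temp_occ). The trailing awk step is only there to strip the padding that uniq -c puts in front of each count.

```shell
# Hypothetical sample standing in for .temp_occ: one word per line, unsorted.
sample=$(mktemp)
printf '%s\n' giochi win giochi wind win giochi > "$sample"

# Sort so equal words become adjacent, count each run,
# then order by count (descending) and by word.
distribution=$(sort "$sample" | uniq -c | sort -k1,1nr -k2 | awk '{print $1, $2}')

printf '%s\n' "$distribution"
rm -f "$sample"
```

Each word now appears once, with its total count.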

Answered 2012-08-08T08:24:34.087

score 6 · Accepted Answer

Or use "sort -u", which also eliminates duplicates. Note that it only removes them; it does not report how often each word occurred.
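A small sketch of what "sort -u" does, using made-up input words: it sorts and drops duplicate lines in one step, so it is useful when only the list of distinct words is needed.

```shell
# Made-up input: five lines, three distinct words, duplicates not adjacent.
words=$(printf '%s\n' win giochi win giochi wind)

# sort -u sorts the lines and keeps one copy of each.
unique=$(printf '%s\n' "$words" | sort -u)

printf '%s\n' "$unique"
```

For a frequency table, "sort | uniq -c" is still required.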

Answered 2012-08-08T08:26:36.537

score 3 · Accepted Answer

The size of the file has nothing to do with what you are seeing. From the man page for uniq(1):

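Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.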

So running uniq on

a
b
a

will return:

a
b
a
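The behaviour above can be reproduced with a short sketch that feeds the same three lines through uniq with and without sorting first (the variable names are only for illustration):

```shell
# The two "a" lines are separated by "b", so they are not adjacent.
unsorted=$(printf 'a\nb\na\n' | uniq)

# After sorting, both "a" lines sit next to each other and collapse into one.
sorted=$(printf 'a\nb\na\n' | sort | uniq)

echo "uniq alone:"
printf '%s\n' "$unsorted"
echo "sort | uniq:"
printf '%s\n' "$sorted"
```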

score 1 · Accepted Answer

Is it possible that some of the words have whitespace characters after them? If so, you should remove them using something like this:

cat .temp_occ | tr -d ' ' | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
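A hedged variant of the same idea, on made-up input: instead of deleting every space, the sed expression below strips only trailing whitespace (spaces, tabs, carriage returns). The input still has to be sorted before uniq. The awk steps are only there to normalise the output for comparison.

```shell
# Made-up input: "giochi " carries a trailing space, "giochi" does not.
raw_groups=$(printf 'giochi \ngiochi\nwin\n' | sort | uniq -c | awk 'END {print NR}')

# Strip trailing whitespace first, then sort and count.
counts=$(printf 'giochi \ngiochi\nwin\n' \
  | sed 's/[[:space:]]*$//' | sort | uniq -c | awk '{print $1, $2}')

echo "groups without stripping: $raw_groups"
printf '%s\n' "$counts"
```

Without the sed step, the two spellings of giochi stay in separate groups even after sorting.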
