awk - awk: What is wrong with CJK characters? #Korean

Question

Given a .txt file containing space-separated words, such as:

But where is Esope the holly Bastard
But where is 생 지 옥 이 군
지 옥 이
지 옥
지
我 是 你 的 爸 爸 ！
爸 爸 ！ ！ ！
你 不 會 的 ！

and this pipeline, which ends in awk:

cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2" "$1}'

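I get the following output in the console. It is wrong for the Korean words, but correct for the space-separated English and Chinese words: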

생 16
Bastard 1
But 2
Esope 1
holly 1
is 2
the 1
where 2
不 1
你 2
我 1
是 1
會 1
爸 4
的 2

How can I make it work for the Korean words? Note: my real file has 300,000 lines and nearly 2 million words.

EDIT: the answer I ended up using:

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt

score 2 · Accepted Answer

A single awk script can handle this easily, and it is more efficient than your current pipeline:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file 
옥 3
Bastard 1
！ 5
爸 4
군 1
지 4
But 2
會 1
你 2
the 1
是 1
不 1
이 2
Esope 1
的 2
holly 1
where 2
생 1
我 1
is 2
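
For readers unfamiliar with the idiom, the one-liner above can be spelled out roughly as follows. This is a minimal sketch rather than the exact command from this answer: it loops over the fields of each line instead of setting RS to a regular expression, on the assumption that a multi-character RS may not be available in every awk (it is an extension in gawk and mawk, not POSIX behavior). The function name `count_words` is hypothetical.

```shell
# count_words [FILE...]: print "word count" for every whitespace-separated
# word read from the given files, or from standard input if none are given.
# (count_words is a hypothetical helper name used only for this sketch.)
count_words() {
    awk '
        # Default field splitting breaks each line on runs of blanks,
        # so every word on the line becomes one field.
        { for (i = 1; i <= NF; i++) count[$i]++ }
        # The array is keyed by the word itself; no sorting or collation
        # is involved, so distinct words stay distinct.
        END { for (word in count) print word, count[word] }
    ' "$@"
}

# Example run on three of the Korean lines from the question.
# The order of the output lines is unspecified.
printf '%s\n' '지 옥 이' '지 옥' '지' | count_words
```

Because words are counted in an associative array, the input is read once and nothing is sorted, which is probably where the efficiency gain over the `sort | uniq -c` pipeline comes from.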

If you want to store the result in another file, you can use redirection, for example:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file > outfile
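
As for why the original pipeline collapsed the Korean words: the 16 on the 생 line equals the 11 Hangul words plus the 5 fullwidth ！ in the sample, which suggests that `sort` and `uniq` treated all of them as equal. A plausible cause, not confirmed in this thread, is locale-aware collation, where these characters have no distinct collation weights in the active locale. Assuming that is the cause, forcing the C locale makes both tools compare raw bytes, so the original pipeline can be kept. The function name `count_words_c_locale` is hypothetical.

```shell
# count_words_c_locale: the pipeline from the question, reading standard
# input, with sort and uniq forced to byte-wise comparison via LC_ALL=C.
# (count_words_c_locale is a hypothetical helper name for this sketch.)
count_words_c_locale() {
    tr ' ' '\n' |           # one word per line
        LC_ALL=C sort |     # group identical byte sequences together
        LC_ALL=C uniq -c |  # count each group without locale collation
        awk '{ print $2 " " $1 }'   # swap to "word count"
}

# Example run on a few lines from the question.
printf '%s\n' '지 옥 이' '지 옥' '지' '爸 爸 ！' | count_words_c_locale
```

On the real data this would be invoked as `count_words_c_locale < /pathway/to/your/file.txt`. The output is ordered by byte value rather than by any language-specific alphabet.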
