regex - awk: How to process multiple .txt files in a folder and its subfolders?

Question

Given a folder whose subfolders contain multilingual .txt files, such as:

But where is Esope the holly Bastard
But where is 생 지 옥 이 군
지 옥 이
지 옥
지
我 是 你 的 爸 爸 ！
爸 爸 ！ ！ ！
你 不 會 的 ！

$ grep -o '\w*' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myoutput.txt

elegantly produces:

1 생
1 군
1 Bastard
1 Esope
1 holly
1 the
1 不
1 我
1 是
1 會
2 이
2 But
2 is
2 where
2 你
2 的
3 옥
4 지
4 爸
5 ！

How can I change the code to process multiple files in a folder and its subfolders, all matching a similar pattern (at least *.txt)?

score 4 · Accepted Answer

You can use the find command, like this:

find . -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort

I'm using the -exec option to cat every *.txt file in the current directory and its subdirectories. The output is then piped into your grep|awk|sort pipeline.
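For reuse, the same idea can be wrapped in a small shell function. This is only a sketch: the name count_words_in_tree is made up for illustration, `-exec cat {} +` is swapped in for `-exec cat {} \;` so that cat is started once per batch of files rather than once per file, and `[[:alnum:]_]+` is used as a rough stand-in for `\w*`. It assumes a find that supports -iname and a grep that supports -o (both GNU and BSD tools do).

```shell
# Hypothetical helper: count word frequencies over every .txt file below a directory.
count_words_in_tree() {
    dir=${1:-.}   # directory to scan, defaults to the current one
    # -type f skips directories that happen to be named *.txt;
    # "{} +" hands many file names to each cat invocation.
    find "$dir" -type f -iname '*.txt' -exec cat {} + |
        grep -oE '[[:alnum:]_]+' |
        awk '{ count[$1]++ } END { for (word in count) print count[word], word }' |
        sort -n
}
```

Usage would be something like `count_words_in_tree path/to/folder > myoutput.txt`. The `sort -n` orders lines by count, so the most frequent words end up at the bottom.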

score 1 · Accepted Answer

Using a glob is enough.

awk '{a[$1]++}END{for(k in a)print a[k],k}' *.txt | sort > out.txt

Or, to support a recursive directory structure, you need to enable the globstar option and use **:

shopt -s globstar
awk '{a[$1]++}END{for(k in a)print a[k],k}' **/*.txt | sort > out.txt

You will need to work out the awk way of doing something similar to grep -o '\w*', e.g. (the /[[:alpha:]]+/ part):

awk '/[[:alpha:]]+/{print $0}' *.txt
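One possible way to do the tokenizing inside awk itself is to loop over each line with match(), peeling off one word at a time. The sketch below is an assumption about what "the awk way" could look like, not something from the answer above: count_words_awk is a hypothetical name, and `[[:alnum:]_]+` only approximates `\w*`. How non-ASCII characters are classified depends on the awk implementation and the locale, so the multilingual sample may count differently than with grep.

```shell
# Hypothetical helper: tokenize and count inside awk, without grep.
count_words_awk() {
    awk '
    {
        line = $0
        # Repeatedly take the leftmost run of word characters off the line.
        while (match(line, /[[:alnum:]_]+/)) {
            count[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END { for (word in count) print count[word], word }
    ' "$@" | sort -n
}
```

It can be called with file names, as in `count_words_awk *.txt`, or with no arguments to read standard input, for example at the end of a find ... -exec cat pipeline.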
