I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
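I want to generate a two-column list. The first column shows each word that appears, and the second column shows how often it appears, for example: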
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
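- To make this easier, I will strip all punctuation and convert all text to lowercase before processing the list.
- Unless there is a simple solution, "words" and "word" can be counted as two separate words.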
So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
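For some reason, this only shows "0" after every word.
How can I generate a list of every word that appears in the file, along with its frequency?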
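
For reference, the output described above can probably be produced without a per-word grep loop, using a single pipeline of standard tools. The following is only a minimal sketch, not a diagnosis of the "0" problem: the function name word_freq is made up for illustration, and it assumes the punctuation stripping and lowercasing mentioned in the notes can happen inside the pipeline. The result is sorted alphabetically, not in order of first appearance.

```shell
# Hypothetical helper (name chosen for illustration): reads text on stdin
# and prints one "word@count" line per distinct word.
# Steps, in pipeline order:
#   1. delete punctuation
#   2. convert upper case to lower case
#   3. turn every run of whitespace into a single newline (one word per line)
#   4. drop any empty lines
#   5. sort so identical words are adjacent
#   6. collapse adjacent duplicates and prefix each word with its count
#   7. reorder the "count word" output of uniq into "word@count"
word_freq() {
    tr -d '[:punct:]' |
    tr '[:upper:]' '[:lower:]' |
    tr -s '[:space:]' '\n' |
    grep -v '^$' |
    sort |
    uniq -c |
    awk '{ print $2 "@" $1 }'
}
```

Assuming the input is still in its original form (not already rewritten in place by sed -i), it could be used as: word_freq < file1.txt > file2.txt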