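I have an ASCII text file. I want to use one or more Ubuntu commands to generate a list of all the "words" in that file. A word is defined as an alphanumeric sequence between delimiters. The delimiter is whitespace by default, but I also want to experiment with other characters, such as punctuation. In other words, I want to be able to specify the set of delimiter characters. How do I produce only the set of unique words? And what if I also want to list only the words that are at least N characters long?
3 Answers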
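You can use grep:
-E '\w+' searches for words (runs of letters, digits, and underscores)
-o prints only the part of each line that matches
% cat temp
Some examples use "The quick brown fox jumped over the lazy dog" rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit" for example text.
If you don't care whether words repeat: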
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
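If you want each word printed only once, ignoring case, you can pipe the output through sort:
-u prints each word only once
-f tells sort to ignore case when comparing words
If you want each word only once: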
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
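Since the question defines a word as an alphanumeric sequence, note that \w also matches the underscore. A minimal sketch of a stricter variant, assuming a POSIX bracket expression is acceptable; the function name alnum_words is hypothetical and used only for illustration:

```shell
# Hypothetical helper: print runs of letters and digits only, one per line.
# Unlike '\w+', the class [[:alnum:]] does not include the underscore.
alnum_words() {
  grep -oE '[[:alnum:]]+' "$@"
}

# Prints foo, bar, baz, qux and 42 on separate lines;
# with '\w+' the first match would be foo_bar instead.
printf 'foo_bar baz-qux 42\n' | alnum_words
```

Pass a file name as an argument (alnum_words temp) to read from a file instead of standard input.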
You can also use the tr command:
echo the quick brown fox jumped over the lazydog | tr -cs 'a-zA-Z0-9' '\n'
the
quick
brown
fox
jumped
over
the
lazydog
The -c takes the complement of the specified characters; the -s squeezes runs of the replacement character into one; 'a-zA-Z0-9' is the set of alphanumerics, and any character you add to it no longer acts as a delimiter (see the next example); '\n' is the replacement character (newline).
echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9-' '\n'
the
quick
brown
fox
jumped
over
the
lazy-dog
Because '-' was added to the set of non-delimiters, lazy-dog is printed as one word. Otherwise the output is:
echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9' '\n'
the
quick
brown
fox
jumped
over
the
lazy
dog
Summary for tr: any character not in the set given with -c acts as a delimiter. I hope this solves your delimiter problem too.
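The approach above lists the characters that are part of a word. If you would rather list the delimiters themselves, as the question asks, something like the following may work. It is a sketch that assumes GNU tr as shipped on Ubuntu (which pads the shorter second set with its last character); the function name split_words is hypothetical:

```shell
# Hypothetical helper: split standard input into one word per line.
# $1 is the set of delimiter characters. Characters that are special
# to tr (such as '-', '\' and '[') need care when placed in this set.
split_words() {
  tr -s "$1" '\n'
}

# Space, comma and semicolon act as delimiters here.
# Prints alpha, beta, gamma and delta on separate lines.
printf 'alpha,beta;gamma delta\n' | split_words ' ,;'
```

Each delimiter is translated to a newline, and -s then squeezes consecutive newlines so that adjacent delimiters do not produce empty lines in the middle of the output.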
Here's my word-cloud-like chain, which counts how often each word occurs and lists the most frequent first:
cat myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr
If you have a TeX file, replace cat with detex:
detex myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr
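The same chain can be wrapped in a small function so that it reads either a file or standard input. This is only a sketch; the name word_freq is hypothetical:

```shell
# Hypothetical helper: count word occurrences, most frequent first.
# Arguments (if any) are passed to grep as file names; with no
# arguments it reads standard input.
word_freq() {
  grep -oE '\w+' "$@" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
}

# "the" occurs twice after lowercasing, so it is listed first.
printf 'The quick brown fox jumped over the lazy dog\n' | word_freq | head -n 3
```

With this, word_freq myfile replaces the cat version, and detex myfile | word_freq replaces the detex version.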
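This should work for you (it reads the text on standard input):
tr \ \\t\\v\\f\\r \\n | tr -s \\n | tr -dc a-zA-Z0-9\\n | LC_ALL=C sort | uniq
If you want only the words that are at least five characters long, pipe the output of tr through grep ..... (five dots, each matching one character). If you want the result to be case-insensitive, put tr A-Z a-z somewhere in the pipeline before sort.
Note that LC_ALL=C is required for sort to work properly here.
I'd suggest reading the man pages for any commands here that you don't understand.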
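To make the minimum length a parameter rather than a fixed number of dots, the pieces above can be combined roughly as follows. This is a sketch under the assumption that words are ASCII alphanumerics and that case should be ignored; the function name uniq_words_min_len is hypothetical:

```shell
# Hypothetical helper: unique lowercase words of at least N characters.
# $1 is the minimum length N (a positive integer).
uniq_words_min_len() {
  tr -cs 'a-zA-Z0-9' '\n' |      # every non-alphanumeric run becomes one newline
    tr 'A-Z' 'a-z' |             # fold to lowercase so The and the are one word
    LC_ALL=C sort -u |           # sort bytewise and drop duplicates
    grep -E "^.{$1,}\$"          # keep lines with N or more characters
}

# Prints brown, jumped and quick on separate lines.
printf 'The quick brown fox jumped over the lazy dog\n' | uniq_words_min_len 5
```

The interval expression .{N,} matches N or more characters, so it does the same job as writing N dots by hand.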