regex - How can I use egrep to list the words that match a regular expression?

Question

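I need to use egrep to count the words that contain a string matching a regular expression. For example, I need to do something like "count the number of words that contain three consecutive vowels" (not exactly that, but that's the gist of it).

I've figured out how to count the lines that contain such words, but when I add the -w flag I get the error egrep: illegal option -- w.

Here is the regular expression I used to count lines in the scenario above, and it seems to work: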

egrep -i -c '[aeiou][aeiou][aeiou]' full.html

Using the -w flag with this command produces the error listed above, even when I add \b markers around the regular expression. For example:

egrep -i -c -w '\b.*[aeiou][aeiou][aeiou].*\b' full.html

What am I doing wrong?

Edit: I'm running this in a terminal on Solaris 10.

score 5 · Accepted Answer

You can also get a count of the matching words this way:

grep --color -Eow '[aeiou][aeiou][aeiou]' filename | wc -l

or

egrep -ow '[aeiou][aeiou][aeiou]' filename | wc -l

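-o prints only the matched text, one match per line.

-w restricts matches to whole words, so the pattern has to cover the entire word.

Finally, wc -l prints the number of matches, which is the word count.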
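To make the effect of -w concrete, here is a minimal sketch that assumes GNU grep (-o and -w are not available in every egrep, which is what the Solaris error in the question points to). The sample text and the variable names are made up for illustration. With -w, the bare three-vowel pattern only matches when the vowels form a word by themselves; allowing optional letters on both sides lets the match span the whole word.

```shell
# Hypothetical sample text; the question's real input is full.html.
text='a beautiful queue iou'

# Bare pattern with -w: only a standalone three-vowel word ("iou") qualifies.
strict=$(printf '%s\n' "$text" |
    grep -Eoiw '[aeiou][aeiou][aeiou]' | wc -l | tr -d ' ')

# Widened pattern: optional letters on each side, so -w can match full words.
containing=$(printf '%s\n' "$text" |
    grep -Eoiw '[a-z]*[aeiou][aeiou][aeiou][a-z]*' | wc -l | tr -d ' ')

echo "strict=$strict containing=$containing"
```

On this sample the first count should be 1 and the second 3 (beautiful, queue, iou).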

score 1 · Accepted Answer

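You will have to consult your Solaris man pages to find out whether your egrep supports any, all, or some of the GNU-like extensions.

Does your system have /usr/xpg4/bin? If so, make sure your MANPATH includes /usr/xpg4/man. That directory used to hold the most up-to-date versions, short of adding something like an /opt/gnu install.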
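One way to act on this, as a rough sketch: prefer the grep in /usr/xpg4/bin when it exists and fall back to whatever grep is on the PATH otherwise. The GREP variable and the sample input are names introduced here for illustration, and the command sticks to -E, -c and -i rather than assuming any GNU-only flags.

```shell
# Pick the XPG4 grep if the directory mentioned above is present.
if [ -x /usr/xpg4/bin/grep ]; then
    GREP=/usr/xpg4/bin/grep
else
    GREP=grep
fi

# Count matching lines in a small hypothetical sample.
xpg_count=$(printf '%s\n' 'a beautiful queue' 'nothing' |
    "$GREP" -E -ci '[aeiou][aeiou][aeiou]')
echo "$xpg_count"
```

Which extra options that grep accepts still has to be checked against its man page.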

In any case, your regular expression '\b.*[aeiou][aeiou][aeiou].*\b' reads to me as...

1 word-boundary
followed by any number of any chars (including blanks and vowels) 
followed by three vowels, 
followed by any number of any chars (including blanks and vowels), 
followed by 1 word-boundary.

Probably not what you really want.
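The reading above can be checked with a small sketch, assuming GNU grep (\b and -o are extensions that the Solaris egrep may lack). The sample line and the variable name are invented for illustration. Because .* also matches blanks, the match should stretch across the whole line rather than stopping at a single word.

```shell
# -o prints just the matched text, which shows how far the pattern reaches.
whole=$(printf '%s\n' 'a beautiful queue' |
    grep -oE '\b.*[aeiou][aeiou][aeiou].*\b')
echo "$whole"
```

With GNU grep this should print the entire line, a beautiful queue, as one match.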

To meet your need for words with 3 consecutive vowels, spelling the regex out longhand in the old-school style, try

 egrep -i -c '[a-z]*[aeiou][aeiou][aeiou][a-z]*' full.html

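This says: match any number (including none) of chars in [a-z] before the 3 vowels, followed by any number (including none) of chars in [a-z]. So a space character will not match [a-z]. You are using -i to ignore case, so you don't have to use [A-Za-z]. Obviously, if you find other characters that you want to treat as word characters, maybe the '_' char, add them to both sides.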
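If the goal is to list the matching words without relying on -o or -w, one possible sketch is to split the input into one word per line first and then apply the plain three-vowel pattern. The sample input and the variable name are hypothetical, and the tr range syntax used here is the POSIX form; on Solaris the tr in /usr/xpg4/bin may be needed for it.

```shell
# Turn every run of non-letters into a newline (one word per line),
# then keep only the words with three consecutive vowels, ignoring case.
listed=$(printf '%s\n' 'A beautiful queue,' 'quiet zoo.' |
    tr -cs 'A-Za-z' '\n' |
    grep -i '[aeiou][aeiou][aeiou]')
echo "$listed"
```

On this sample the output should be beautiful, queue and quiet, one per line. Piping the same output into wc -l gives a word count instead of a list.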

Sorry, but I'm going from memory; I don't work in a Solaris shop and have no way to test there.

Edit

Also note that the grep man page on my current system says

  -c, --count
          Suppress normal output; instead print a count of matching  lines
          for  each  input  file.  With the -v, --invert-match option (see
          below), count non-matching lines.

Note that this is a count of matching lines, not a count of matches.
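A small sketch of that difference, using made-up sample lines: three words contain three consecutive vowels, but they sit on only two lines, so -c reports the line count.

```shell
# "beautiful" and "queue" share a line, "quiet" is on another, "nothing" has no match.
line_count=$(printf '%s\n' 'a beautiful queue' 'quiet zoo' 'nothing' |
    grep -ci '[aeiou][aeiou][aeiou]')
echo "$line_count"
```

This should print 2, not 3.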

It might be easier to use

  awk '{for (i=1;i<=NF;i++){if ($i ~ /[aeiou][aeiou][aeiou]/) cnt++}} END{print "count="cnt}' file
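The awk approach is case-sensitive, unlike the egrep -i commands earlier. A hedged variant that lowercases each field before testing it might look like the sketch below; the sample input and variable name are invented, and tolower needs a POSIX awk (on Solaris that likely means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk).

```shell
# Count whitespace-separated fields that contain three consecutive vowels,
# ignoring case. cnt + 0 forces a numeric 0 when nothing matches.
awk_count=$(printf '%s\n' 'A BEAUTIFUL queue' 'quiet zoo' |
    awk '{ for (i = 1; i <= NF; i++)
               if (tolower($i) ~ /[aeiou][aeiou][aeiou]/) cnt++ }
         END { print cnt + 0 }')
echo "$awk_count"
```

Note that fields are split on whitespace only, so attached punctuation or HTML markup counts as part of a word.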

IHTH

score 0 · Accepted Answer

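I believe egrep doesn't support \b as a word boundary. Try \< for the start-of-word boundary and \> for the end-of-word boundary.

Edit
Hmm... never mind. According to the man page, \b is supported.

Actually, I think the answer is that only grep supports the -w option. I don't think egrep does. http://ss64.com/bash/egrep.html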

score 0 · Accepted Answer

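Which platform and which version of egrep?

The -w option works for me (CentOS, and a Mac with GNU egrep); see below. Also, \b works as expected; see below.

In addition, I used a different regular expression; see below.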

$ grep --version
grep (GNU grep) 2.5.1

$ cat test.txt 
this and that iou and eai
not this aaih
not this haai

$ egrep -i -w '[aeiou]{3}' test.txt 
this and that iou and eai

# with no -w
$ egrep -i '\b[aeiou]{3}\b' test.txt
this and that iou and eai

# with neither -w nor {3}
$ egrep -i '\b[aeiou][aeiou][aeiou]\b' /tmp/test.txt 
this and that iou and eai

# using '\<' and '\>' works as well for word boundaries
$ egrep -i '\<[aeiou][aeiou][aeiou]\>' /tmp/test.txt 
this and that iou and eai
