linux - How do I parse words in awk?

Question

I would like to know how to parse a passage like the one below:

Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
And many other lines with text that I do not need

                                    * * * * * * *

Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.

CPL - 

  1. Combined Programming Language.  U Cambridge and U London.  A very
complex language, syntactically based on ALGOL-60, with a pure functional
subset. 

Modula-3* - Incoprporation of Modula-2* ideas into Modula-3.  "Modula-3*:

so that I get the following output from an awk statement:

Autolisp
CPL
Modula-3*

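I have tried the statements below because the file I need to filter is huge. It is a list of all programming languages that exist to date, but basically every line follows the same pattern as above.

Statements I have tried so far: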

BEGIN{$0 !~ /^ / && NF == 2 && $2 == "-"} { print $1 }

BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"} /^FLIP -/{print $1,$3}

BEGIN{RS=""; FS=OFS="\n"} {print $1 NF-1}

BEGIN{NF == 2 && $2 == "-" } { print $1 }

BEGIN { RS = "" } { print $1 }

The statements that have come closest to working for me so far are:

BEGIN { RS = "\n\n"; FS = " - " }
{ print $1 }

awk -F " - " "/ - /{ print $1 }" file.txt

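But they still print lines I do not need, or skip lines I do need.

Thanks for your help and replies! I have been racking my brain over this for a few days, since I am a newbie at AWK programming.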

score 3 · Accepted Answer

The default FS should be fine. To avoid any duplicate lines, you can pipe the output to sort -u:

$ gawk '$2 == "-"  { print $1 }' file | sort -u
Autolisp
CPL
Modula-3*

It may not filter out everything you want, but you can keep adding rules until the bad data is filtered out.
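As one illustration of adding a rule, here is a minimal sketch that also rejects indented lines, so that a continuation line which happens to have "-" as its second field is skipped. The sample text (including the indented "see - ..." line) is made up for the demonstration, and the extra rule is only an assumption about what counts as bad data in your file.

```shell
# Hypothetical sample input, for illustration only
sample='Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.

CPL -

  1. Combined Programming Language.  U Cambridge and U London.
  see - this indented line should not match

Modula-3* - Incorporation of Modula-2* ideas into Modula-3.'

# Rule 1: the line must not start with whitespace
# Rule 2: the second whitespace-separated field must be a lone "-"
result=$(printf '%s\n' "$sample" |
  awk '!/^[[:space:]]/ && $2 == "-" { print $1 }' | sort -u)
printf '%s\n' "$result"
```

Further conditions can be chained with && in the same way as you find more lines that slip through.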

Alternatively, you can avoid using sort by using an associative array:

$ gawk '$2=="-" { arr[$1] } END { for (key in arr) print key}' file 
Autolisp
CPL
Modula-3*
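Note that awk does not specify the order in which for (key in arr) visits the keys, so the names may not come out in file order. If the order of first appearance matters, a common idiom is to print a name only the first time it is seen. A minimal sketch, with made-up sample input and an array name (seen) chosen for illustration:

```shell
# Hypothetical sample input with a repeated name, for illustration only
sample='CPL - first entry
Autolisp - Dialect of LISP
CPL - second entry with the same name
Modula-3* - another entry'

# seen[$1]++ evaluates to 0 (false) the first time a name appears, so
# !seen[$1]++ is true only for the first occurrence; later duplicates are skipped.
result=$(printf '%s\n' "$sample" |
  awk '$2 == "-" && !seen[$1]++ { print $1 }')
printf '%s\n' "$result"
```

This keeps everything in a single awk pass and needs no END block, at the cost of holding every distinct name in memory, just like the array version.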

score 1 · Answer

If it does not have to be done with awk, you could first use grep to select lines of the right form and then use sed to trim off the end, like this:

grep -e '^.* -' file.txt | sed -e 's/^\(.*\) -.*$/\1/'

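Edit: after playing with awk a bit, it looks like part of your problem is that you do not always have '[languagename] - [stuff]' but sometimes '[languagename] -\n[stuff]', as is the case with CPL in the sample text, so FS=" - " will not split on lines like that.

Also, one possible attempt is the following: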
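One way to cover both shapes is to search with match() instead of relying on FS. This is only a sketch, under the assumption that a language name is everything before the first " -" that is followed by either a space or the end of the line; the sample input is abbreviated from the question for illustration.

```shell
# Hypothetical sample input, for illustration only
sample='Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.

CPL -

  1. Combined Programming Language.  U Cambridge and U London.

Modula-3* - Incorporation of Modula-2* ideas into Modula-3.'

# The regex matches " - " in the middle of a line and " -" at the end of a line.
# RSTART is where that match begins, so the name is the first RSTART-1 characters.
# The leading guard skips indented continuation lines.
result=$(printf '%s\n' "$sample" |
  awk '!/^[[:space:]]/ && match($0, / -( |$)/) { print substr($0, 1, RSTART - 1) }')
printf '%s\n' "$result"
```

Because the name is cut out of the whole line rather than taken from $1, this would also keep names that contain spaces.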

BEGIN { r = "^.* -"; }
{
    if (match($0, r)) {
        printf("%s\n", substr($0, 1, RSTART + RLENGTH - 3));
    }
}

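I do not actually know much about awk, but this is my best guess at replicating what grep and sed do above. At the very least, it does seem to work on the sample text you provided.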
