windows - awk gsub problem (gawk)

Question

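I need to search a text file for a string and replace it with text that contains a number, where the number increments with each match.

The string to be "found" could be a single character, a word, or a phrase.

The replacement expression will not always be the same (as in the examples below), but it will always include an incrementing number (a variable).

For example:

1) I have a test file named "data.txt". The file contains: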

Now is the time
for all good men
to come to the
aid of their party.

2) I put the awk script in a file named "cmd.awk". The file contains:

/f/ {sub ("f","f(" ++j ")")}1

3) I run awk like this:

awk -f cmd.awk data.txt

In this case, the output is as expected:

Now is the time
f(1)or all good men
to come to the
aid of(2) their party.

The problem comes when there is more than one match on a line. For example, if I search for the letter "i":

/i/ {sub ("i","i(" ++j ")")}1

The output is:

Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.

This is wrong, because it does not number the "i" in "time" or in "their".

So I tried "gsub" instead of "sub":

/i/ {gsub ("i","i(" ++j ")")}1

The output is:

Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.

Now it replaces every occurrence of the letter "i", but the inserted number is the same for all matches on the same line.

The desired output is:

Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.

Note: the number will not always start at "1", so I might run awk like this:

awk -f cmd.awk -v j=26 data.txt

To get the output:

Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.

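To be clear, the number in the replacement will not always be inside parentheses, and the replacement will not always include the matched string (in fact, that would be quite rare).

The other problem I am having is this:

I want to use an awk variable (not an environment variable) for the "search string", so that I can specify it on the awk command line.

For example:

1) I put the awk script in a file named "cmd.awk". The file contains something like: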

/??a??/ {gsub (a,a "(" ++j ")")}1

2) I would run awk like this:

awk -f cmd.awk -v a=i data.txt

To get the output:

Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.

The question here is: how do I represent the variable "a" in the /search/ expression?

score 2 · Accepted Answer

An awk version:

awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i

Answered 2013-02-19T13:55:58.110
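
As a rough illustration of why the one-liner works: with FS set to the search text, every field after the first is preceded by exactly one match, so prefixing fields 2..NF with the counter numbers the matches, and OFS puts the search text back when awk rebuilds the line. The sketch below wraps this in a shell function. The name number_by_fs and the start-value argument are assumptions made for this example, not part of the answer. It also assumes the search text is a plain string: a multi-character FS is treated as a regular expression, and a single space is special-cased by awk.

```shell
# Hypothetical helper: number every occurrence of a fixed string by
# using that string as both the input and the output field separator.
number_by_fs() {
    # $1 = literal text to count
    # $2 = value the counter holds before the first match (default 0)
    awk -v k="${2:-0}" -v FS="$1" -v OFS="$1" '
        # Each field from the second onward directly follows one match.
        { for (i = 2; i <= NF; i++) $i = "(" ++k ")" $i }
        # Print the line; awk rejoins modified fields with OFS.
        1
    '
}

printf '%s\n' 'Now is the time' 'for all good men' \
              'to come to the' 'aid of their party.' |
    number_by_fs i 26
```

With the sample data this prints the output the question asks for (i(27), i(28) on the first line, i(29), i(30) on the last). Lines without a match are never rebuilt, so they pass through unchanged.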

score 2 · Accepted Answer

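gensub() sounds ideal here, since it lets you replace the Nth match. So what sounds like a solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times); see below.

So in awk, lacking perl's "s///e" evaluation feature and its stateful regex /g modifier (as used by Steve), the best remaining option is to break the line into chunks (head, match, tail) and stick them back together: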

BEGIN { 
    if (j=="") j=1
    if (a=="") a="f"
}
match($0,a) { 
    str=$0; newstr=""
    do {
         newstr=newstr substr(str,1,RSTART-1) # head
         mm=substr(str,RSTART,RLENGTH)        # extract match
         sub(a,a"("j++")",mm)                 # replace
         newstr=newstr mm 
         str=substr(str,RSTART+RLENGTH)       # tail
    } while (match(str,a))
    $0=newstr str     
}
{print}

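This uses match() as an expression rather than a // pattern, so you can use variables. (You could also just use "($0 ~ a) { ... }", but this code relies on the RSTART and RLENGTH values that match() sets, so don't try that here.)

You can define j and a on the command line.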
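
The question also mentions that the replacement usually does not contain the matched text. A minimal sketch of a variant of the head/match/tail loop that handles this is shown below. It builds the replacement by plain concatenation instead of calling sub(), so the matched text is simply dropped unless you put it back yourself. The wrapper name replace_numbered and the variables pre and post are assumptions for this example. The counter uses ++j, as in the question, so -v j=26 starts numbering at 27. Note that awk processes backslash escapes in -v values, so backslashes in the regex or the replacement text need doubling.

```shell
# Hypothetical wrapper around a head/match/tail loop.
# "pre" and "post" are made-up names for the text placed around the number.
replace_numbered() {
    # $1 = regex to search for      $2 = text before the number
    # $3 = text after the number    $4 = counter value before the first match (default 0)
    awk -v a="$1" -v pre="$2" -v post="$3" -v j="${4:-0}" '
        {
            out = ""; rest = $0
            # match() accepts a dynamic regex, so "a" can come from -v.
            # The RLENGTH test stops on empty matches to avoid looping forever.
            while (a != "" && match(rest, a) && RLENGTH > 0) {
                out  = out substr(rest, 1, RSTART - 1) pre (++j) post  # head + replacement
                rest = substr(rest, RSTART + RLENGTH)                  # tail, scanned next
            }
            print out rest
        }
    '
}

# Replacement that keeps the matched letter, as in the question:
printf '%s\n' 'Now is the time' | replace_numbered i 'i(' ')' 26
# Replacement that does not contain the matched text at all:
printf '%s\n' 'aid of their party.' | replace_numbered i '<' '>'
```

The first call prints "Now i(27)s the ti(28)me" and the second prints "a<1>d of the<2>r party.". Because replaced text is moved into out and never rescanned, a replacement that itself contains the search text cannot cause repeated matches.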

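gawk supports \y, which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word. Just take care to add extra escapes when running from a unix command line (I'm not quite sure what Windows might require or permit).

Limited gensub() version

As referenced above: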

match($0,a) {
    idx=1; str=$0
    do {
        prev=str
        str=gensub(a,a"(" j ")",idx++,prev)
    } while (str!=prev && j++)
    $0=str
}

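The problems here are:

If you replace the substring "i" with the substring "k" or "k(1)", then the gensub() index for the next match will be off by 1. You can work around this if you know it in advance, or if you work backward through the string instead.
If you replace the substring "i" with the substring "ii" or "ii(i)", a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match).

Handling both conditions robustly is not worth the code.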

score 1 · Accepted Answer

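I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.

To include a count of the letter i beginning at 26, try: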

perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt

This could also be a shell variable:

var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt

Results:

Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.

To count only a specific word, add word boundaries (i.e. \b) around the word. Try:

perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt

Results:

Now is the(6) time
for all good men
to come to the(7)
aid of their party.
