regex - Splitting a string with a regular expression and keeping the delimiters in awk

Question

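As far as I know, if I want to split a string on a regular expression and keep the delimiters in Perl, JavaScript, or PHP, I should use a capturing group (parentheses) in the regex. For example, in Perl (here I want to split on a digit followed by a closing parenthesis):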

$ echo -e "123.123   1)  234.234\n345.345   0)  456.456" \
| perl -ne 'print join("--", split(/(\d\))/,$_));'
123.123   --1)--  234.234
345.345   --0)--  456.456

I'm trying the same trick in awk, but it doesn't seem to work: the delimiters are still "eaten", even though the regex uses a capturing group:

$ echo -e "123.123   1)  234.234\n345.345   0)  456.456" \
| awk '{print; n=split($0,a,/([0-9]\))/);for(i=1;i<=n;i++){print i,a[i];}}'
123.123   1)  234.234
1 123.123   
2   234.234
345.345   0)  456.456
1 345.345   
2   456.456

Can awk be forced to keep the delimiter matches in the array produced by split()?

score 3 · Accepted Answer

You can use split() in gawk, for example:

echo -e "123.123   1)  234.234\n345.345   0)  456.456" |
gawk '{
    nf = split($0, a, /[0-9]\)/, seps)
    for (i = 1; i < nf; ++i) printf "%s--%s--", a[i], seps[i]
    print a[i]
}'

Output:

123.123   --1)--  234.234
345.345   --0)--  456.456

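The GNU awk (gawk) version of split() accepts an additional, optional array-name argument. If it is present, the matched delimiters are saved in that array.

As stated in the gawk manual: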

split(s, a [, r [, seps] ])

Split the string s into the array a and the separators array seps on the regular expression r, and return the number of
fields.  If r is omitted, FS is used instead.  The arrays a and seps are cleared first.  seps[i] is the field separator
matched by r between a[i] and a[i+1].  If r is a single space, then leading whitespace in s goes into the extra array element
seps[0] and trailing whitespace goes into the extra array element seps[n], where n is the return value of split(s, a, r,
seps).  Splitting behaves identically to field splitting, described above.
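
If gawk is not available, the four-argument split() cannot be used. A minimal sketch of a portable fallback, assuming only POSIX awk features (match(), RSTART, RLENGTH, substr()), is to walk the line match by match and emit each delimiter explicitly. The function name keep_delims_posix is made up for this illustration and is not part of the original answer:

```shell
# Hypothetical helper: wrap every "digit followed by )" in "--...--"
# using only POSIX awk features (no gawk extensions).
keep_delims_posix() {
    awk '{
        rest = $0
        out = ""
        # match() sets RSTART/RLENGTH to the position and length of the first match
        while (match(rest, /[0-9][)]/)) {
            out = out substr(rest, 1, RSTART - 1) "--" substr(rest, RSTART, RLENGTH) "--"
            rest = substr(rest, RSTART + RLENGTH)   # continue after the delimiter
        }
        print out rest
    }'
}

printf '%s\n' '123.123   1)  234.234' '345.345   0)  456.456' | keep_delims_posix
```

For the sample input this should print the same two lines as the gawk version above. The delimiter is written as [0-9][)] so the closing parenthesis needs no backslash escaping.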

score 0 · Accepted Answer

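As @konsolebox mentioned, you can use split() with newer gawk versions to save the field separator values. You could also look at FPAT and patsplit(). Another option is to set RS to what is currently your FS and then use RT.

That said, I don't see why you would consider a solution involving field separators when you can solve the problem you posted with nothing more than gensub() in gawk: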

$ echo -e "123.123   1)  234.234\n345.345   0)  456.456" |
gawk '{print gensub(/[[:digit:]]\)/,"--&--",1)}'
123.123   --1)--  234.234
345.345   --0)--  456.456

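If you really are trying to solve some other problem that requires remembering the FS values, let us know and we can point you in the right direction.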
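
For reference, the patsplit() and RS/RT alternatives mentioned at the top of this answer might look roughly like the sketch below. Both require gawk (patsplit() and RT are gawk extensions), and the function names keep_delims_patsplit and keep_delims_rt are hypothetical names chosen for this illustration:

```shell
# Hypothetical helper 1: patsplit() stores the MATCHES in m[] and the text
# between them in gaps[]; gaps[0] is the text before the first match.
keep_delims_patsplit() {
    gawk '{
        n = patsplit($0, m, /[0-9][)]/, gaps)
        out = gaps[0]
        for (i = 1; i <= n; ++i)
            out = out "--" m[i] "--" gaps[i]
        print out
    }'
}

# Hypothetical helper 2: use the delimiter as the record separator RS;
# gawk then puts the text that actually matched RS into RT.
keep_delims_rt() {
    gawk -v RS='[0-9][)]' '{
        printf "%s", $0                     # record text, newlines included
        if (RT != "") printf "--%s--", RT   # RT is empty after the last record
    }'
}
```

Either function can be fed the sample input, for example with printf '%s\n' '123.123   1)  234.234' '345.345   0)  456.456' | keep_delims_patsplit. Note that the RT variant treats the whole input as one stream, so newlines are ordinary record content rather than separators.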
