regex - How to use sed, awk, or gawk to print only what is matched?

Question

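I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.

But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:

Example regular expression: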

.*abc([0-9]+)xyz.*

Example input file:

a
b
c
abc12345xyz
a
b
c

As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, from within my bash script, is:

myvalue=$( sed <...something...> input.txt )

Things I've tried include:

sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing

score 45 · Accepted Answer

My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:

sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt

To match at least one numeric character without +, I would use:

sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
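To tie this back to the myvalue=$( ... ) form the question asks for, here is a minimal sketch. The temporary file and the variable name input are assumptions made for illustration; the sed command is the one shown above.

```shell
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# -n turns off automatic printing; the trailing p prints a line only when
# the substitution succeeded, so non-matching lines never reach stdout.
myvalue=$(sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' "$input")

echo "myvalue=$myvalue"
rm -f "$input"
```

If several lines match, myvalue will hold one extracted value per line, separated by newlines.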

score 39 · Accepted Answer

You can use sed to do this

 sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'

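-n suppress automatic printing of each line
-r use extended regular expressions, so you don't have to escape the capture group parens () (GNU sed; BSD/macOS sed spells this -E)
\1 the capture group match
/g global match
/p print the result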
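The interaction between -n and /p also explains the two failed attempts in the question. The sketch below runs the same substitution three ways; it assumes a sed that accepts -E as the spelling of -r (GNU and BSD sed both do), and the variable names are only illustrative.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# No -n and no p: every line is printed, whether or not it matched.
all_lines=$(sed -E 's/.*abc([0-9]+)xyz.*/\1/' "$input")

# -n without p: automatic printing is off and nothing requests output.
nothing=$(sed -n -E 's/.*abc([0-9]+)xyz.*/\1/' "$input")

# -n with p: only lines where the substitution happened are printed.
only_match=$(sed -n -E 's/.*abc([0-9]+)xyz.*/\1/p' "$input")

printf 'no -n, no p : %s\n' "$(printf '%s' "$all_lines" | tr '\n' ' ')"
printf '-n, no p    : [%s]\n' "$nothing"
printf '-n and p    : %s\n' "$only_match"
rm -f "$input"
```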

I wrote a tool for myself that makes this easier

rip 'abc(\d+)xyz' '$1'

score 18 · Accepted Answer

I use perl to make this easier for myself. e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'

This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.

The instruction runs a regexp on the line read and, if it matches, prints out the contents of the first set of brackets ($1).

You can also do this with multiple file names on the end, e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

score 5 · Accepted Answer

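If your version of grep supports it, you could use the -o option to print only the portion of any line that matches your regexp.

If not, then here's the best sed I could come up with: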

sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

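... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)

The problem with something like: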

sed -e 's/.*\([0-9]*\).*/&/'

.... or

sed -e 's/.*\([0-9]*\).*/\1/'

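... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).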
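A small sketch of the greedy behaviour, and of the negated-character-class workaround, on the question's sample line. The variable names are illustrative only.

```shell
line='abc12345xyz'

# The first .* is greedy and swallows the whole line, so the group
# [0-9]* is left matching the empty string.
empty=$(printf '%s\n' "$line" | sed 's/.*\([0-9]*\).*/\1/')

# Requiring at least one digit forces .* to give something back,
# but it gives back only a single digit.
last_digit=$(printf '%s\n' "$line" | sed 's/.*\([0-9][0-9]*\).*/\1/')

# [^0-9]* cannot run past the first digit, so the group gets all of them.
whole_number=$(printf '%s\n' "$line" | sed 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/')

printf 'greedy, optional digits : [%s]\n' "$empty"
printf 'greedy, one digit forced: [%s]\n' "$last_digit"
printf 'negated character class : [%s]\n' "$whole_number"
```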

score 5 · Accepted Answer

You can use awk with match() to access the captured group (the three-argument form of match() is a gawk extension):

$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345

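This tries to match the pattern abc[0-9]+xyz. If it does, it stores the captured pieces in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, where that substring begins (1 if it starts at the beginning of the string), a successful match is truthy and triggers the print action.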
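Where gawk is not available, a similar result can be sketched with the two-argument match(), which sets RSTART and RLENGTH in any POSIX awk. This is an alternative, not the answer's method; the offsets 3 and 6 are assumptions that hold only because the prefix abc and the suffix xyz are each three characters long.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# match() sets RSTART (start of the match) and RLENGTH (its length).
# Skip the 3-character "abc" and drop the 3-character "xyz".
myvalue=$(awk 'match($0, /abc[0-9]+xyz/) {
    print substr($0, RSTART + 3, RLENGTH - 6)
}' "$input")

echo "$myvalue"
rm -f "$input"
```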

With grep you can use a look-behind and a look-ahead:

$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345

$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345

This checks for the pattern [0-9]+ when it occurs between abc and xyz and prints just the digits.

score 2 · Accepted Answer

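perl is the cleanest syntax, but if you don't have perl (it isn't always available, I understand), then one way to use gawk with regex capture groups is the gensub function.

gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file

The output for the sample input file will be

12345

Note: gensub replaces the entire regex (between the //), so you need the .* before and after the group to strip the text around the number. Because the leading .* is greedy, the pattern also has to keep abc and xyz around ([0-9]+); with a bare .*([0-9]+).* the group would capture only the last digit.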

score 1 · Accepted Answer

If you want to select lines then strip out the bits you don't want:

egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'

It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.

You can see this in action here:

pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>

Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:

egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

score 1 · Accepted Answer

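The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.

Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But I still find this the most intuitive way to get the job done.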

$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT

$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz

$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512

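Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute them. Like so: grep --only-matching --extended-regexp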
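Spelled out in full, the two-pass pipeline with long options might look like the sketch below. It assumes GNU grep, since the long option names are not part of POSIX grep, and it reads the file directly rather than through cat. The sample file and variable names are illustrative.

```shell
# A shortened version of the sample file, including a line with two matches.
input=$(mktemp)
printf '%s\n' a abc12345xyz 'abc23451xyz asdf abc34512xyz' c > "$input"

# Pass 1 keeps each abc<digits>xyz occurrence; pass 2 keeps only the digits.
numbers=$(
    grep --only-matching --extended-regexp 'abc[0-9]+xyz' "$input" |
    grep --only-matching --extended-regexp '[0-9]+'
)

printf '%s\n' "$numbers"
rm -f "$input"
```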

score 0 · Accepted Answer

Why even need a match group?

gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'

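Let FS swallow both ends of the line.

If $2, the leftover not swallowed by FS, contains no non-numeric characters, that's the answer to print.

If you're extra cautious, confirm that the lengths of both $1 and $3 are zero.

** Answer edited after realizing that a zero-length $2 would trip up my previous solution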

score 0 · Accepted Answer

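There's a standard piece of code in the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.

If you're looking for ideas on how to obtain just the matched pieces, but for a complex regex that matches multiple times per line, or not at all, try this: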

mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) { 

    alnumstr = sprintf("%s%c", alnumstr , x) 
 }; 
 gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr) 
                       
                    # resulting str should be 44-chars long :
                    # all digits, non-vowels, equal sign =, and underscore _

 x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)

 } while ( --x );   # you can pick any level of precision you need.
                    # 10 chars randomly among the set is approx. 54-bits 
                    #
                    # i prefer this set over all ASCII being these 
                    # just about never require escaping 
                    # feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
                    #
                    # now you've made a random nonce that can be 
                    # inserted right in the middle of just about ANYTHING
                    # -- ASCII, Unicode, binary data -- (1) which will always fully
                    # print out, (2) has extremely low chance of actually
                    # appearing inside any real word data, and (3) even lower chance
                    # it accidentally alters the meaning of the underlying data.
                    # (so intentionally leaving them in there and 
                    # passing it along unix pipes remains quite harmless)
                    #
                    # this is essentially the lazy man's approach to making nonces
                    # that kinda-sorta have some resemblance to base64
                    # encoded, without having to write such a module (unless u have
                    # one for awk handy)


    regex1 = (..);  # build whatever regex you want here

    FS = OFS = nonceFS;

 } $0 ~ regex1 { 

    gsub(regex1, nonceFS "&" nonceFS); $0 = $0;  

                   # now you've essentially replicated what gawk patsplit( ) does,
                   # or gawk's split(..., seps) tracking 2 arrays one for the data
                   # in between, and one for the seps.
                   #
                   # via this method, that can all be done upon the entire $0,
                   # without any of the hassle (and slow downs) of 
                   # reading from associatively-hashed arrays,
                   # 
                   # simply print out all your even numbered columns
                   # those will be the parts of "just the match"

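If you also run another OFS = ""; $1 = $1; you no longer need the 4-argument split() or patsplit(), both of which are gawk-specific, to see what the regex seps were. The fields of $0 are now laid out in a data1-sep1-data2-sep2-.... pattern, while $0 itself looks EXACTLY the same as when you first read the line. A straight-up print will be byte-for-byte identical to printing the line immediately after reading it.

I once tested this to the extreme using a regex that represents valid UTF8 characters. mawk2 took maybe 30 seconds or so to process a 167MB text file with plenty of CJK unicode in it, all read into $0 at once, and to crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single ASCII character or a multi-byte UTF8 Unicode character.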
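A stripped-down sketch of the separator trick described above. It makes two simplifying assumptions that are not in the original: the separator is the fixed byte \034 (assumed never to occur in the input) rather than a random nonce, and the regex is passed in through a variable named re.

```shell
# Sample input with one line holding two matches.
input=$(mktemp)
printf '%s\n' a abc12345xyz 'abc23451xyz asdf abc34512xyz' c > "$input"

matches=$(awk -v re='abc[0-9]+xyz' '
    BEGIN { FS = "\034" }             # assumed absent from the input
    $0 ~ re {
        gsub(re, FS "&" FS)           # wrap every match in separators
        $0 = $0                       # force the record to be re-split
        for (i = 2; i <= NF; i += 2)  # even-numbered fields are the matches
            print $i
    }
' "$input")

printf '%s\n' "$matches"
rm -f "$input"
```

This prints each whole match (abc12345xyz and so on), not the captured digits; a second pass would still be needed to strip the abc and xyz.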

score -1 · Accepted Answer

You can do it with the shell

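while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* ) 
            # strip everything up to and including the last "abc"
            t="${line##*abc}"
            # strip everything from the first "xyz" onward
            echo "num is ${t%%xyz*}";;
    esac
done <"file"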

score -3 · Accepted Answer

For awk, I would use the following script:

/.*abc([0-9]+)xyz.*/ {
            print $0;   # prints the whole matching line, not only the digits
            next;
            }
            {
            # default, do nothing
            }

score -3 · Accepted Answer

gawk '/.*abc([0-9]+)xyz.*/' file

answered 2009-11-14T09:18:02.227
