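The awk channel has a standard snippet called "FindAllMatches", but it's still very manual: literally just a long loop of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to get just the matched pieces, but with a complex regex that matches multiple times per line, or not at all, try this: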
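For comparison, here is a rough sketch of what that manual loop tends to look like. This is not the channel's actual FindAllMatches code; the function name `find_all_matches` and the sample input are made up for illustration, and only POSIX awk features are assumed.

```shell
# Hypothetical helper (not the channel's code): print every match of an
# extended regex, one per line, via the manual match()/substr() loop.
find_all_matches() {
    awk -v re="$1" '{
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost-longest match
        while (length(rest) > 0 && match(rest, re)) {
            if (RLENGTH == 0) {          # empty match: step forward one char
                rest = substr(rest, 2)
                continue
            }
            print substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)   # chop off what was consumed
        }
    }'
}

printf 'order 66 shipped 1024 units in 3 boxes\nnothing here\n' |
    find_all_matches '[0-9]+'
```

That prints 66, 1024 and 3 on separate lines, and nothing for the second line. It works, but every match costs a match() call plus two substr() calls on a shrinking copy of the line.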
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) { 
    alnumstr = sprintf("%s%c", alnumstr , x) 
 }; 
 gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr) 
                       
                    # resulting str should be 54 chars long :
                    # all digits, non-vowel letters, equal sign =, and underscore _
 x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(length(alnumstr)*rand()), 1)
 } while ( --x );   # you can pick any level of precision you need.
                    # 10 chars randomly among the 54-char set is approx. 57 bits 
                    #
                    # i prefer this set over all of ASCII because these chars
                    # just about never require escaping 
                    # feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
                    #
                    # now you've made a random nonce that can be 
                    # inserted right in the middle of just about ANYTHING
                    # -- ASCII, Unicode, binary data -- (1) which will always fully
                    # print out, (2) has extremely low chance of actually
                    # appearing inside any real-world data, and (3) even lower chance
                    # it accidentally alters the meaning of the underlying data.
                    # (so intentionally leaving them in there and 
                    # passing it along unix pipes remains quite harmless)
                    #
                    # this is essentially the lazy man's approach to making nonces
                    # that kinda-sorta have some resemblance to base64
                    # encoded, without having to write such a module (unless u have
                    # one for awk handy)
    regex1 = (..);  # build whatever regex you want here
    FS = OFS = nonceFS;
 } $0 ~ regex1 { 
    gsub(regex1, nonceFS "&" nonceFS); $0 = $0;  
                   # now you've essentially replicated what gawk patsplit( ) does,
                   # or gawk's split(..., seps) tracking 2 arrays one for the data
                   # in between, and one for the seps.
                   #
                   # via this method, that can all be done upon the entire $0,
                   # without any of the hassle (and slow downs) of 
                   # reading from associatively-hashed arrays,
                   # 
                   # simply print out all your even numbered columns
                   # those will be the parts of "just the match"
    for (i = 2; i <= NF; i += 2) print $i
 }'
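Here is a compact, runnable take on the same idea with a concrete regex plugged in. A few things are my own assumptions rather than part of the code above: the wrapper name `only_matches`, passing the regex in through `-v`, the sample input, and writing the alphabet out as a literal string instead of generating it with the sprintf loop.

```shell
# Hypothetical wrapper: print only the matched parts of each line.
only_matches() {
    awk -v regex1="$1" 'BEGIN {
        srand()
        # digits, consonants, "=" and "_": none of them is special in a
        # regex or in a gsub() replacement string
        abc = "0123456789=_BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"
        for (i = 0; i < 10; i++)
            nonceFS = nonceFS substr(abc, 1 + int(length(abc) * rand()), 1)
        FS = OFS = nonceFS
    }
    $0 ~ regex1 {
        gsub(regex1, nonceFS "&" nonceFS)   # wrap every match in the nonce
        $0 = $0                             # re-split the line on the nonce
        for (i = 2; i <= NF; i += 2)        # even fields are the matches
            print $i
    }'
}

printf 'a=1, bb=22\nno pairs here\nccc=333\n' | only_matches '[a-z]+=[0-9]+'
```

This prints a=1, bb=22 and ccc=333, one per line. The odd-numbered fields hold whatever sat between the matches, so nothing from the line is lost.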
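If you also run another OFS = ""; $1 = $1; then you no longer need the 4-argument split() or patsplit(), both of which are gawk-specific, just to see what the regex seps were. The fields of $0 are now laid out in a data1-sep1-data2-sep2-... pattern, while $0 itself looks EXACTLY the same as when you first read in the line: a straight-up print will be byte-for-byte identical to printing immediately upon reading.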
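A small sketch to check that claim, under the same assumptions as before (the function name `restore_line`, the `-v` regex and the literal alphabet are mine, for illustration only). It prints NF followed by the rebuilt $0, so you can see that the line is back to its original bytes while still being split into data/sep fields.

```shell
# Hypothetical demo: split on the nonce, then rebuild $0 without it.
restore_line() {
    awk -v regex1="$1" 'BEGIN {
        srand()
        abc = "0123456789=_BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"
        for (i = 0; i < 10; i++)
            nonceFS = nonceFS substr(abc, 1 + int(length(abc) * rand()), 1)
        FS = nonceFS
    }
    {
        if (gsub(regex1, nonceFS "&" nonceFS))
            $0 = $0                 # fields: data1, sep1, data2, sep2, ...
        OFS = ""; $1 = $1           # rejoin the fields with nothing in between
        print NF ": " $0            # $0 no longer contains the nonce
    }'
}

printf 'a=1, bb=22\nno pairs here\n' | restore_line '[a-z]+=[0-9]+'
```

The first line comes back as `5: a=1, bb=22` (empty data, a=1, the ", " in between, bb=22, empty data) and the second as `1: no pairs here`. This only holds as long as the nonce never shows up in the input itself, which is the whole point of making it random.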
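I once stress-tested this to the extreme using a regex that represents valid UTF-8 characters. mawk2 took maybe 30 seconds or so to process a 167 MB text file with plenty of CJK Unicode all over, read all at once into $0, and then run this split logic, resulting in an NF of around 175,000,000, with each field being a single character, either ASCII or a multi-byte UTF-8 one.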