The awk channel has a piece of standard code called "FindAllMatches", but it is still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
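For reference, that manual loop looks roughly like the following. This is only a minimal sketch, not the channel's actual FindAllMatches code; the wrapper name find_all_matches and the sample input are made up for illustration.

```shell
# Hypothetical helper: print every match of an ERE, one per line (POSIX awk).
find_all_matches() {
    awk -v re="$1" '{
        rest = $0
        # match() sets RSTART/RLENGTH; stop on no match or on an empty match
        while (match(rest, re) && RLENGTH > 0) {
            print substr(rest, RSTART, RLENGTH)      # the matched piece
            rest = substr(rest, RSTART + RLENGTH)    # continue after it
        }
    }'
}

printf 'a1b22c333\nnone here\n' | find_all_matches '[0-9]+'
# prints 1, 22 and 333 on separate lines; the second input line prints nothing
```

Because each pass rescans a fresh copy of the remainder, an anchored regex such as ^x would match again on every iteration, which is one more thing the manual loop has to watch out for.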
If you're looking for ideas on how to get just the matched pieces, for a complex regex that may match several times per line or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str is 54 chars long (10 digits + 42 consonants + 2) :
# all digits, non-vowels, equal sign =, and underscore _
n = length(alnumstr)
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(n*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 57-bits
#
# i prefer this set over all of ASCII since these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real-world data, and (3) even lower chance
# of accidentally altering the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta resemble base64-encoded strings,
# without having to write such a module (unless you have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
}'
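If you also run another OFS = ""; $1 = $1; there's no longer any need for 4-argument split() or patsplit(), both of which are gawk-specific, just to see what the regex seps were: all of $0's fields are now in a data1-sep1-data2-sep2-... pattern, while $0 itself looks EXACTLY the same as when you first read in the line. A straight-up print would be byte-for-byte identical to printing immediately upon reading.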
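Putting the pieces together, here is a minimal runnable sketch of the nonce approach, under a few assumptions that are not in the original: the wrapper name split_by_nonce is hypothetical, the regex is passed in with -v, the alphabet is written out literally instead of being built in a loop, and plain awk stands in for mawk/gawk.

```shell
# Hypothetical wrapper: show the matches of an ERE, then the rebuilt line.
split_by_nonce() {
    awk -v regex1="$1" '
    BEGIN {
        srand()
        # digits, consonants, "=" and "_": none of them is a regex metacharacter
        alnumstr = "0123456789=BCDFGHJKLMNPQRSTVWXYZ_bcdfghjklmnpqrstvwxyz"
        n = length(alnumstr)
        for (x = 0; x < 10; x++)
            nonceFS = nonceFS substr(alnumstr, 1 + int(n * rand()), 1)
        FS = OFS = nonceFS
    }
    $0 ~ regex1 {
        gsub(regex1, nonceFS "&" nonceFS)   # wrap every match in the nonce
        $0 = $0                             # re-split the record on the nonce
        for (i = 2; i <= NF; i += 2)        # even fields are the matches
            print "match: " $i
        OFS = ""; $1 = $1                   # rebuild $0 with the nonce gone
        print "line: " $0
    }'
}

printf 'id=42 ref=7\n' | split_by_nonce '[0-9]+'
# match: 42
# match: 7
# line: id=42 ref=7
```

The odd-numbered fields hold the text between matches, so nothing is lost; adjacent matches simply produce an empty odd field in between.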
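I once tested this to the extreme using a regex that represents valid UTF-8 characters. It took mawk2 maybe 30 seconds or so to process a 167 MB text file full of CJK Unicode, read into $0 all at once, and run this split logic, resulting in an NF of around 175,000,000, with each field being a single character, either ASCII or multi-byte UTF-8.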