Answer #1 - string/variable based solution
Assuming the desired string is stored in the variable str, here's one awk solution:
awk -v str="${str}" '
BEGIN { num = split(str,token,"")             # split str into an array of single letter/number elements
        for ( i=1; i<=num; i++ ) {            # get a count of occurrences of each letter/number
            count[token[i]]++
        }
        min = 10000000
        for ( i in count ) {
            min = count[i]<min?count[i]:min   # keep track of the lowest/minimum count
        }
        for ( i=1; i<=num; i++ ) {            # loop through array of letters/numbers
            if ( min == count[token[i]] ) {   # for the first letter/number we find where count = min
                print token[i], min           # print the letter/number and count and
                break                         # then break out of our loop
            }
        }
}'
Running the above against various sample strings:
++++++++++++++++ str = aa
a 2
++++++++++++++++ str = aa1
1 1
++++++++++++++++ str = aa1c1deef
c 1
++++++++++++++++ str = abcdeeddAbac
A 1
++++++++++++++++ str = abcdeeddAbacA
a 2
++++++++++++++++ str = abcdeeddAbacAabc
e 2
++++++++++++++++ str = axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
g 1
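One caveat with the solution above: split(str, token, "") with an empty separator splits into individual characters in gawk and mawk, but POSIX leaves that behavior undefined. As a hedge, here is a sketch of a strictly portable variant that walks the string with substr() instead; it scans positions left to right, so "first matching character" is deterministic (the names are my own, not from the original answer):

```shell
# Portable sketch: avoids empty-separator split(), which POSIX leaves undefined.
str='aa1c1deef'                              # sample string from the table above
result=$(awk -v str="$str" '
BEGIN {
    len = length(str)
    for (i = 1; i <= len; i++)               # count occurrences per character
        count[substr(str, i, 1)]++
    min = len + 1
    for (c in count)                         # find the minimum count
        if (count[c] < min) min = count[c]
    for (i = 1; i <= len; i++) {             # first character (in string order) with that count
        c = substr(str, i, 1)
        if (count[c] == min) { print c, min; break }
    }
}')
echo "$result"
```

For the sample string aa1c1deef this prints c 1, matching the sample runs above.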
Answer #2 - file/array based solution
Per OP's comment on RavinderSingh13's answer, re: a very large string residing in a file, and assuming the file's name is giga.txt ...

... we should be able to make a few minor modifications to the previous awk solution, like so:
awk '
BEGIN { RS = "\0" }                           # address files with no cr/lf
{ num = split($0,token,"")                    # split line/$0 into an array of single letter/number elements
  for ( i=1; i<=num; i++ ) {                  # get a count of occurrences of each letter/number
      all[NR i] = token[i]                    # token array is for current line/$0 while all array is for entire file
      count[token[i]]++
  }
}
END { min = 10000000
      for ( i in count ) {
          min = count[i]<min?count[i]:min     # find the lowest/minimum count
      }
      for ( i in all ) {                      # loop through array of letters/numbers
          if ( min == count[all[i]] ) {       # for the first letter/number we find where count = min
              print all[i], min               # print the letter/number and count and
              break                           # then break out of our loop
          }
      }
}
' giga.txt
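A note on RS = "\0": this relies on the input containing no NUL bytes (so the whole file becomes one record) and on the awk in use accepting a NUL record separator, which gawk does but other awks may not. A sketch of a variant that stays line-oriented instead, under the assumption that newlines themselves should not be counted as characters:

```shell
# Sketch: same counting idea, processed line by line (no special RS needed).
# A global position counter keeps "first occurrence" meaningful across lines.
# Replace the printf sample with `< giga.txt` to run against the real file.
result=$(printf 'aa1c1deef\n' | awk '
{
    num = split($0, token, "")               # gawk/mawk: empty FS splits into characters
    for (i = 1; i <= num; i++) {
        pos++                                # global position across the whole file
        all[pos] = token[i]
        count[token[i]]++
    }
}
END {
    min = pos + 1
    for (c in count)
        if (count[c] < min) min = count[c]
    for (i = 1; i <= pos; i++)               # scan in file order, so "first" is deterministic
        if (count[all[i]] == min) { print all[i], min; break }
}')
echo "$result"
```

Scanning all[] by index (i = 1..pos) rather than with for (i in all) also sidesteps awk's unspecified array iteration order.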
Putting the longer str sample into giga.txt:
$ cat giga.txt
axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
Running the above awk solution against giga.txt gives us:
$ awk '....' giga.txt
g 1
Answer #3 - file/substr() based solution
OP provided more details on how the "large" data file was generated:
$ ls -lR / > giga.txt         # I hit ^C after ~20 secs
$ sed "s/\(.\)/\1\n/g" giga.txt | grep -o '[a-zA-Z0-9]' | tr -d '\012' > newgiga.txt   # remove all but letters and numbers
This gave me a 14-million-character file (newgiga.txt).
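As an aside, the multi-step sed/grep/tr pipeline above can likely be collapsed into a single tr call: -c complements the character set and -d deletes it, leaving only alphanumerics. (Assumption: [:alnum:] matches the intended letters/digits; under non-C locales it may admit additional characters.)

```shell
# One-step alternative for building newgiga.txt (sketch):
#   tr -cd '[:alnum:]' < giga.txt > newgiga.txt
# Demonstrated on an inline sample so the snippet is self-contained:
result=$(printf 'ls -l /etc\nfile1.txt 2048\n' | tr -cd '[:alnum:]')
echo "$result"
```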
I ran a few timing tests of the various solutions, plus a new awk solution (see below), against the 14-million-character file, and came up with the following timings:
- 15 seconds for the file/array based awk solution (see my previous answer, above)
- 25 seconds for OP's sed/grep/echo/uniq/tr/sort answer
- over 4 minutes for RavinderSingh13's awk solution (actually hit ^C after 4 minutes)
- 6 seconds for the new file/substr() based awk solution (see below)
NOTE: For all solutions run against my particular newgiga.txt file, the final answer was the letter Z (365 occurrences).
By replacing the split/array code with a series of substr() calls, and with a minor change to how the all array is indexed, I was able to reduce the run time of the previous file/array based awk solution by about 60%:
awk '
BEGIN { RS = "\0" }
{ len = length($0)
  for ( i=1; i<=len; i++ ) {                  # get a count of occurrences of each letter/number
      token = substr($0,i,1)
      a++
      all[a] = token                          # token is for current line/$0 while all array is for entire file
      count[token]++
  }
}
END { min = 10000000
      for ( i in count ) {
          min = count[i]<min?count[i]:min     # find the lowest/minimum count
      }
      for ( i in all ) {                      # loop through array of letters/numbers
          if ( min == count[all[i]] ) {       # for the first letter/number we find where count = min
              print all[i], min               # print the letter/number and count and
              break                           # break out of our loop
          }
      }
}
' newgiga.txt
NOTE: I honestly didn't expect the substr() calls to be faster than the split/array approach, but I'm guessing awk has a very fast built-in method for processing substr() calls.
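One further (untimed) sketch: the all array above stores every one of the 14 million characters just to recover ordering at the end. Remembering only the first position of each distinct character should shrink memory to the size of the alphabet while making the "first character with the minimum count" answer deterministic. The names first, best, and ans are my own, not from the answers above:

```shell
# Sketch: avoid the big all[] array; keep counts plus first-occurrence positions.
result=$(printf 'abcdeeddAbacAabc\n' | awk '
{
    len = length($0)
    for (i = 1; i <= len; i++) {
        c = substr($0, i, 1)
        pos++                                # global position across the file
        if (!(c in first)) first[c] = pos    # remember first occurrence only
        count[c]++
    }
}
END {
    for (c in count)                         # find the minimum count
        if (min == "" || count[c] < min) min = count[c]
    best = pos + 1
    for (c in count)                         # earliest first-occurrence among the ties
        if (count[c] == min && first[c] < best) { best = first[c]; ans = c }
    print ans, min
}')
echo "$result"
```

For the sample string abcdeeddAbacAabc this prints e 2, agreeing with the sample runs in Answer #1.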