bash - Heuristic ordering of text matches

Question

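I want to sort results by a score computed from several simultaneous text matches. The score should count partial matches against a set of search patterns, for example specific characters, bigrams, or prefixes.

I would like to use bash, awk, command-line tools, or a one-liner, without writing a separate script.

For example, suppose I want to sort words by how many of the 5 most common English bigrams [th, he, in, er, an] each word contains:

With the sample word list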

abashed
abashedly
abashedness
abhenry
abolisher
not

(taken from grep he /usr/share/dict/words | head -n5, with one non-matching word added).

I want the output

2 abolisher
1 abhenry
1 abashedness
1 abashedly
1 abashed
0 not

score 2 · Accepted Answer

For the specific problem of "sort by number of vowels", GNU awk is a good choice:

produce_words |
gawk '
  {
    vowels = gensub(/[^aeiouy]/, "", "g", tolower($0))
    count[$0] = length(vowels)
  }
  END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (word in count) print count[word], word
  }
'

For the PROCINFO magic, see "Using Predefined Array Scanning Orders with gawk" in the gawk manual.
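Both gensub and PROCINFO["sorted_in"] are gawk extensions. As a rough sketch for other awk implementations, the same vowel count can be obtained from the return value of gsub, which is the number of substitutions made, with the ordering delegated to sort. The function name count_vowels is made up for this illustration and is not part of the answer above.

```shell
# count_vowels: hypothetical helper name; reads one word per line on stdin
count_vowels() {
  awk '{
    w = tolower($0)
    n = gsub(/[aeiouy]/, "&", w)   # "&" puts each match back; n is how many vowels were found
    print n, $0
  }' | sort -rn
}

printf '%s\n' abashed abolisher not | count_vowels
```

Unlike the array-based version, this prints one line per input line, so duplicate words are not merged.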

score -2 · Accepted Answer

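awk can do it.

For each line, count how many of the patterns it matches, which may be more than one. Because a line can match several patterns, we can't use regex alternation (/in|er/) in the solution: an alternation matches at most once per line, no matter how many of its branches occur.

You can write it on one line, although it is quite repetitive.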
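A small check of the difference, using the word abolisher (which contains both he and er). This is only an illustration; the variable names alt and sep are arbitrary.

```shell
word=abolisher

# One rule with alternation: the rule fires once per line, so the score is capped at 1
alt=$(echo "$word" | awk '/he|er/{n++} END{print n+0}')

# Two separate rules: each one fires independently, so both matches are counted
sep=$(echo "$word" | awk '/he/{n++} /er/{n++} END{print n+0}')

echo "alternation: $alt, separate rules: $sep"
```

This should print "alternation: 1, separate rules: 2".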

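# Lowercase the input, give every word a starting score of 0 (so words that
# match nothing are still printed), add 1 per matching pattern, sort by score.
<10-words.txt tr A-Z a-z |
awk '{tot[$0]=0}
    /th/{tot[$0]++}
    /he/{tot[$0]++}
    /in/{tot[$0]++}
    /er/{tot[$0]++}
    /an/{tot[$0]++}
  END{for (i in tot) print tot[i],i }' |
sort -rn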
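If the repetition is a concern, one possible variation is to pass the patterns in as a single string and loop over them with a dynamic regex match. This is a sketch rather than part of the original answer; the function name score_bigrams and the variable pats are invented here.

```shell
# score_bigrams: hypothetical helper; $1 is a space-separated list of patterns
score_bigrams() {
  tr A-Z a-z |
  awk -v pats="$1" '
    BEGIN { n = split(pats, p, " ") }   # p[1..n] holds the individual patterns
    {
      score = 0
      for (k = 1; k <= n; k++)
        if ($0 ~ p[k]) score++          # one point per pattern that matches the line
      print score, $0
    }' |
  sort -rn
}

printf '%s\n' abashed abashedly abashedness abhenry abolisher not |
  score_bigrams "th he in er an"
```

On the sample list the first line of output should be "2 abolisher" and the last "0 not". Each pattern is treated as a regular expression, so a pattern containing characters such as . or * would need escaping. As written, every input line is printed, so duplicate words are not merged the way they are with the tot array.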
